Describe the bug
When inferring a schema, list_all_files uses an object store to list the files. No sorting is applied to the result.
When the object store is a LocalFileSystem, there is no guarantee of any file ordering (the list returned on macOS is ordered differently from the one returned on Windows). This means that the inferred schema can differ for the same set of files.
We contacted the object store maintainers (apache/arrow-rs-object-store#178), who pointed out that the fix should be implemented in the caller of the method, by applying a sort of some kind, to keep results consistent across file systems.
To Reproduce
Given two Parquet files in the filesystem with the following schemas:
{
  "type" : "record",
  "name" : "root",
  "fields" : [ {
    "name" : "year",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "code",
    "type" : [ "null", "long" ],
    "default" : null
  } ]
}
{
  "type" : "record",
  "name" : "root",
  "fields" : [ {
    "name" : "description",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "code",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "year",
    "type" : [ "null", "int" ],
    "default" : null
  } ]
}
and executing:
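use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn infer_schema() {
    // Location of the Parquet files described above.
    let path = ListingTableUrl::parse("./files").unwrap();
    let ctx = SessionContext::new();
    let state = ctx.state();
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));
    let schema = options.infer_schema(&state, &path).await.unwrap();
    // Print the field names in the order of the inferred schema.
    schema.fields.iter().for_each(|field| println!("{0}", field.name()));
}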
the result on macOS Ventura:
the first file picked up was file3.parquet
and on Windows:
the first file picked up was file1.parquet
Expected behavior
The same schema should be inferred regardless of the OS on which the code runs. A sort should be enforced, or at least it should be possible to pass a sort function.
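As a rough illustration of the requested behavior, the sketch below sorts a list of file paths before they would be used for schema inference. It uses only the standard library; `sort_listed_files` and `sort_listed_files_by` are hypothetical helpers introduced here for illustration and are not part of DataFusion or object_store, and the paths are modeled as plain strings, which is an assumption.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: returns the listed paths in lexicographic order,
/// so the result does not depend on the order the file system produced.
fn sort_listed_files(mut files: Vec<String>) -> Vec<String> {
    files.sort_unstable();
    files
}

/// Hypothetical variant: lets the caller supply the comparison function,
/// matching the "possibility of passing a sort function" alternative.
fn sort_listed_files_by<F>(mut files: Vec<String>, compare: F) -> Vec<String>
where
    F: FnMut(&String, &String) -> Ordering,
{
    files.sort_by(compare);
    files
}
```

With either helper, the listing orders reported above for macOS and Windows would collapse to the same sequence, so the first file used for inference would be the same on both systems.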
Additional context
No response