Cannot query parquet files generated by Apache Spark from datafusion-cli #1648

**Describe the bug**

I have a data set created by Apache Spark and tried to query it from the DataFusion CLI. The query failed with an error saying that a Parquet file was corrupt.

```
❯ CREATE EXTERNAL TABLE store_sales STORED AS PARQUET LOCATION 'store_sales.dat';
0 rows in set. Query took 0.002 seconds.
❯ select count(*) from store_sales;
Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))
```

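I added some debug logging and found that it was actually trying to read the following file, which is not a Parquet file. It is a `.crc` checksum sidecar file that Spark (via Hadoop's checksummed file system) writes next to each data file.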

```
store_sales.dat/.part-00005-5142b177-bacb-499d-b14f-12de4b94d9d9-c000.snappy.parquet.crc
```
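For context on the "Corrupt footer" message: an unencrypted Parquet file ends with the 4-byte magic `PAR1`, and a `.crc` file does not. The following is a minimal standalone sketch of that check using only the Rust standard library. It is not the code DataFusion or the `parquet` crate actually runs, and the function name `ends_with_parquet_magic` is hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Magic bytes that terminate an unencrypted Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns true if the last four bytes of the file
/// are the Parquet magic. Files too short to hold the magic return false.
fn ends_with_parquet_magic(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    if len < PARQUET_MAGIC.len() as u64 {
        return Ok(false);
    }
    // Position the cursor four bytes before the end and read the tail.
    file.seek(SeekFrom::End(-4))?;
    let mut tail = [0u8; 4];
    file.read_exact(&mut tail)?;
    Ok(&tail == PARQUET_MAGIC)
}
```

Running this against the `.crc` file above would be expected to return `false`, which is consistent with the reader rejecting the footer.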

**To Reproduce**
Create a non-Parquet file with a non-Parquet extension and put it in a directory along with some valid Parquet files. Then register that directory as an external table and query it.

**Expected behavior**
DataFusion should only try to read files with the `.parquet` file extension.
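To illustrate the expected behavior, here is a minimal sketch of extension-based filtering when listing a table directory. It uses only the Rust standard library and is not DataFusion's actual listing code. The names `has_parquet_extension` and `list_parquet_files` are hypothetical, and the listing is assumed to be non-recursive.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: true only when the final extension is `parquet`.
/// For `.part-...snappy.parquet.crc` the final extension is `crc`,
/// so the checksum sidecar file is rejected.
fn has_parquet_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("parquet"))
        .unwrap_or(false)
}

/// Hypothetical helper: lists regular files directly inside `dir`
/// whose extension is `parquet`, in sorted order.
fn list_parquet_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && has_parquet_extension(&path) {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

With a filter like this, calling `list_parquet_files(Path::new("store_sales.dat"))` would skip the `.crc` file and any marker files without an extension, so only the data files reach the Parquet reader.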

**Additional context**
None

