tpch conversion generates parquet files that cannot be queried on simple predicates

**Describe the bug**

I converted the tpch CSV files to Parquet and found that a simple equality predicate on a string column returns zero rows, even though the values are clearly present in the column.

```
> create external table customer stored as parquet location '/mnt/bigdata/tpch-sf1000-parquet/customer';
0 rows in set. Query took 0.000 seconds.

> SELECT c_mktsegment, COUNT(*) FROM customer GROUP BY c_mktsegment;
+--------------+-----------------+
| c_mktsegment | COUNT(UInt8(1)) |
+--------------+-----------------+
| HOUSEHOLD    | 30003565        |
| BUILDING     | 29998146        |
| FURNITURE    | 29999758        |
| MACHINERY    | 30003128        |
| AUTOMOBILE   | 29995355        |
+--------------+-----------------+
5 rows in set. Query took 0.758 seconds.

> SELECT concat('[', c_mktsegment, ']') from customer limit 5;
+------------------------------------------+
| concat(Utf8("["),c_mktsegment,Utf8("]")) |
+------------------------------------------+
| [MACHINERY]                              |
| [HOUSEHOLD]                              |
| [BUILDING]                               |
| [AUTOMOBILE]                             |
| [HOUSEHOLD]                              |
+------------------------------------------+
5 rows in set. Query took 0.028 seconds.

> SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.398 seconds.

> SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'HOUSEHOLD';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.386 seconds.

> SELECT COUNT(*) FROM customer WHERE c_mktsegment LIKE 'HOUSEHOLD';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 30003565        |
+-----------------+
1 row in set. Query took 0.663 seconds.
```

The odd thing is that I see the same behavior when querying from Spark, so I guess something is wrong with the Parquet files themselves.

```
scala> spark.sql("SELECT c_mktsegment, COUNT(*) FROM customer GROUP BY c_mktsegment").show
+------------+--------+
|c_mktsegment|count(1)|
+------------+--------+
|   MACHINERY|30003128|
|  AUTOMOBILE|29995355|
|    BUILDING|29998146|
|   HOUSEHOLD|30003565|
|   FURNITURE|29999758|
+------------+--------+


scala> spark.sql("SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING'").show
+--------+
|count(1)|
+--------+
|       0|
+--------+
```
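
One possible explanation, which this report does not confirm, is that the files carry incorrect min/max column statistics. Readers that push an `=` predicate down to the Parquet scan may skip a row group whose statistics say the value cannot be present, while `GROUP BY` and possibly `LIKE` still read every row. The sketch below only illustrates that mechanism; `RowGroupStats`, `may_contain`, and `count_equal` are hypothetical names, not DataFusion or Spark APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group: min/max statistics plus the stored values."""
    min_value: str
    max_value: str
    rows: List[str]


def may_contain(stats: RowGroupStats, value: str) -> bool:
    # Statistics-based pruning: the value can only be present if it lies within [min, max].
    return stats.min_value <= value <= stats.max_value


def count_equal(row_groups: List[RowGroupStats], value: str, use_stats: bool = True) -> int:
    total = 0
    for rg in row_groups:
        if use_stats and not may_contain(rg, value):
            continue  # row group is skipped without reading its rows
        total += sum(1 for v in rg.rows if v == value)
    return total


rows = ["MACHINERY", "HOUSEHOLD", "BUILDING", "AUTOMOBILE", "HOUSEHOLD"]

# Correct statistics: min and max are computed from the actual values.
good = [RowGroupStats(min(rows), max(rows), rows)]

# Deliberately wrong statistics (an assumption for illustration only).
bad = [RowGroupStats("MACHINERY", "MACHINERY", rows)]

print(count_equal(good, "BUILDING"))                  # 1
print(count_equal(bad, "BUILDING"))                   # 0, row group pruned
print(count_equal(bad, "BUILDING", use_stats=False))  # 1, full scan finds the row
```

If this is what is happening, disabling Parquet filter pushdown in the reader, or inspecting the row group statistics for `c_mktsegment`, should show whether the written statistics disagree with the data.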


**To Reproduce**
Run the above queries in datafusion-cli against the converted Parquet files.

**Expected behavior**
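Simple equality predicates on string columns should return the matching rows. For example, `c_mktsegment = 'BUILDING'` should count 29998146 rows, matching the `GROUP BY` result above.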

**Additional context**
The parquet files were generated by the conversion utility in the tpch benchmarks.


tpch conversion generates parquet files that cannot be queried on simple predicates #863
