Describe the bug
I converted the TPC-H CSV files to Parquet and found that a simple equality predicate on a string column matches no rows, even though GROUP BY and LIKE show that matching rows exist.
> create external table customer stored as parquet location '/mnt/bigdata/tpch-sf1000-parquet/customer';
0 rows in set. Query took 0.000 seconds.
> SELECT c_mktsegment, COUNT(*) FROM customer GROUP BY c_mktsegment;
+--------------+-----------------+
| c_mktsegment | COUNT(UInt8(1)) |
+--------------+-----------------+
| HOUSEHOLD    | 30003565        |
| BUILDING     | 29998146        |
| FURNITURE    | 29999758        |
| MACHINERY    | 30003128        |
| AUTOMOBILE   | 29995355        |
+--------------+-----------------+
5 rows in set. Query took 0.758 seconds.
> SELECT concat('[', c_mktsegment, ']') from customer limit 5;
+------------------------------------------+
| concat(Utf8("["),c_mktsegment,Utf8("]")) |
+------------------------------------------+
| [MACHINERY]                              |
| [HOUSEHOLD]                              |
| [BUILDING]                               |
| [AUTOMOBILE]                             |
| [HOUSEHOLD]                              |
+------------------------------------------+
5 rows in set. Query took 0.028 seconds.
> SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.398 seconds.
> SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'HOUSEHOLD';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.386 seconds.
> SELECT COUNT(*) FROM customer WHERE c_mktsegment LIKE 'HOUSEHOLD';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 30003565        |
+-----------------+
1 row in set. Query took 0.663 seconds.
The odd thing is that I see the same behavior when querying from Spark, so I guess something is wrong with the Parquet file.
scala> spark.sql("SELECT c_mktsegment, COUNT(*) FROM customer GROUP BY c_mktsegment").show
+------------+--------+
|c_mktsegment|count(1)|
+------------+--------+
|   MACHINERY|30003128|
|  AUTOMOBILE|29995355|
|    BUILDING|29998146|
|   HOUSEHOLD|30003565|
|   FURNITURE|29999758|
+------------+--------+
scala> spark.sql("SELECT COUNT(*) FROM customer WHERE c_mktsegment = 'BUILDING'").show
+--------+
|count(1)|
+--------+
|       0|
+--------+
To Reproduce
Run the above queries in datafusion-cli.
Expected behavior
Simple predicates should work.
Additional context
The parquet files were generated by the conversion utility in the tpch benchmarks.