While reading a Parquet file into RecordBatches using ParquetFileArrowReader (the file had row groups 100,000 rows long, and I used a batch size of 60,000), after successfully reading 300,000 rows I started seeing this error:
ParquetError("Parquet error: Not all children array length are the same!")
Upon investigation, I found that when reading with ParquetFileArrowReader, if the input Parquet file has multiple row groups and a batch happens to end exactly at the end of a row group for an Int or Float column, no subsequent row groups are read.
Visually:
+-----+
| RG1 |
|     |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
A reproducer is attached. The ParquetFileArrowReader should read all 20 values regardless of the batch size. However, with batch sizes such as 5 or 3 (whose batches can end exactly on a boundary between row groups), not all of the rows are read.
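To see why those particular batch sizes trigger the bug, here is a small self-contained sketch of the boundary arithmetic. It assumes the reproducer's layout (20 rows in 4 row groups of 5 rows each, so interior row group boundaries fall after rows 5, 10, and 15); hits_row_group_boundary is a hypothetical helper for illustration, not part of the attached code:

```rust
// Row-group layout from the reproducer: 20 rows in 4 row groups of 5 rows each,
// so row groups end after rows 5, 10, and 15 (the last one ends at end-of-file).
// A batch size triggers the bug when some full batch ends exactly on one of
// those interior boundaries.
fn hits_row_group_boundary(batch_size: usize, group_len: usize, total_rows: usize) -> bool {
    (1..)
        .map(|i| i * batch_size)           // row positions where full batches end
        .take_while(|&end| end < total_rows)
        .any(|end| end % group_len == 0)   // lands exactly on a row-group boundary
}

fn main() {
    for &bs in &[100, 7, 5, 3] {
        println!(
            "batch_size {:3} ends on a row-group boundary: {}",
            bs,
            hits_row_group_boundary(bs, 5, 20)
        );
    }
}
```

Batch sizes 100 and 7 never end a batch on rows 5, 10, or 15, while 5 and 3 do (at row 5 and row 15 respectively), matching the observed good and bad cases.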
To run the reproducer, decompress the attachment parquet_file_arrow_reader.zip and run cargo run.
The output is as follows:
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 5
The expected output is as follows (should always read 20 rows, regardless of the batch size):
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 20
Workaround
Use a batch size whose batches do not end exactly on row group boundaries.
Reporter: Andrew Lamb / @alamb
Assignee: Andrew Lamb / @alamb
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-9790. Please see the migration documentation for further details.