While reading a Parquet file into RecordBatches using ParquetFileArrowReader (the file had row groups 100,000 rows long, and I used a batch size of 60,000), after successfully reading 300,000 rows I started seeing this error:
ParquetError("Parquet error: Not all children array length are the same!")
Upon investigation, I found that when reading with ParquetFileArrowReader, if the input Parquet file has multiple row groups and a batch happens to end exactly at the end of a row group for an Int or Float column, no subsequent row groups are read.
Visually:
+-----+
| RG1 |
|     |
+-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
A reproducer is attached. The ParquetFileArrowReader should read all 20 values regardless of the batch size. However, with batch sizes such as 5 or 3 (whose batches can end exactly on a boundary between row groups), not all of the rows are read.
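To see why those particular batch sizes trigger the bug, here is a small self-contained sketch of the boundary arithmetic. It assumes the reproducer's layout (20 rows in 4 row groups of 5 rows each, so interior row group boundaries fall after rows 5, 10, and 15); hits_row_group_boundary is a hypothetical helper for illustration, not part of the attached code:

```rust
// Row-group layout from the reproducer: 20 rows in 4 row groups of 5 rows each,
// so row groups end after rows 5, 10, and 15 (the last one ends at end-of-file).
// A batch size triggers the bug when some full batch ends exactly on one of
// those interior boundaries.
fn hits_row_group_boundary(batch_size: usize, group_len: usize, total_rows: usize) -> bool {
    (1..)
        .map(|i| i * batch_size)           // row positions where full batches end
        .take_while(|&end| end < total_rows)
        .any(|end| end % group_len == 0)   // lands exactly on a row-group boundary
}

fn main() {
    for &bs in &[100, 7, 5, 3] {
        println!(
            "batch_size {:3} ends on a row-group boundary: {}",
            bs,
            hits_row_group_boundary(bs, 5, 20)
        );
    }
}
```

Batch sizes 100 and 7 never end a batch on rows 5, 10, or 15, while 5 and 3 do (at row 5 and row 15 respectively), matching the observed good and bad cases.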
To run the reproducer, decompress the attachment parquet_file_arrow_reader.zip and run cargo run.
The output is as follows:
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 5
The expected output is as follows (should always read 20 rows, regardless of the batch size):
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 20
Workaround
Use a batch size whose batches do not end exactly on row group boundaries.
Reporter: Andrew Lamb / @alamb
Assignee: Andrew Lamb / @alamb
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-9790. Please see the migration documentation for further details.