Skip to content

Implement write.parquet.row-group-size-bytes in the pyarrow writer#12

Open
stephrb wants to merge 1 commit into
mainfrom
sbuck/implement-row-group-size-bytes
Open

Implement write.parquet.row-group-size-bytes in the pyarrow writer#12
stephrb wants to merge 1 commit into
mainfrom
sbuck/implement-row-group-size-bytes

Conversation

@stephrb
Copy link
Copy Markdown
Collaborator

@stephrb stephrb commented Jun 1, 2026

The pyiceberg writer has historically ignored
write.parquet.row-group-size-bytes (logging 'not implemented') and used only write.parquet.row-group-limit (rows). For wide tables that means a single row group ends up at gigabytes — e.g. 337 cols × 1,048,576 default rows ≈ 1.7 GiB uncompressed per row group — which drives the polars / pyarrow reader's decode peak into the tens of GiB on production reads.

Now write_file resolves row_group_size as
min(row_group_limit, row_group_size_bytes / bytes_per_row), where bytes_per_row is approximated from the in-memory arrow_table's nbytes. This matches Spark / parquet-mr 'whichever limit fires first' semantics and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB) actually take effect.

The pyiceberg writer has historically ignored
write.parquet.row-group-size-bytes (logging 'not implemented') and used
only write.parquet.row-group-limit (rows). For wide tables that means a
single row group ends up at gigabytes — e.g. 337 cols × 1,048,576 default
rows ≈ 1.7 GiB uncompressed per row group — which drives the polars /
pyarrow reader's decode peak into the tens of GiB on production reads.

Now write_file resolves row_group_size as
min(row_group_limit, row_group_size_bytes / bytes_per_row), where
bytes_per_row is approximated from the in-memory arrow_table's nbytes.
This matches Spark / parquet-mr 'whichever limit fires first' semantics
and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB)
actually take effect.
@stephrb stephrb force-pushed the sbuck/implement-row-group-size-bytes branch from a707099 to 571647d Compare June 2, 2026 03:57
@stephrb stephrb self-assigned this Jun 2, 2026
@stephrb stephrb requested review from rj-imc and wallacms June 2, 2026 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant