feat: enforce physical column ordering in Parquet files for two-GET streaming merge #6281
Merged
g-talbot merged 4 commits into gtt/docs-claude-md on Apr 10, 2026
Conversation
Sort schema columns are written first (in their configured sort order), followed by all remaining data columns in alphabetical order. This physical layout enables a two-GET streaming merge during compaction: the footer GET provides the schema and offsets, then a single streaming GET from the start of the row group delivers sort columns first — allowing the compactor to compute the global merge order before data columns arrive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
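The layout rule itself is mechanical. A minimal sketch of the idea, assuming arrow's `RecordBatch` (the real `reorder_columns()` is a method on `ParquetWriter`; the free function and error handling here are illustrative):

```rust
use std::sync::Arc;

use arrow::datatypes::{Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Sort-schema columns first (in configured order), then all remaining
/// columns in alphabetical order.
fn reorder_columns(batch: &RecordBatch, sort_fields: &[&str]) -> Result<RecordBatch, ArrowError> {
    let schema = batch.schema();
    // Indices of the sort columns, in their configured sort order.
    let mut indices: Vec<usize> = sort_fields
        .iter()
        .filter_map(|name| schema.index_of(name).ok())
        .collect();
    // Every other column, sorted by name.
    let mut rest: Vec<usize> = (0..schema.fields().len())
        .filter(|idx| !indices.contains(idx))
        .collect();
    rest.sort_by_key(|&idx| schema.field(idx).name().clone());
    indices.extend(rest);

    let fields: Vec<Arc<Field>> = indices
        .iter()
        .map(|&idx| Arc::clone(&schema.fields()[idx]))
        .collect();
    let columns = indices.iter().map(|&idx| batch.column(idx).clone()).collect();
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```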
The sanity check only asserted presence, not ordering. Now it verifies that host appears before service in the input (scrambled), which is the opposite of the sort-schema order (service before host).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
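A sketch of what the strengthened assertion checks, with column names taken from the commit message (`input_batch` stands in for the test's scrambled batch):

```rust
let schema = input_batch.schema();
let names: Vec<&str> = schema
    .fields()
    .iter()
    .map(|field| field.name().as_str())
    .collect();
let host_idx = names.iter().position(|name| *name == "host").unwrap();
let service_idx = names.iter().position(|name| *name == "service").unwrap();
// The input must be scrambled (host before service), i.e. the reverse of
// the sort-schema order (service before host); asserting presence alone
// would pass even on an already-ordered input.
assert!(host_idx < service_idx, "input columns should be scrambled");
```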
Contributor
@codex review

Codex Review: Didn't find any major issues. Can't wait for the next one!
mattmkim approved these changes on Apr 9, 2026
g-talbot added a commit that referenced this pull request on Apr 13, 2026
* feat: replace fixed MetricDataPoint fields with dynamic tag HashMap
* feat: replace ParquetField enum with constants and dynamic validation
* feat: derive sort order and bloom filters from batch schema
* feat: union schema accumulation and schema-agnostic ingest validation
* feat: dynamic column lookup in split writer
* feat: remove ParquetSchema dependency from indexing actors
* refactor: deduplicate test batch helpers
* lint
* feat(31): sort schema foundation — proto, parser, display, validation, window, TableConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: rustdoc link errors — use backticks for private items
* feat(31): compaction metadata types — extend split metadata, postgres model, field lookup
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): wire TableConfig into sort path, add compaction KV metadata
Wire TableConfig-driven sort order into ParquetWriter and add
self-describing Parquet file metadata for compaction:
- ParquetWriter::new() takes &TableConfig, resolves sort fields at
construction via parse_sort_fields() + ParquetField::from_name()
- sort_batch() uses resolved fields with per-column direction (ASC/DESC)
- SS-1 debug_assert verification: re-sort and check identity permutation
- build_compaction_key_value_metadata(): embeds sort_fields, window_start,
window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata
- SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct
- write_to_file_with_metadata() replaces write_to_file()
- prepare_write() shared method for bytes and file paths
- ParquetWriterConfig gains to_writer_properties_with_metadata()
- ParquetSplitWriter passes TableConfig through
- All callers in quickwit-indexing updated with TableConfig::default()
- 23 storage tests pass including META-07 self-describing roundtrip
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
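A hedged sketch of the self-describing metadata step, using the parquet crate's key-value API (key names follow the commit message; the exact keys and encodings in build_compaction_key_value_metadata() are assumptions):

```rust
use parquet::file::properties::WriterProperties;
use parquet::format::KeyValue;

/// Embed the compaction scope in the Parquet footer so each file is
/// self-describing (META-07).
fn compaction_kv_metadata(
    sort_fields: &str,
    window_start: i64,
    window_duration_secs: i64,
    num_merge_ops: u64,
    row_keys_base64: &str,
) -> Vec<KeyValue> {
    vec![
        KeyValue::new("sort_fields".to_string(), sort_fields.to_string()),
        KeyValue::new("window_start".to_string(), window_start.to_string()),
        KeyValue::new("window_duration".to_string(), window_duration_secs.to_string()),
        KeyValue::new("num_merge_ops".to_string(), num_merge_ops.to_string()),
        KeyValue::new("row_keys".to_string(), row_keys_base64.to_string()),
    ]
}

/// Attach the metadata when building writer properties.
fn writer_properties(kv_metadata: Vec<KeyValue>) -> WriterProperties {
    WriterProperties::builder()
        .set_key_value_metadata(Some(kv_metadata))
        .build()
}
```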
* feat(31): PostgreSQL migration 27 + compaction columns in stage/list/publish
Add compaction metadata to the PostgreSQL metastore:
Migration 27:
- 6 new columns: window_start, window_duration_secs, sort_fields,
num_merge_ops, row_keys, zonemap_regexes
- Partial index idx_metrics_splits_compaction_scope on
(index_uid, sort_fields, window_start) WHERE split_state = 'Published'
stage_metrics_splits:
- INSERT extended from 15 to 21 bind parameters for compaction columns
- ON CONFLICT SET updates all compaction columns
list_metrics_splits:
- PgMetricsSplit construction includes compaction fields (defaults from JSON)
Also fixes pre-existing compilation errors on upstream-10b-parquet-actors:
- Missing StageMetricsSplitsRequestExt import
- index_id vs index_uid type mismatches in publish/mark/delete
- IndexUid binding (to_string() for sqlx)
- ListMetricsSplitsResponseExt trait disambiguation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): close port gaps — split_writer metadata, compaction scope, publish validation
Close critical gaps identified during port review:
split_writer.rs:
- Store table_config on ParquetSplitWriter (not just pass-through)
- Compute window_start from batch time range using table_config.window_duration_secs
- Populate sort_fields, window_duration_secs, parquet_files on metadata before write
- Call write_to_file_with_metadata(Some(&metadata)) to embed KV metadata in Parquet
- Update size_bytes after write completes
metastore/mod.rs:
- Add window_start and sort_fields fields to ListMetricsSplitsQuery
- Add with_compaction_scope() builder method
metastore/postgres/metastore.rs:
- Add compaction scope filters (AND window_start = $N, AND sort_fields = $N) to list query
- Add replaced_split_ids count verification in publish_metrics_splits
- Bind compaction scope query parameters
ingest/config.rs:
- Add table_config: TableConfig field to ParquetIngestConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
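The window_start computation mentioned above reduces to aligning the batch's minimum timestamp down to a window boundary; a sketch (names assumed):

```rust
/// Start of the window containing `min_timestamp_secs`, for a table
/// configured with `window_duration_secs` windows.
fn window_start(min_timestamp_secs: i64, window_duration_secs: i64) -> i64 {
    // div_euclid floors toward negative infinity, so pre-epoch timestamps
    // still map to the window that contains them.
    min_timestamp_secs.div_euclid(window_duration_secs) * window_duration_secs
}
```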
* fix(31): final gap fixes — file-backed scope filter, META-07 test, dead code removal
- file_backed_index/mod.rs: Add window_start and sort_fields filtering
to metrics_split_matches_query() for compaction scope queries
- writer.rs: Add test_meta07_self_describing_parquet_roundtrip test
(writes compaction metadata to Parquet, reads back from cold file,
verifies all fields roundtrip correctly)
- fields.rs: Remove dead sort_order() method (replaced by TableConfig)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): correct postgres types for window_duration_secs and zonemap_regexes
Gap 1: Change window_duration_secs from i32 to Option<i32> in both
PgMetricsSplit and InsertableMetricsSplit. Pre-Phase-31 splits now
correctly map 0 → NULL in PostgreSQL, enabling Phase 32 compaction
queries to use `WHERE window_duration_secs IS NOT NULL` instead of
the fragile `WHERE window_duration_secs > 0`.
Gap 2: Change zonemap_regexes from String to serde_json::Value in
both structs. This maps directly to JSONB in sqlx, avoiding ambiguity
when PostgreSQL JSONB operators are used in Phase 34/35 zonemap pruning.
Gap 3: Add two missing tests:
- test_insertable_from_metadata_with_compaction_fields: verifies all 6
compaction fields round-trip through InsertableMetricsSplit
- test_insertable_from_metadata_pre_phase31_defaults: verifies pre-Phase-31
metadata produces window_duration_secs: None, zonemap_regexes: json!({})
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
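A sketch of the two type mappings, reduced to the relevant conversions (function names here are illustrative):

```rust
use serde_json::{json, Value};

/// Gap 1: pre-Phase-31 splits carry 0; store NULL so Phase 32 can filter
/// with `WHERE window_duration_secs IS NOT NULL`.
fn pg_window_duration_secs(window_duration_secs: i32) -> Option<i32> {
    (window_duration_secs != 0).then_some(window_duration_secs)
}

/// Gap 2: zonemap_regexes maps to JSONB via serde_json::Value,
/// defaulting to an empty object.
fn pg_zonemap_regexes(raw: Option<&str>) -> Value {
    raw.and_then(|json| serde_json::from_str(json).ok())
        .unwrap_or_else(|| json!({}))
}
```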
* style: rustfmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(31): add metrics split test suite to shared metastore_test_suite! macro
11 tests covering the full metrics split lifecycle:
- stage (happy path + non-existent index error)
- stage upsert (ON CONFLICT update)
- list by state, time range, metric name, compaction scope
- publish (happy path + non-existent split error)
- mark for deletion
- delete (happy path + idempotent non-existent)
Tests are generic and run against both file-backed and PostgreSQL backends.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(31): read compaction columns in list_metrics_splits, fix cleanup_index FK
* fix(31): correct error types for non-existent metrics splits
- publish_metrics_splits: return NotFound (not FailedPrecondition) when
staged splits don't exist
- delete_metrics_splits: succeed silently (idempotent) for non-existent
splits instead of returning FailedPrecondition
- Tests now assert the correct error types on both backends
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
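A sketch of the corrected semantics with a simplified error type (the real code uses the metastore's error enum):

```rust
#[derive(Debug)]
enum MetastoreError {
    NotFound { split_ids: Vec<String> },
}

/// Publishing splits that were never staged is a NotFound error
/// (previously FailedPrecondition).
fn publish(staged: &[String], requested: &[String]) -> Result<(), MetastoreError> {
    let missing: Vec<String> = requested
        .iter()
        .filter(|&split_id| !staged.contains(split_id))
        .cloned()
        .collect();
    if !missing.is_empty() {
        return Err(MetastoreError::NotFound { split_ids: missing });
    }
    Ok(())
}

/// Deleting non-existent splits succeeds silently (idempotent).
fn delete(existing: &mut Vec<String>, requested: &[String]) {
    existing.retain(|split_id| !requested.contains(split_id));
}
```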
* style: rustfmt metastore tests and postgres
* fix(31): address PR review — align metrics_splits with splits table
- Migration 27: add maturity_timestamp, delete_opstamp, node_id columns
and publish_timestamp trigger to match the splits table (Paul's review)
- ListMetricsSplitsQuery: adopt FilterRange<i64> for time_range (matching
log-side pattern), single time_range field for both read and compaction
paths, add node_id/delete_opstamp/update_timestamp/create_timestamp/
mature filters to close gaps with ListSplitsQuery
- Use SplitState enum instead of stringly-typed Vec<String> for split_states
- StoredMetricsSplit: add create_timestamp, node_id, delete_opstamp,
maturity_timestamp so file-backed metastore can filter on them locally
- File-backed filter: use FilterRange::overlaps_with() for time range and
window intersection, apply all new filters matching log-side predicate
- Postgres: intersection semantics for window queries, FilterRange-based
SQL generation for all range filters
- Fix InsertableMetricsSplit.window_duration_secs from Option<i32> to i32
- Rename two-letter variables (ws, sf, dt) throughout
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
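A sketch of the overlap predicate adopted above, with FilterRange reduced to optional inclusive bounds (the real type is richer):

```rust
use std::ops::RangeInclusive;

/// FilterRange reduced to optional inclusive bounds for illustration.
struct FilterRange<T> {
    start: Option<T>,
    end: Option<T>,
}

impl<T: Ord> FilterRange<T> {
    /// True when the filter and the candidate range intersect; an unset
    /// bound matches everything on that side.
    fn overlaps_with(&self, range: RangeInclusive<T>) -> bool {
        let starts_before_range_ends =
            self.start.as_ref().map_or(true, |start| start <= range.end());
        let ends_after_range_starts =
            self.end.as_ref().map_or(true, |end| end >= range.start());
        starts_before_range_ends && ends_after_range_starts
    }
}
```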
* style: fix rustfmt nightly formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(31): add shared invariants module to quickwit-dst
Extract duplicated invariant logic into a shared `invariants/` module
within `quickwit-dst`. This is the "single source of truth" layer in
the verification pyramid — used by stateright models, production
debug_assert checks, and (future) Datadog metrics emission.
Key changes:
- `invariants/registry.rs`: InvariantId enum (20 variants) with Display
- `invariants/window.rs`: shared window_start_secs(), is_valid_window_duration()
- `invariants/sort.rs`: generic compare_with_null_ordering() for SS-2
- `invariants/check.rs`: check_invariant! macro wrapping debug_assert
- stateright gated behind `model-checking` feature (optional dep)
- quickwit-parquet-engine uses shared functions and check_invariant!
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
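A sketch of the SS-2 comparator named above (signature assumed):

```rust
use std::cmp::Ordering;

/// Compare two optional values under an explicit null ordering, so the
/// stateright models and production debug_asserts agree on NULL placement.
fn compare_with_null_ordering<T: Ord>(
    lhs: Option<&T>,
    rhs: Option<&T>,
    nulls_first: bool,
) -> Ordering {
    match (lhs, rhs) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(lhs), Some(rhs)) => lhs.cmp(rhs),
    }
}
```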
* feat(31): check invariants in release builds, add pluggable recorder
The check_invariant! macro now always evaluates the condition — not just
in debug builds. This implements Layer 4 (Production) of the verification
stack: invariant checks run in release, with results forwarded to a
pluggable InvariantRecorder for Datadog metrics emission.
- Debug builds: panic on violation (debug_assert, Layer 3)
- All builds: evaluate condition, call recorder (Layer 4)
- set_invariant_recorder() wires up statsd at process startup
- No recorder registered = no-op (single OnceLock load)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
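A sketch of the always-evaluate design (trait and macro shapes assumed from the commit message):

```rust
use std::sync::OnceLock;

/// Pluggable sink for invariant check results (Layer 4).
pub trait InvariantRecorder: Send + Sync {
    fn record(&self, invariant: &str, violated: bool);
}

static RECORDER: OnceLock<Box<dyn InvariantRecorder>> = OnceLock::new();

/// Wire up a recorder once at process startup; no recorder means the
/// check is a no-op apart from a single OnceLock load.
pub fn set_invariant_recorder(recorder: Box<dyn InvariantRecorder>) {
    let _ = RECORDER.set(recorder);
}

/// Always evaluates the condition; forwards the result to the recorder in
/// all builds, and panics on violation only in debug builds (Layer 3).
macro_rules! check_invariant {
    ($invariant_id:expr, $condition:expr) => {{
        let holds = $condition;
        if let Some(recorder) = RECORDER.get() {
            recorder.record($invariant_id, !holds);
        }
        debug_assert!(holds, "invariant violated: {}", $invariant_id);
    }};
}
```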
* feat(31): wire invariant recorder to DogStatsD metrics
Emit cloudprem.pomsky.invariant.checked and .violated counters with
invariant label via the metrics crate / DogStatsD exporter at process
startup, completing Layer 4 of the verification stack.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
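Building on the `InvariantRecorder` sketch above, a hedged sketch of the DogStatsD wiring (metrics crate 0.22+ macro style; counter names from the commit message):

```rust
/// Recorder that forwards invariant results to the `metrics` crate, whose
/// exporter ships them to DogStatsD.
struct StatsdInvariantRecorder;

impl InvariantRecorder for StatsdInvariantRecorder {
    fn record(&self, invariant: &str, violated: bool) {
        metrics::counter!(
            "cloudprem.pomsky.invariant.checked",
            "invariant" => invariant.to_string()
        )
        .increment(1);
        if violated {
            metrics::counter!(
                "cloudprem.pomsky.invariant.violated",
                "invariant" => invariant.to_string()
            )
            .increment(1);
        }
    }
}
```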
* fix: license headers + cfg(not(test)) for quickwit-dst and quickwit-cli
* chore: regenerate third-party license file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt nightly formatting for quickwit-dst and quickwit-parquet-engine
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add CLAUDE.md and docs/internals architecture documentation
Ports CLAUDE.md (development guide, coding standards, known pitfalls) and
the full docs/internals tree including ADRs, gap analyses, TLA+ specs,
verification guides, style references, and compaction architecture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: split CLAUDE.md into repo context + opt-in /sesh-mode skill
- Move verification-first workflow (TLA+, DST, formal specs) to /sesh-mode skill
- Keep repo knowledge in CLAUDE.md (pitfalls, reliability rules, testing, docker, commands)
- Remove Crate Map (derivable from filesystem)
- Remove Coding Style bullet summary (CODE_STYLE.md is linked)
- Fix relative links in SKILL.md for .claude/skills/sesh-mode/ path
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add machete, cargo doc, and fmt details to CI checklist in CLAUDE.md
* review: parquet_file singular, proto doc link, fix metastore accessor
* style: fix rustfmt nightly comment wrapping in split metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use plain code span for proto reference to avoid broken rustdoc link
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* Update quickwit/quickwit-parquet-engine/src/table_config.rs
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
* style: rustfmt long match arm in default_sort_fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: make parquet_file field backward-compatible in MetricsSplitMetadata
Pre-existing splits were serialized before the parquet_file field was
added, so their JSON doesn't contain it. Adding #[serde(default)]
makes deserialization fall back to empty string for old splits.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
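The fix is a one-attribute change; sketched with the struct trimmed to the relevant field:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct MetricsSplitMetadata {
    // ... other fields elided ...
    /// Absent in JSON written before this field existed; #[serde(default)]
    /// falls back to an empty string for those old splits.
    #[serde(default)]
    parquet_file: String,
}
```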
* fix: handle empty-column batches in accumulator flush
When the commit timeout fires and the accumulator contains only
zero-column batches, union_fields is empty and concat_batches fails
with "must either specify a row count or at least one column".
Now flush_internal treats empty union_fields the same as empty
pending_batches — resets state and returns None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
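A sketch of the guard, with the accumulator reduced to the two fields named above (surrounding types assumed):

```rust
use arrow::datatypes::Fields;
use arrow::record_batch::RecordBatch;

/// Accumulator reduced to its two relevant fields for illustration.
struct Accumulator {
    union_fields: Fields,
    pending_batches: Vec<RecordBatch>,
}

impl Accumulator {
    fn flush_internal(&mut self) -> Option<Vec<RecordBatch>> {
        if self.union_fields.is_empty() || self.pending_batches.is_empty() {
            // Zero-column batches would make concat_batches fail with
            // "must either specify a row count or at least one column",
            // so treat an empty union schema like an empty batch list:
            // reset state and return None.
            self.union_fields = Fields::empty();
            self.pending_batches.clear();
            return None;
        }
        Some(std::mem::take(&mut self.pending_batches))
    }
}
```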
* style: rustfmt check_invariant macro argument
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: move code changes to separate PR
Reverts c8bf8d7, cafcac5, a088f53 — these are code changes
(delete_metrics_splits error handling, doc comment tweaks) that
don't belong in a docs-only PR. They will land in a separate PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove ADR-004 from upstream PR — moving to DataDog/pomsky
This ADR contains company-specific information and should live
in the private fork, not in the upstream quickwit-oss repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: rebrand for upstream — remove Datadog/Pomsky/Quickhouse references
- Rewrite CLAUDE.md as generic Quickwit AI development guide
- Replace Quickhouse-Pomsky -> Quickwit branding across all docs
- Replace "Datadog" observability references with generic
"production observability" language
- Remove "Husky (Datadog)" qualifier from gap docs (keep Husky
citations — the blog post is public)
- Generalize internal knowledge (query rate numbers, product-specific
lateness guarantees)
- Remove PomChi reference, private Google Doc link
- Add docs/internals/UPSTREAM-CANDIDATES.md for pomsky tracking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove ClickHouse references, track aspirational items
- Remove all ClickHouse/ClickStack references from gap docs and ADRs
(keep Prometheus, Mimir, InfluxDB, Husky as prior art)
- Restore gap-005 Option C (compaction-time dedup) without ClickHouse citation
- Mark /sesh-mode reference in CLAUDE.md as aspirational
- Add aspirational items section to UPSTREAM-CANDIDATES.md tracking
items described in docs but not yet implemented (TLA+ specs, DST,
Kani, Bloodhound, performance baselines, benchmark binaries)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: fix aspirational items — TLA+ specs and Stateright models exist
UPSTREAM-CANDIDATES.md incorrectly stated TLA+ specs and Stateright
models don't exist. They do (contributed in #6246): ParquetDataModel.tla,
SortSchema.tla, TimeWindowedCompaction.tla, plus quickwit-dst invariants
and Stateright model tests. Updated to accurately reflect that the
remaining aspirational piece is the simulation infrastructure (SimClock,
FaultInjector, etc.).
Also removed the /sesh-mode aspirational entry — it's actively being
used and the underlying specs/models are real.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: add .planning to .gitignore
Prevents GSD planning artifacts from being committed to the repository.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* revert: remove pomsky-specific Makefile changes
Reverts test env vars (CP_ENABLE_REVERSE_CONNECTION) and
load-cloudprem-ui target — these are pomsky-specific and
don't belong in upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add pitfall rule against silently swallowing unexpected state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: enforce physical column ordering in Parquet files for two-GET streaming merge (#6281)
* feat: enforce physical column ordering in Parquet files
Sort schema columns are written first (in their configured sort order),
followed by all remaining data columns in alphabetical order. This
physical layout enables a two-GET streaming merge during compaction:
the footer GET provides the schema and offsets, then a single streaming
GET from the start of the row group delivers sort columns first —
allowing the compactor to compute the global merge order before data
columns arrive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: verify input column order is actually scrambled
The sanity check only asserted presence, not ordering. Now it
verifies that host appears before service in the input (scrambled)
which is the opposite of the sort-schema order (service before host).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: rustfmt test code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: collapse nested if to satisfy clippy::collapsible_if
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Matthew Kim <matthew.kim@datadoghq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Verdonk Lucas <lucas.verdonk@datadoghq.com>
Summary
Implementation
- `reorder_columns()` method in `ParquetWriter` that reorders a `RecordBatch`'s columns before writing
- Called from `prepare_write()` after sorting rows but before building `WriterProperties`
- `SortingColumn` metadata indices automatically reflect the reordered schema since they're computed on the reordered batch

Test plan

- `test_column_ordering_sort_columns_first_then_alphabetical` — verifies in-memory reordering logic
- `test_column_ordering_preserved_in_parquet_file` — reads back a written Parquet file and verifies physical column order from the schema descriptor

🤖 Generated with Claude Code
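The `SortingColumn` point above follows directly from the layout rule: after reordering, the sort columns occupy the first leaf indices, so the sorting metadata is just 0..k with per-column direction. A sketch (null ordering is an assumption):

```rust
use parquet::format::SortingColumn;

/// Sorting metadata for the first `num_sort_fields` columns, which are
/// the sort-schema columns after reordering.
fn sorting_columns(num_sort_fields: usize, descending: &[bool]) -> Vec<SortingColumn> {
    (0..num_sort_fields)
        .map(|idx| SortingColumn {
            column_idx: idx as i32,
            descending: descending[idx],
            nulls_first: false, // assumed; the real writer derives this
        })
        .collect()
}
```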