[metrics] Support DDSketch in the parquet pipeline by mattmkim · Pull Request #6257 · quickwit-oss/quickwit

mattmkim · 2026-03-31T15:14:07Z

Description

This PR can be reviewed commit by commit.

This PR updates the parquet pipeline to process DDSketches. See https://datadoghq.atlassian.net/wiki/spaces/QKHS/pages/6291357728/DDSketch+in+Parquet for more information about the DDSketch spec.

How was this PR tested?

Describe how you tested this PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 727f085864

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

fulmicoton-dd · 2026-04-01T08:44:11Z

I cannot review this PR due to lack of context. I know what DDSketches are, but I do not know what they are used for in the context of the metrics ingestion pipeline, why they are stored in different files, etc.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e7d6a90ae

ppknap · 2026-04-10T14:55:32Z

+
+            let keys_inner = keys_builder.values();
+            for &k in &dp.keys {
+                keys_inner.append_value(k);


Where do we store the length of keys/counts? I.e. how do we recover the number of elements per data_point from _inner arrays?

They are stored implicitly by Arrow: a list array keeps an array of offsets, so the length of row i is offsets[i+1] - offsets[i].
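
A minimal sketch of the offsets layout described in this reply, using plain vectors rather than the Arrow API. `ListColumn`, `push_row`, `row_len`, and `row` are hypothetical names for illustration, and the `i16` element type is an assumption.

```rust
/// Toy model of a variable-length list column: all rows share one flat
/// `values` buffer, and `offsets` marks where each row starts and ends.
/// (Hypothetical type for illustration; not the Arrow API.)
struct ListColumn {
    values: Vec<i16>,
    offsets: Vec<usize>, // always holds (number of rows + 1) entries, starting at 0
}

impl ListColumn {
    fn new() -> Self {
        ListColumn { values: Vec::new(), offsets: vec![0] }
    }

    /// Appends one row by extending the flat buffer and recording where it ends.
    fn push_row(&mut self, keys: &[i16]) {
        self.values.extend_from_slice(keys);
        self.offsets.push(self.values.len());
    }

    /// Length of row `i`, recovered purely from the offsets.
    fn row_len(&self, i: usize) -> usize {
        self.offsets[i + 1] - self.offsets[i]
    }

    /// Slice of the flat buffer that belongs to row `i`.
    fn row(&self, i: usize) -> &[i16] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

No per-row length is stored anywhere: an empty row simply repeats the previous offset.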

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37f4298f66

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a38c47b7c

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dff8e08fa0

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40cfe5a158

chatgpt-codex-connector · 2026-04-14T23:06:43Z

        for metadata in &splits_metadata {
-            let insertable =
-                InsertableMetricsSplit::from_metadata(metadata, MetricsSplitState::Staged)
-                    .map_err(|e| MetastoreError::JsonSerializeError {
-                        struct_name: "MetricsSplitMetadata".to_string(),
-                        message: e.to_string(),
-                    })?;
+            let insertable = InsertableParquetSplit::from_metadata(metadata, SplitState::Staged)
+                .map_err(|err| MetastoreError::JsonSerializeError {


Reject split-kind mismatches during parquet staging

stage_parquet_splits_impl chooses the destination table from the RPC (kind) but never verifies that each ParquetSplitMetadata.kind matches that table. A mixed or misrouted request can therefore persist sketch metadata into metrics_splits (or vice versa), violating table invariants and causing downstream routing/listing logic that relies on split kind to query the wrong API/table. Add an explicit metadata.kind == kind check and fail fast on mismatch before building insert rows.
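
A minimal sketch of the fail-fast check suggested here. The types are simplified stand-ins: the `split_id` field, the `String` error type, and the helper name `check_split_kinds` are assumptions, not the PR's actual code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParquetSplitKind {
    Metrics,
    Sketches,
}

/// Simplified stand-in for the real metadata struct.
struct ParquetSplitMetadata {
    split_id: String,
    kind: ParquetSplitKind,
}

/// Returns an error naming the first split whose kind differs from the
/// kind the staging RPC targets, before any insert rows are built.
fn check_split_kinds(
    expected: ParquetSplitKind,
    splits: &[ParquetSplitMetadata],
) -> Result<(), String> {
    for metadata in splits {
        if metadata.kind != expected {
            return Err(format!(
                "split `{}` has kind {:?}, expected {:?}",
                metadata.split_id, metadata.kind, expected
            ));
        }
    }
    Ok(())
}
```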

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a35fdb8600

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bbfb7ff26

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d430881b0

chatgpt-codex-connector · 2026-04-15T00:23:37Z

+        let index_uid = request.index_uid().clone();
+        let splits_metadata = request.deserialize_splits_metadata()?;
+


Enforce request index UID for staged sketch metadata

stage_sketch_splits deserializes and forwards splits_metadata_json without checking that each ParquetSplitMetadata.index_uid matches request.index_uid, so a caller can submit StageSketchSplitsRequest.index_uid = A while embedding metadata for B. In file-backed mode this stores rows under index A whose serialized metadata claims B, which breaks request-scoped invariants and can cause later list/publish responses to carry inconsistent index identities; this path should validate equality (or overwrite metadata index UID from the request) before staging.

rishabh · 2026-04-16T03:27:51Z

+
+/// Returns whether the given index ID uses the Parquet/DataFusion pipeline.
+pub fn is_parquet_pipeline_index(index_id: &str) -> bool {
+    is_metrics_index(index_id) || is_sketches_index(index_id)


I think this is fine for now, but we'd have to extend this function every time we add a new Parquet pipeline.

Also, not every index that starts with sketches- necessarily has to be a Parquet pipeline.

The "fix" seems to be creating something like a ParquetMapping, and that's a whole project of its own, so we can get back to it later.

rishabh · 2026-04-16T15:19:13Z

+    update_timestamp TIMESTAMP NOT NULL DEFAULT (CURRENT_TIMESTAMP AT TIME ZONE 'UTC'),
+    publish_timestamp TIMESTAMP,
+
+    FOREIGN KEY(index_uid) REFERENCES indexes(index_uid)


Should we add an ON DELETE CASCADE here? If we ever need to delete an index, we'd have to clean up this table ourselves first.

The metrics table also doesn't include this.

yeah, will add

will need another migration for metrics https://datadoghq.atlassian.net/browse/CLOUDPREM-556

chatgpt-codex-connector

💡 Codex Review

quickwit/quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs

Lines 2419 to 2421 in 65e525e

            Bound::Included(_) => {
                sql.push_str(&format!(" AND {} >= ${}", range_col_end, param_idx));
                param_idx += 1;

Use strict overlap bounds for parquet time-range filtering

time_range_end/window_start + window_duration_secs are stored as exclusive ends (e.g. extract_time_range writes end + 1), but this overlap predicate uses >= for the lower bound. That includes adjacent, non-overlapping splits whose end equals the query start (notably previous compaction windows), so list queries can return out-of-scope splits and drive compaction/planning on the wrong window.
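
A minimal sketch of why the comparison has to be strict when the split end is exclusive. The function name and the plain `i64` timestamps are assumptions for illustration; the real predicate is built as SQL.

```rust
/// Lower-bound half of an overlap test for a split that covers the
/// half-open interval [start, end_exclusive).
/// The last timestamp inside the split is `end_exclusive - 1`, so the split
/// reaches an inclusive query start only if `end_exclusive > query_start`.
fn reaches_query_start(split_end_exclusive: i64, query_start: i64) -> bool {
    split_end_exclusive > query_start
}
```

With `>=` instead of `>`, a split covering [1000, 2000) would match a query starting at 2000 even though it holds no timestamp at or after 2000.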

chatgpt-codex-connector · 2026-04-16T21:46:58Z

+        .values()
+        .filter(|split| parquet_split_matches_query(split, query))
+        .cloned()
+        .collect()
+}


Enforce query limit in file-backed parquet split listing

The file-backed parquet listing path ignores ListParquetSplitsQuery.limit and always collects every matching split. Large indexes therefore return unbounded results (memory/latency regression) and behavior diverges from the Postgres implementation, which applies LIMIT.
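
A minimal sketch of applying an optional limit in an iterator-based listing. `Split`, `ListQuery`, its `min_rows` filter, and `list_splits` are hypothetical simplifications; only the presence of a `limit` on the query comes from the comment above.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Split {
    split_id: String,
    num_rows: u64,
}

/// Hypothetical query: a toy filter plus an optional cap on the result size.
struct ListQuery {
    min_rows: u64,
    limit: Option<usize>,
}

/// Filters first, then stops after `limit` matches instead of collecting
/// every matching split.
fn list_splits(splits: &BTreeMap<String, Split>, query: &ListQuery) -> Vec<Split> {
    splits
        .values()
        .filter(|split| split.num_rows >= query.min_rows)
        .take(query.limit.unwrap_or(usize::MAX))
        .cloned()
        .collect()
}
```

Placing `take` after `filter` matters: the limit then counts matching splits rather than scanned ones.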

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b3444148c

chatgpt-codex-connector · 2026-04-20T15:07:16Z

-    /// The staged and uploaded splits.
-    pub new_splits: Vec<MetricsSplitMetadata>,
+    /// The staged and uploaded splits (metrics or sketches).
+    pub new_splits: Vec<ParquetSplitMetadata>,


Route sketch updates through sketch publish RPC

Fresh evidence: this update is now documented as carrying sketch splits, but the only Handler<ParquetSplitsUpdate> in metrics_pipeline/publisher_impl.rs still constructs PublishMetricsSplitsRequest unconditionally. In a sketches-* pipeline, the uploader stages/uploads to sketch_splits, then the publisher calls publish_metrics_splits with those sketch IDs, so they are never published in the sketch table and the call can fail with NotFound; carry the split/index kind through this update and dispatch to publish_sketch_splits for sketch pipelines.
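
A minimal sketch of carrying the split kind through the update and dispatching on it. `PublishRequest` and `build_publish_request` are hypothetical stand-ins for the two publish RPC requests named in the comment.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParquetSplitKind {
    Metrics,
    Sketches,
}

/// Hypothetical stand-in for the two publish requests.
#[derive(Debug, PartialEq, Eq)]
enum PublishRequest {
    Metrics { split_ids: Vec<String> },
    Sketches { split_ids: Vec<String> },
}

/// Picks the publish request from the kind carried by the update, rather
/// than always building the metrics request.
fn build_publish_request(kind: ParquetSplitKind, split_ids: Vec<String>) -> PublishRequest {
    match kind {
        ParquetSplitKind::Metrics => PublishRequest::Metrics { split_ids },
        ParquetSplitKind::Sketches => PublishRequest::Sketches { split_ids },
    }
}
```

An exhaustive `match` means a future split kind fails to compile until it gets its own publish path.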

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5bf77ba44

chatgpt-codex-connector · 2026-04-20T15:23:07Z

-        let mut builder = MetricsSplitMetadata::builder()
+        let mut builder = match self.kind {
+            ParquetSplitKind::Metrics => ParquetSplitMetadata::metrics_builder(),
+            ParquetSplitKind::Sketches => ParquetSplitMetadata::sketches_builder(),


Populate sketch split metadata with non-service tags

When this new sketch writer path builds Sketches metadata, it still only adds the service tag below, so sketch batches that contain supported low-cardinality columns like env, datacenter, region, or host get staged with those metadata fields empty. Both Postgres and file-backed list_sketch_splits filter on metadata.low_cardinality_tags/the derived tag columns, so queries constrained by those tags will miss otherwise valid sketch splits; extract and add all supported low-cardinality tag columns when building sketch split metadata.
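
A minimal sketch of collecting every supported low-cardinality tag rather than only `service`. The tag list mirrors the columns named in the comment and may not match the real one; modelling a batch as one map per row and the helper name `collect_low_cardinality_tags` are assumptions for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Tag columns named in the review comment; the real list may differ.
const LOW_CARDINALITY_TAGS: [&str; 5] = ["service", "env", "datacenter", "region", "host"];

/// Collects the distinct values of each supported tag column.
/// `rows` is a toy stand-in for a record batch: one column -> value map per row.
fn collect_low_cardinality_tags(
    rows: &[BTreeMap<String, String>],
) -> BTreeMap<String, BTreeSet<String>> {
    let mut tags: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for row in rows {
        for tag in LOW_CARDINALITY_TAGS {
            // Columns outside the supported list are ignored.
            if let Some(value) = row.get(tag) {
                tags.entry(tag.to_string()).or_default().insert(value.clone());
            }
        }
    }
    tags
}
```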

chatgpt-codex-connector

💡 Codex Review

quickwit/quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs

Lines 2432 to 2436 in cbd7561

                sql.push_str(&format!(" AND {} <= ${}", range_col_start, param_idx));
                param_idx += 1;
            }
            Bound::Excluded(_) => {
                sql.push_str(&format!(" AND {} < ${}", range_col_start, param_idx));

Use split-start predicates for upper time bounds

For bounded parquet list queries, this upper-bound predicate is reversed: a query like [1500, 3500] against a split [1000, 2000) requires 3500 <= time_range_start, so the overlapping split is filtered out, while later non-overlapping splits can pass the lower-bound check. This affects the new shared Postgres list path used for sketch splits as well; the upper bound should constrain the split start against the query end (range_col_start <= $param, and the analogous operator for excluded bounds), matching the file-backed FilterRange::overlaps_with behavior.
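
A minimal sketch of the upper-bound half of the overlap test as this comment describes it: the split start is compared against the query end. The function name and `i64` timestamps are assumptions; the real predicate is built as SQL.

```rust
use std::ops::Bound;

/// Upper-bound half of an overlap test: a split can only overlap the query
/// if it starts at or before an inclusive query end, or strictly before an
/// exclusive one.
fn satisfies_upper_bound(split_start: i64, query_end: Bound<i64>) -> bool {
    match query_end {
        Bound::Included(end) => split_start <= end,
        Bound::Excluded(end) => split_start < end,
        Bound::Unbounded => true,
    }
}
```

For the example in the comment, a split starting at 1000 passes an inclusive query end of 3500, so the [1000, 2000) split is no longer filtered out.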

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f196d3a0c1

chatgpt-codex-connector · 2026-04-20T17:07:58Z

+                Ok(_) => {
+                    debug!("committing transaction");
+                    tx.commit().await?;


Roll back partially staged parquet splits

For sketch staging requests where one split already exists outside Staged but another split is new, the INSERT ... ON CONFLICT ... WHERE ... RETURNING succeeds for only the new subset, and this transaction is committed before the later upserted_split_ids.len() check reports FailedPrecondition. The uploader then aborts before uploading files, leaving staged metadata for parquet files that were never uploaded; keep the returned-count validation inside the transaction so the partial insert rolls back.
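
A minimal in-memory sketch of keeping the returned-count check inside the transaction. `SplitTable` and `stage_all` are hypothetical; a cloned vector stands in for an open database transaction, and skipping published IDs loosely mimics the conditional upsert.

```rust
/// Toy stand-in for a split table (hypothetical, not the metastore API).
#[derive(Default)]
struct SplitTable {
    staged: Vec<String>,
    published: Vec<String>,
}

impl SplitTable {
    /// Stages all `split_ids` or none of them.
    fn stage_all(&mut self, split_ids: &[String]) -> Result<(), String> {
        // Work on a copy, mimicking writes made inside an open transaction.
        let mut pending = self.staged.clone();
        let mut upserted = 0;
        for split_id in split_ids {
            if self.published.contains(split_id) {
                // Conflict with a split outside the staged state: skipped.
                continue;
            }
            if !pending.contains(split_id) {
                pending.push(split_id.clone());
            }
            upserted += 1;
        }
        // Validate the count before committing; returning early drops
        // `pending`, which plays the role of a rollback.
        if upserted != split_ids.len() {
            return Err(format!(
                "only {} of {} splits could be staged",
                upserted,
                split_ids.len()
            ));
        }
        self.staged = pending; // commit
        Ok(())
    }
}
```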

mattmkim force-pushed the matthew.kim/parquet-sketches branch from db0e0db to 86c034b Compare March 31, 2026 21:14

mattmkim changed the title ~~[draft] parquet ddsketch engine~~ [metrics] Support DDSketch in the parquet pipeline Mar 31, 2026

mattmkim marked this pull request as ready for review March 31, 2026 21:30

mattmkim force-pushed the matthew.kim/parquet-sketches branch from c3fc790 to 2261237 Compare March 31, 2026 21:38

chatgpt-codex-connector Bot reviewed Mar 31, 2026

Comment thread quickwit/quickwit-indexing/src/actors/publisher.rs Outdated

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated

Base automatically changed from matthew.kim/metrics-wide-schema to main April 6, 2026 19:48

mattmkim force-pushed the matthew.kim/parquet-sketches branch from 2261237 to 6e7d6a9 Compare April 7, 2026 18:50

chatgpt-codex-connector Bot reviewed Apr 7, 2026

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated

mattmkim force-pushed the matthew.kim/parquet-sketches branch from 6e7d6a9 to 7490fe3 Compare April 8, 2026 19:15

g-talbot mentioned this pull request Apr 9, 2026

feat: compute deterministic timeseries_id column at ingest #6286

Merged

3 tasks

ppknap reviewed Apr 10, 2026

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-parquet-engine/src/storage/split_writer.rs Outdated

Comment thread quickwit/quickwit-parquet-engine/src/split/metadata.rs

Comment thread quickwit/quickwit-metastore/migrations/postgresql/28_create-sketch-splits.up.sql

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-parquet-engine/src/storage/writer.rs Outdated

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/file_backed_index/mod.rs

mattmkim force-pushed the matthew.kim/parquet-sketches branch from a35fdb8 to 3bbfb7f Compare April 14, 2026 23:27

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs Outdated

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/parquet_model.rs

chatgpt-codex-connector Bot reviewed Apr 14, 2026

Comment thread quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs Outdated

chatgpt-codex-connector Bot reviewed Apr 15, 2026

mattmkim force-pushed the matthew.kim/parquet-sketches branch from 1d43088 to ad84838 Compare April 15, 2026 05:27

mattmkim added 5 commits April 14, 2026 22:36

generalize metric splits to parquet splits

d2888d2

add sketch schema + processor

5712a06

sketch split support in metastore

c911f0b

wire sketch pipeline into indexing actors and control plane

a514964

drive by fix: pass sort fields to bloom filter config

0655391

mattmkim force-pushed the matthew.kim/parquet-sketches branch from ad84838 to 0655391 Compare April 15, 2026 05:37

rishabh approved these changes Apr 16, 2026

rishabh reviewed Apr 16, 2026

mattmkim added 2 commits April 16, 2026 14:38

Merge remote-tracking branch 'origin/main' into matthew.kim/parquet-s…

3af7351

…ketches

address comments, lint

65e525e

chatgpt-codex-connector Bot reviewed Apr 16, 2026

mattmkim added 3 commits April 16, 2026 14:53

address comments, lint

2e44ba4

linter

e6d03ad

Merge remote-tracking branch 'origin/main' into matthew.kim/parquet-s…

5b34441

…ketches

chatgpt-codex-connector Bot reviewed Apr 20, 2026

linter

c5bf77b

chatgpt-codex-connector Bot reviewed Apr 20, 2026

fix unit test

cbd7561

chatgpt-codex-connector Bot reviewed Apr 20, 2026

mattmkim enabled auto-merge (squash) April 20, 2026 16:57

Merge branch 'main' into matthew.kim/parquet-sketches

f196d3a

mattmkim merged commit c2a0235 into main Apr 20, 2026
8 checks passed

chatgpt-codex-connector Bot reviewed Apr 20, 2026

mattmkim deleted the matthew.kim/parquet-sketches branch April 20, 2026 17:07
