[metrics] Support DDSketch in the parquet pipeline#6257
Conversation
db0e0db to
86c034b
Compare
c3fc790 to
2261237
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 727f085864
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
i cannot review this PR due to lack of context. I know what DDSketch are, but I do not know what they are used for in the context the metrics ingestion pipeline, why they are stored in different files, etc. |
2261237 to
6e7d6a9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e7d6a90ae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
6e7d6a9 to
7490fe3
Compare
|
|
||
| let keys_inner = keys_builder.values(); | ||
| for &k in &dp.keys { | ||
| keys_inner.append_value(k); |
There was a problem hiding this comment.
Where do we store the length of keys/counts? I.e. how do we recover the number of elements per data_point from _inner arrays?
There was a problem hiding this comment.
they are stored implicitly by Arrow - it stores an array of offsets, so you can get the length of row i with offsets[i+1] - offsets[i]
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37f4298f66
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a38c47b7c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dff8e08fa0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40cfe5a158
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for metadata in &splits_metadata { | ||
| let insertable = | ||
| InsertableMetricsSplit::from_metadata(metadata, MetricsSplitState::Staged) | ||
| .map_err(|e| MetastoreError::JsonSerializeError { | ||
| struct_name: "MetricsSplitMetadata".to_string(), | ||
| message: e.to_string(), | ||
| })?; | ||
| let insertable = InsertableParquetSplit::from_metadata(metadata, SplitState::Staged) | ||
| .map_err(|err| MetastoreError::JsonSerializeError { |
There was a problem hiding this comment.
Reject split-kind mismatches during parquet staging
stage_parquet_splits_impl chooses the destination table from the RPC (kind) but never verifies that each ParquetSplitMetadata.kind matches that table. A mixed or misrouted request can therefore persist sketch metadata into metrics_splits (or vice versa), violating table invariants and causing downstream routing/listing logic that relies on split kind to query the wrong API/table. Add an explicit metadata.kind == kind check and fail fast on mismatch before building insert rows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a35fdb8600
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
a35fdb8 to
3bbfb7f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3bbfb7ff26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3bbfb7ff26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d430881b0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let index_uid = request.index_uid().clone(); | ||
| let splits_metadata = request.deserialize_splits_metadata()?; | ||
|
|
There was a problem hiding this comment.
Enforce request index UID for staged sketch metadata
stage_sketch_splits deserializes and forwards splits_metadata_json without checking that each ParquetSplitMetadata.index_uid matches request.index_uid, so a caller can submit StageSketchSplitsRequest.index_uid = A while embedding metadata for B. In file-backed mode this stores rows under index A whose serialized metadata claims B, which breaks request-scoped invariants and can cause later list/publish responses to carry inconsistent index identities; this path should validate equality (or overwrite metadata index UID from the request) before staging.
Useful? React with 👍 / 👎.
1d43088 to
ad84838
Compare
ad84838 to
0655391
Compare
|
|
||
| /// Returns whether the given index ID uses the Parquet/DataFusion pipeline. | ||
| pub fn is_parquet_pipeline_index(index_id: &str) -> bool { | ||
| is_metrics_index(index_id) || is_sketches_index(index_id) |
There was a problem hiding this comment.
I think this is fine for now, but we'd be adding every parquet pipeline here every time we add one.
Also, every index that starts with sketches- doesn't necessarily have to be a Parquet pipeline.
The "fix" seems to be creating something like a ParquetMapping, and that's like a whole thing, so we can get back to it later.
| update_timestamp TIMESTAMP NOT NULL DEFAULT (CURRENT_TIMESTAMP AT TIME ZONE 'UTC'), | ||
| publish_timestamp TIMESTAMP, | ||
|
|
||
| FOREIGN KEY(index_uid) REFERENCES indexes(index_uid) |
There was a problem hiding this comment.
Should we add a ON DELETE CASCADE here? If we ever need to delete this, we'd have to clean up the table ourselves first.
The metrics table also doesn't include this.
There was a problem hiding this comment.
will need another migration for metrics https://datadoghq.atlassian.net/browse/CLOUDPREM-556
There was a problem hiding this comment.
💡 Codex Review
quickwit/quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs
Lines 2419 to 2421 in 65e525e
time_range_end/window_start + window_duration_secs are stored as exclusive ends (e.g. extract_time_range writes end + 1), but this overlap predicate uses >= for the lower bound. That includes adjacent, non-overlapping splits whose end equals the query start (notably previous compaction windows), so list queries can return out-of-scope splits and drive compaction/planning on the wrong window.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .values() | ||
| .filter(|split| parquet_split_matches_query(split, query)) | ||
| .cloned() | ||
| .collect() | ||
| } |
There was a problem hiding this comment.
Enforce query limit in file-backed parquet split listing
The file-backed parquet listing path ignores ListParquetSplitsQuery.limit and always collects every matching split. Large indexes therefore return unbounded results (memory/latency regression) and behavior diverges from the Postgres implementation, which applies LIMIT.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b3444148c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| /// The staged and uploaded splits. | ||
| pub new_splits: Vec<MetricsSplitMetadata>, | ||
| /// The staged and uploaded splits (metrics or sketches). | ||
| pub new_splits: Vec<ParquetSplitMetadata>, |
There was a problem hiding this comment.
Route sketch updates through sketch publish RPC
Fresh evidence: this update is now documented as carrying sketch splits, but the only Handler<ParquetSplitsUpdate> in metrics_pipeline/publisher_impl.rs still constructs PublishMetricsSplitsRequest unconditionally. In a sketches-* pipeline, the uploader stages/uploads to sketch_splits, then the publisher calls publish_metrics_splits with those sketch IDs, so they are never published in the sketch table and the call can fail with NotFound; carry the split/index kind through this update and dispatch to publish_sketch_splits for sketch pipelines.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5bf77ba44
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let mut builder = MetricsSplitMetadata::builder() | ||
| let mut builder = match self.kind { | ||
| ParquetSplitKind::Metrics => ParquetSplitMetadata::metrics_builder(), | ||
| ParquetSplitKind::Sketches => ParquetSplitMetadata::sketches_builder(), |
There was a problem hiding this comment.
Populate sketch split metadata with non-service tags
When this new sketch writer path builds Sketches metadata, it still only adds the service tag below, so sketch batches that contain supported low-cardinality columns like env, datacenter, region, or host get staged with those metadata fields empty. Both Postgres and file-backed list_sketch_splits filter on metadata.low_cardinality_tags/the derived tag columns, so queries constrained by those tags will miss otherwise valid sketch splits; extract and add all supported low-cardinality tag columns when building sketch split metadata.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
quickwit/quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs
Lines 2432 to 2436 in cbd7561
For bounded parquet list queries, this upper-bound predicate is reversed: a query like [1500, 3500] against a split [1000, 2000) requires 3500 <= time_range_start, so the overlapping split is filtered out, while later non-overlapping splits can pass the lower-bound check. This affects the new shared Postgres list path used for sketch splits as well; the upper bound should constrain the split start against the query end (range_col_start <= $param, and the analogous operator for excluded bounds), matching the file-backed FilterRange::overlaps_with behavior.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f196d3a0c1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Ok(_) => { | ||
| debug!("committing transaction"); | ||
| tx.commit().await?; |
There was a problem hiding this comment.
Roll back partially staged parquet splits
For sketch staging requests where one split already exists outside Staged but another split is new, the INSERT ... ON CONFLICT ... WHERE ... RETURNING succeeds for only the new subset, and this transaction is committed before the later upserted_split_ids.len() check reports FailedPrecondition. The uploader then aborts before uploading files, leaving staged metadata for parquet files that were never uploaded; keep the returned-count validation inside the transaction so the partial insert rolls back.
Useful? React with 👍 / 👎.
Description
This PR can be reviewed commit by commit.
This PR updates the parquet pipeline to process DDSketches. See https://datadoghq.atlassian.net/wiki/spaces/QKHS/pages/6291357728/DDSketch+in+Parquet for more information about the DDSketch spec.
How was this PR tested?
Describe how you tested this PR.