
perf(query): fuse no-WHERE multi-key count-only group-by #218

Merged
singaraiona merged 2 commits into perf/clickbench-improvements from serhii/improve1 on Jun 2, 2026

Conversation

@ser-vasilich
Collaborator

@ser-vasilich ser-vasilich commented May 29, 2026

Relax the fused group-by planner gate so that a no-WHERE, multi-key,
count-only query routes onto exec_filtered_group_multi instead of
the unfused exec_group radix path. ray_filtered_group already
accepts a NULL predicate (the worker runs with a const-true mask); the
only blocker was the where_expr && term in the gate.

The gate now fires for no-WHERE queries only when n_keys >= 2 && has_only_count.
Single-key no-WHERE and multi-agg over near-unique composites stay on
exec_group: at very high cardinality the radix path's
per-(worker, partition) scatter beats a single linear-probe shard.
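
A minimal sketch of the gate change described above, not the actual planner code. The struct group_plan_t and the functions fused_gate_old / fused_gate_new are hypothetical names introduced for illustration; only where_expr, n_keys and has_only_count come from the description. Any other conditions the real gate checks are omitted, so the WHERE branch is modelled as simply passing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the planner's view of a group-by query. */
typedef struct {
    const void *where_expr;   /* NULL when the query has no WHERE clause */
    int         n_keys;       /* number of group-by keys */
    bool        has_only_count; /* true when count is the only aggregate */
} group_plan_t;

/* Before the change (as modelled here): a WHERE clause was required. */
static bool fused_gate_old(const group_plan_t *p) {
    assert(p != NULL);
    return p->where_expr != NULL;
}

/* After the change: no-WHERE is also allowed, but only for the
 * multi-key, count-only shape. */
static bool fused_gate_new(const group_plan_t *p) {
    assert(p != NULL);
    if (p->where_expr != NULL)
        return true;  /* unchanged: WHERE queries were already eligible */
    return p->n_keys >= 2 && p->has_only_count;
}
```

Under this model, a no-WHERE query with two keys and only a count passes the new gate but not the old one, while single-key and multi-agg no-WHERE queries are rejected by both and stay on exec_group.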

Follow-up commit: narrow I64 results of known-small temporal extracts
(minute / hh / ss / dd / dow / mm / doy / yyyy) to I16 before adding
them to the table. This brings q18's composite under the 16-byte
mk_compile budget so it fuses too.

ClickBench 10M:

  • q16 744 → 154 ms
  • q18 1748 → 449 ms
  • total 8.0 → 5.2 s

ser-vasilich and others added 2 commits May 30, 2026 15:12
The fused multi-key path already accepts a NULL predicate; only the
planner gate required where_expr.  Allow no-WHERE when n_keys >= 2 AND
count-only.  Single-key no-WHERE and multi-agg over near-unique
composites stay on exec_group's radix — fusing them regresses at very
high cardinality.

ClickBench 10M:
  q16  744 → 154 ms
  total 8.0 → 7.3 s

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mk_compile packs the composite by-key into a 16-byte slot.  An I64
column for minute() (values 0..59) blows the budget on q18's
{UserID, minute, SearchPhrase} composite (~20 bytes) and the query
drops to exec_group.
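
The budget arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the mk_compile logic: the names MK_SLOT_BYTES, composite_width and fits_slot are hypothetical, and the 4-byte width for SearchPhrase is an assumption chosen only because it reproduces the "~20 bytes" figure (8 + 8 + 4). The 8-byte I64 and 2-byte I16 widths follow from the type names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The 16-byte composite-key budget named in the commit message. */
enum { MK_SLOT_BYTES = 16 };

/* Sum the per-key byte widths of a composite group-by key. */
static size_t composite_width(const size_t *widths, size_t n) {
    assert(widths != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += widths[i];
    return total;
}

/* True when the composite fits in one slot and can take the fused path. */
static bool fits_slot(const size_t *widths, size_t n) {
    return composite_width(widths, n) <= MK_SLOT_BYTES;
}
```

With the assumed widths, {8, 8, 4} sums to 20 bytes and misses the budget, while {8, 2, 4} (minute narrowed to I16) sums to 14 bytes and fits.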

After eval'ing a computed by-val whose AST head is minute / hh / ss /
dd / dow / mm / doy / yyyy, downcast the I64 result to I16 before
adding it to the table.  I16 is the smallest type that holds every
output range (year up to 32767, doy up to 366) and still prints as
decimal (U8 prints hex, unreadable for a minute value).

Skipped when the source column has nulls.
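
A rough sketch of the narrowing step, assuming plain C arrays rather than the engine's column type. The function names is_known_small_extract and narrow_i64_to_i16 are hypothetical; the list of heads and the skip-on-nulls rule come from the commit message. The per-value range check is an extra defensive step added here, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extract heads the commit message lists as producing small values. */
static bool is_known_small_extract(const char *head) {
    static const char *const heads[] = {
        "minute", "hh", "ss", "dd", "dow", "mm", "doy", "yyyy"
    };
    assert(head != NULL);
    for (size_t i = 0; i < sizeof heads / sizeof heads[0]; i++)
        if (strcmp(head, heads[i]) == 0)
            return true;
    return false;
}

/* Copy src into dst as I16. Returns false (dst untouched) when the
 * source has nulls or, as an added safety check, when any value does
 * not fit in int16_t. */
static bool narrow_i64_to_i16(const int64_t *src, size_t n,
                              bool has_nulls, int16_t *dst) {
    if (has_nulls)
        return false;  /* skipped when the source column has nulls */
    for (size_t i = 0; i < n; i++)
        if (src[i] < INT16_MIN || src[i] > INT16_MAX)
            return false;
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)src[i];
    return true;
}
```

Checking the whole range before writing keeps dst unmodified on failure, so a caller could fall back to the original I64 column without cleanup.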

ClickBench 10M:
  q18  1748 → 449 ms
  total 6.6 → 5.2 s

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@singaraiona singaraiona merged commit 04884dc into perf/clickbench-improvements Jun 2, 2026
ser-vasilich added a commit that referenced this pull request Jun 3, 2026
…verse

Following the rebase onto PR #218/#219/#220 master (new attribute
system, asof fast-path, RAY_IDX_PART, HLL routing, MG top-K for
TIMESTAMP), targeted the still-large branch-coverage gaps:

  query.c         62.54 → 63.54% (+1.00pp, -107 missed)
  fused_group.c   65.69 → 67.26% (+1.57pp, -55 missed)
  group.c         67.50 → 67.95% (+0.45pp, -39 missed)
  traverse.c      60.16 → 60.68% (+0.52pp, -12 missed)
  eval.c          60.73 → 60.87% (+0.14pp, -4 missed)

Additions:
- query_branch_cov.rfl    +670 lines (§19-§63: 2-stage count-distinct
  rewrite for I64/I32/TIMESTAMP, match_group_desc_count_take per-op,
  wide-key fused, asof wrapper, narrow_known_small_extract, HLL
  inner-type cascade, prefilter computed-by + WHERE + desc:count)
- fused_group_branch_cov.rfl  +190 lines + 1156 C lines (chunk_zone
  fast path EQ/GT/LT/NE/LE/GE, IN/EQ masked dispatch, BOOL/SYM key
  topk, U8/I16 hash-eq kbits, strlen agg input)
- group_branch_cov.rfl    +488 lines §21-§38 (maxmin/pearson rowform
  with null x/y/k, per-partition STDDEV/VAR/FIRST/LAST, multi-key
  heavy-hitter, v2 multi-key TIMESTAMP+I64 / DATE+TIME+I64,
  count_distinct STR/GUID/LIST, accum_from_entry skip path)
- eval_branch_cov.rfl     +300 lines §9-§30 (OP_STOREGLOBAL error,
  lambda dispatch errors, try handler dispatch, try_sum_affine bail
  paths, nested-try depth, raise vec/dict/table payload survival)
- test_traverse.c         +760 lines / 18 C tests (A* relax fail,
  cluster_coeff parallel/asym, SIP dir2 neg/oob src, betweenness/
  closeness sample-clamp)
- traverse_branch_cov.rfl +311 lines (bidirectional cliques, parallel
  edges, K4, disjoint comps, diamond/2-cycle/back-edge fixtures)

Suite: 3231 of 3233 pass under ASan+UBSan. Unreachable branches
documented inline per file (OOM-injection, VM trap stack, restricted-
mode, MAPCOMMON/PARTED I/O-only, CSR invariants).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>