feat(#546,#550): defines-first tier0 candidates + call-graph distance ranking by justrach · Pull Request #608 · justrach/codedb

justrach · 2026-06-11T09:05:44Z

First structural slice of the ranking lane (#546 / #550). Does not close #550 (git co-change still to come); #546 is closed manually with this PR's evidence.

1. Defines-first Tier 0 candidate selection (the #546 starvation fix)

Previously, Tier 0 ordered candidate files by raw hit count with a max_results/5 per-file cap, so five mention-dense test files could consume the entire budget. Concretely, on the current index, codedb search indexFile returned neither src/explore.zig nor src/index.zig anywhere in 200 results; the defining files were structurally invisible no matter what the reranker did.

Files that define a symbol named by the query now class above other code files, which in turn class above docs. The check is a case-insensitive outline scan, so it also works on snapshot fast-load, where symbol_index is deferred. After the change, src/index.zig (4 indexFile definitions) leads, with src/explore.zig:906 pub fn indexFile(...) right behind.
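The three-class ordering described above can be sketched roughly as follows. This is a minimal illustration, not codedb's implementation: the `Candidate` type, the class constants, the function names, and the choice to break ties by hit count within a class are all assumptions, and the hit counts in the usage lines are made up.

```python
from dataclasses import dataclass, field

# Class ranks: lower sorts first. Names are illustrative, not codedb's.
DEFINES, CODE, DOCS = 0, 1, 2


@dataclass
class Candidate:
    path: str
    hits: int                                   # raw lexical hit count
    outline_symbols: list = field(default_factory=list)  # names from the outline
    is_doc: bool = False


def candidate_class(c, query_terms):
    """Case-insensitive outline scan: does this file define a queried symbol?"""
    terms = {t.lower() for t in query_terms}
    if any(s.lower() in terms for s in c.outline_symbols):
        return DEFINES
    return DOCS if c.is_doc else CODE


def order_tier0(cands, query_terms):
    # Assumption: within a class, fall back to raw hit count (descending).
    return sorted(cands, key=lambda c: (candidate_class(c, query_terms), -c.hits))


# Hypothetical hit counts, chosen only to show a mention-dense test file
# no longer outranking the defining files.
cands = [
    Candidate("src/tests.zig", 40, ["testIndexing"]),
    Candidate("README.md", 3, [], is_doc=True),
    Candidate("src/explore.zig", 2, ["indexFile"]),
    Candidate("src/index.zig", 5, ["indexFile"]),
]
ordered = [c.path for c in order_tier0(cands, {"indexfile"})]
print(ordered)
```

With a pure hit-count ordering the test file would lead; with the class key it sorts after both defining files.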

2. Query-specific call-graph distance (#550 signal 1)

Undirected BFS — callers and callees — from every call-graph node whose name equals a query term; each reached file's min hop distance becomes a ≥1 multiplier (×1.5 at the definition, halving per hop, ×1.0 unreached) in both searchContentRanked and rerankAndFinalize. Never a filter, mirroring centralityBoost.
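A rough sketch of the walk and the multiplier, under stated assumptions: the graph representation and all function names here are hypothetical, and "halving per hop" is read as the 0.5 bonus halving with each hop (×1.5, ×1.25, ×1.125, ...), which is one interpretation consistent with a multiplier that never drops below 1.

```python
from collections import deque


def min_hops_per_file(edges, node_file, node_name, query_terms):
    """Multi-source undirected BFS from every node named by a query term.

    edges: iterable of (caller, callee) node ids
    node_file / node_name: dicts mapping node id -> file path / symbol name
    Returns {file: minimum hop distance} for reached files only.
    """
    adj = {}
    for caller, callee in edges:
        # Undirected: follow both caller->callee and callee->caller.
        adj.setdefault(caller, set()).add(callee)
        adj.setdefault(callee, set()).add(caller)

    seeds = [n for n, name in node_name.items() if name in query_terms]
    dist = {n: 0 for n in seeds}
    queue = deque(seeds)
    while queue:
        n = queue.popleft()
        for m in adj.get(n, ()):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)

    hops = {}
    for n, d in dist.items():
        f = node_file[n]
        hops[f] = min(d, hops.get(f, d))
    return hops


def distance_multiplier(hops):
    """Assumed shape: 1 + 0.5 / 2**hops; unreached files (None) get 1.0."""
    return 1.0 if hops is None else 1.0 + 0.5 / (2 ** hops)


# Hypothetical graph: helper() calls indexFile(), indexFile() calls commit().
node_name = {0: "indexFile", 1: "helper", 2: "commit", 3: "unrelated"}
node_file = {0: "src/index.zig", 1: "src/helper.zig",
             2: "src/store.zig", 3: "src/other.zig"}
edges = [(1, 0), (0, 2)]

hops = min_hops_per_file(edges, node_file, node_name, {"indexFile"})
boosts = {f: distance_multiplier(hops.get(f)) for f in node_file.values()}
print(boosts)
```

Here src/helper.zig is reachable only through the reverse (caller) edge, which is the case a callee-only walk would miss. Because the multiplier is always at least 1.0, unreached files are never filtered out, only left unboosted.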

Cost discipline:

Graph builds only when a raw query token exactly names a known symbol — NL-only queries never pay.
Identifier-shaped queries ensure the #564-deferred symbol index pre-shared-lock (mirrors the lazy word-index rebuild); plain-word queries skip that too. (#564: "perf: snapshot fast-load eagerly builds the symbol index, 33% of load time and ~43MB heap that plain search never uses".)
CODEDB_NO_GRAPH_DISTANCE disables the signal; CODEDB_NO_CENTRALITY disables both graph signals (ensureCallGraph would repopulate call_centrality).
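The gating rules above could look something like the following. Only the two environment variable names come from this PR; the function name, the return shape, and the assumption that merely setting a variable (to any value) disables the signal are illustrative.

```python
import os


def graph_signals(query_tokens, known_symbols, env=os.environ):
    """Return (use_centrality, use_graph_distance) for one query.

    Assumption: a variable counts as "set" if it is present at all.
    """
    if "CODEDB_NO_CENTRALITY" in env:
        # Disables both graph signals.
        return (False, False)
    use_distance = (
        "CODEDB_NO_GRAPH_DISTANCE" not in env
        # Build the graph only when a raw token exactly names a known symbol.
        and any(tok in known_symbols for tok in query_tokens)
    )
    return (True, use_distance)


known = {"indexFile", "searchContentRanked"}
print(graph_signals(["how", "does", "indexing", "work"], known, env={}))
print(graph_signals(["indexFile"], known, env={}))
print(graph_signals(["indexFile"], known, env={"CODEDB_NO_GRAPH_DISTANCE": "1"}))
print(graph_signals(["indexFile"], known, env={"CODEDB_NO_CENTRALITY": "1"}))
```

The natural-language query never triggers a graph build, which is the "NL-only queries never pay" property.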

Benchmark evidence (engram codedb-report, 30 git-derived queries, A/B against a clean release-tip binary)

| | baseline (`e78f704`) | this branch |
|---|---|---|
| gold not a lexical candidate | 6/30 | 5/30 |
| `indexFile` → src/explore.zig | MISS (not in 200) | #2 |
| aggregate MRR | 0.441/24 comparable (≈0.423 normalized to 25) | 0.416/25 |
| vs engram's structural reranker | 0.441 vs 0.293 | 0.416 vs 0.322 |

The ~2% aggregate dip is one test-file gold slipping a place: this repo's git gold is test-heavy (every fix ships a failing test) and the defines class deliberately prefers implementation files — the behavior an agent wants. #546's own evidence rows are either already fixed by prior slices (navigation/searchContent/retrieval #1, daemon #3) or stale gold (src/explore.zig no longer lexically contains "correctness"/"validation"/"prefilter" — no lexical ranker can surface it; that's the #550 co-change signal's job). The MRR gap #546 was filed on (engram 0.58 vs codedb 0.36) is now inverted (codedb 0.42 vs engram 0.29–0.32).
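The "normalized to 25" figure and the ~2% dip can be reproduced with simple arithmetic, assuming the normalization rescales the 24-query mean to 25 queries by counting the extra query as reciprocal rank 0 at baseline. The helper name is hypothetical.

```python
def renormalize_mrr(mrr, n_from, n_to):
    """Rescale a mean reciprocal rank over n_from queries to n_to queries,
    treating the added queries as contributing reciprocal rank 0."""
    return mrr * n_from / n_to


baseline_25 = renormalize_mrr(0.441, 24, 25)
branch_25 = 0.416
dip_pct = 100.0 * (baseline_25 - branch_25) / baseline_25

print(round(baseline_25, 3))  # 0.423
print(round(dip_pct, 1))      # 1.7
```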

Tests

New: issue-550: call-graph distance ranks structurally-near files above equal-lexical noise — pins the undirected walk (helper file is reached via the reverse edge only).
All ranking-pinning tests pass unchanged (issue-393/400, bm25-recall-*, issue-429/580/598, lex-freq).
Suite: 811/811.

🤖 Generated with Claude Code

…-graph distance

Two ranking changes attacking the same failure: structurally-relevant files losing to mention-dense lexical hits.

Tier 0 candidate selection (#546): files were ordered by raw hit count with a max_results/5 per-file cap, so five mention-dense test files consumed the whole budget; 'codedb search indexFile' did not surface src/explore.zig or src/index.zig anywhere in 200 results. Files that define a symbol named by the query (outline scan, fast-load safe) now class above other code, which classes above docs. The defining files rank first; the reranker's +5 def-line bonus then works with real candidates.

Call-graph distance (#550): undirected BFS (callers and callees) from every graph node whose name equals a query term, min hop distance per file, folded in as a >=1 multiplier (x1.5 at the definition, halving per hop) in both searchContentRanked and rerankAndFinalize. Gated so NL-only queries never pay: the graph builds only when a raw token exactly names a known symbol; identifier-shaped queries ensure the deferred (#564) symbol index pre-lock, mirroring the lazy word-index rebuild. CODEDB_NO_GRAPH_DISTANCE kills the signal; CODEDB_NO_CENTRALITY kills both graph signals.

Benchmark (engram codedb-report, 30 git-derived queries): rescues 'indexFile' -> src/explore.zig from not-a-candidate to #2; misses drop 6/30 to 5/30. Aggregate MRR 0.42 vs 0.42 baseline (one test-file gold slips one place: the defines class deliberately prefers impl files over the test files this repo's fix-with-failing-test history uses as gold). codedb now outranks engram's structural reranker (0.42 vs 0.29), inverting the gap #546 was filed on.

Suite: 811/811.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-11T09:08:55Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---:|---:|---:|---:|---|
| `codedb_bundle` | 112316 | 108820 | -3.11% | -3496 | OK |
| `codedb_changes` | 10055 | 10172 | +1.16% | +117 | OK |
| `codedb_context` | 1689965 | 1718927 | +1.71% | +28962 | OK |
| `codedb_deps` | 329 | 3268 | +893.31% | +2939 | NOISE |
| `codedb_edit` | 36072 | 39252 | +8.82% | +3180 | OK |
| `codedb_find` | 9203 | 9562 | +3.90% | +359 | OK |
| `codedb_hot` | 25603 | 26966 | +5.32% | +1363 | OK |
| `codedb_outline` | 40421 | 35171 | -12.99% | -5250 | OK |
| `codedb_read` | 17455 | 16934 | -2.98% | -521 | OK |
| `codedb_search` | 31027 | 27120 | -12.59% | -3907 | OK |
| `codedb_snapshot` | 67306 | 84447 | +25.47% | +17141 | NOISE |
| `codedb_status` | 11039 | 8950 | -18.92% | -2089 | OK |
| `codedb_symbol` | 51480 | 49026 | -4.77% | -2454 | OK |
| `codedb_tree` | 28509 | 26597 | -6.71% | -1912 | OK |
| `codedb_word` | 11949 | 10851 | -9.19% | -1098 | OK |
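The status column is consistent with a two-threshold rule like the sketch below. This is inferred from the report's own description, not taken from the CI script: the function name, the "FAIL" label, and the strictness of the comparisons are assumptions.

```python
def classify(base_ns, head_ns, pct_threshold=10.0, abs_threshold_ns=50_000):
    """Classify a benchmark delta using both thresholds.

    Only slowdowns can trip the rule; speedups are always OK.
    """
    delta = head_ns - base_ns
    pct = 100.0 * delta / base_ns
    if pct <= pct_threshold:
        return "OK"
    # Percentage threshold exceeded: fail only if the absolute delta is large too.
    return "FAIL" if delta > abs_threshold_ns else "NOISE"


print(classify(329, 3268))        # codedb_deps: +893.31%, +2939 ns
print(classify(67306, 84447))     # codedb_snapshot: +25.47%, +17141 ns
print(classify(40421, 35171))     # codedb_outline: -12.99% (a speedup)
print(classify(100_000, 200_000)) # hypothetical row that would fail CI
```

The first three calls reproduce the NOISE, NOISE, and OK statuses reported for those tools; the last row is invented to show the failing branch.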

justrach merged commit 033eb90 into release/0.2.5825 Jun 11, 2026
2 checks passed

justrach mentioned this pull request Jun 11, 2026

feat(#550): git co-change ranking signal #609

Merged

justrach merged 1 commit into release/0.2.5825 from feat/issue-550-graph-distance
