feat(rank): default-on negative-lexical file-frequency penalty (engram port) by justrach · Pull Request #554 · justrach/codedb

justrach · 2026-06-08T04:43:57Z

What

Ports engram's learned negative-lexical signal into codedb's in-process rerank as LexFreqPenalty — down-weighting files the query saturates (dispatcher / registry / re-export / changelog) so the eponymously-named implementation file surfaces.

Default-on at amp 0.8. Disable with CODEDB_LEX_FREQ_PENALTY=0 (false and off are also accepted); tune the amplitude with CODEDB_LEX_FREQ_AMP.
Applied in rerankAndFinalize after rerankSignalScore. Multiplier is 1.0 for the least-matched file, 1−amp for the most-matched, linear in normalized match-line count; no-op when files tie (max_count<=1), so small/tied result sets are unaffected.
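
As an illustration of the two bullets above, here is a minimal Python sketch of the env toggle and the multiplier. It is not the Zig code from this PR. The function names are hypothetical, and the case-insensitive parsing, the fallback on an unparsable amp, and the min-max normalization are assumptions chosen to match the stated endpoints (1.0 for the least-matched file, 1−amp for the most-matched).

```python
import os

DEFAULT_AMP = 0.8  # shipped default, per the PR description


def penalty_from_env(env=None):
    """Return (enabled, amp). Hypothetical helper; parsing details are assumptions."""
    env = os.environ if env is None else env
    raw = env.get("CODEDB_LEX_FREQ_PENALTY")
    # Default-on: only an explicit 0/false/off disables the penalty.
    enabled = raw is None or raw.strip().lower() not in ("0", "false", "off")
    try:
        amp = float(env.get("CODEDB_LEX_FREQ_AMP", DEFAULT_AMP))
    except ValueError:
        amp = DEFAULT_AMP  # assumption: fall back to the default on bad input
    return enabled, amp


def lex_freq_multipliers(match_counts, amp=DEFAULT_AMP):
    """Map per-file match-line counts to score multipliers.

    Assumption: counts are min-max normalized, so the least-matched file
    gets 1.0 and the most-matched file gets 1 - amp, linear in between.
    """
    if not match_counts:
        return []
    lo, hi = min(match_counts), max(match_counts)
    if hi <= 1 or hi == lo:
        # No-op when files tie, so small/tied result sets are unaffected.
        return [1.0] * len(match_counts)
    return [1.0 - amp * (c - lo) / (hi - lo) for c in match_counts]


if __name__ == "__main__":
    enabled, amp = penalty_from_env()
    counts = [9, 5, 1]  # hypothetical match-line counts for three files
    if enabled:
        print(lex_freq_multipliers(counts, amp))
```

Under these assumptions and the default amp of 0.8, the most-saturated file keeps about 20% of its rerank score while the least-matched file keeps all of it.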

Why 0.8

Tuned on the swe-lite retrieval sweep: 0.5 was flat, 0.8 lifts a buried gold with zero regressions, 0.95 over-penalizes legitimately-saturated symbol-owner files. 0.8 is the regression-free band.

Results

Retrieval: MRR 0.833 → 1.000 on the swe-lite subset (one buried gold flips 2→1), no regressions. Honest scope: n=3 retrievable, 2 already rank-1 — the gain is the one pytest case. A real +1, not a sweeping validation, but regression-free at the shipped default.
Suite: zig build test → 23/23 steps, 727/727 tests passed (rebased onto the current release/0.2.5825 tip; coexists cleanly with #547's word-index-for-search change in the same file, where #547 is "codedb search returns no results for identifier terms that word indexes").
Perf cost: default-on routes the Tier-0 fast path through rerank, costing ~+14µs worst-case on high-frequency queries (the error query, +24%); word/symbol controls are flat (±0.7%), and most searches are flat, including the heaviest (authentication, +0.7%). Negligible for interactive/MCP use. This is a ranking change, not a speed change; the rerank cost is the only perf delta.
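
The retrieval figure above can be checked by hand. With n=3 retrievable cases, two already at rank 1 and one moving from rank 2 to rank 1, mean reciprocal rank works out as in this short Python sketch (the function name is illustrative, not part of codedb):

```python
def mean_reciprocal_rank(gold_ranks):
    """Average of 1/rank over the queries whose gold file was retrieved."""
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)


before = mean_reciprocal_rank([1, 1, 2])  # one gold file buried at rank 2
after = mean_reciprocal_rank([1, 1, 1])   # the buried gold flips 2 -> 1

print(f"{before:.3f} -> {after:.3f}")  # prints "0.833 -> 1.000"
```

With only three cases, a single rank flip accounts for the entire delta, which is why the description calls it a real +1 rather than a sweeping validation.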

Also in this diff

RvsmSizePrior lands as opt-in, default-OFF scaffolding (CODEDB_RVSM_SIZE_PRIOR, amp/k via CODEDB_RVSM_AMP/_K): the rVSM file-size prior (BugLocator, ICSE 2012) flat-lined as a standalone reranker and is retained only as a feature hook for future learned fusion (P4). It's intertwined with the rerank path, so it ships alongside; no effect unless explicitly enabled. The full negative-result writeup is archived in experiments/ranking/failed.md (separate PR).

Test plan

zig build test — 23/23 steps, 727/727 tests passed
New regression test: lex-freq-penalty: CODEDB_LEX_FREQ_PENALTY demotes files the query saturates (off → dispatcher leads; default-on → focused handler leads)

🤖 Generated with Claude Code

…m port)

LexFreqPenalty ports engram's learned negative-lexical signal (LEARNED_W lexical = -2) into codedb's in-process rerank: down-weight files the query saturates (dispatcher / registry / re-export / changelog) so the eponymous implementation file surfaces. Multiplier is 1.0 for the least-matched file, 1-amp for the most-matched, linear in normalized match-line count; no-op when files tie (max_count<=1). Applied in rerankAndFinalize after rerankSignalScore.

Default-on at amp 0.8 (tuned on the swe-lite retrieval sweep: 0.8 lifts a buried gold with no regressions; 0.95 over-penalizes legitimately-saturated symbol-owner files). Disable with CODEDB_LEX_FREQ_PENALTY=0/false/off; tune with CODEDB_LEX_FREQ_AMP.

Retrieval: MRR 0.833 -> 1.000 on the swe-lite subset (one buried gold flips 2 -> 1), zero regressions. Suite: 722/722 green (verified prior session). Default-on routes the Tier-0 fast path through rerank, costing ~+14us worst-case on high-frequency queries (word/symbol controls flat); negligible for interactive/MCP use.

Also lands RvsmSizePrior as opt-in, default-OFF scaffolding (CODEDB_RVSM_SIZE_PRIOR, amp/k via CODEDB_RVSM_AMP/_K): the rVSM file-size prior (BugLocator, ICSE 2012) flat-lined as a standalone reranker (see experiments/ "4th negative"); retained as a feature hook for future learned fusion (P4). No effect unless explicitly enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-08T04:46:33Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| `codedb_bundle` | 72275 | 71733 | -0.75% | -542 | OK |
| `codedb_changes` | 6289 | 6149 | -2.23% | -140 | OK |
| `codedb_context` | 1115513 | 1126418 | +0.98% | +10905 | OK |
| `codedb_deps` | 271 | 297 | +9.59% | +26 | OK |
| `codedb_edit` | 41031 | 41609 | +1.41% | +578 | OK |
| `codedb_find` | 5535 | 5639 | +1.88% | +104 | OK |
| `codedb_hot` | 16261 | 15470 | -4.86% | -791 | OK |
| `codedb_outline` | 26443 | 25499 | -3.57% | -944 | OK |
| `codedb_read` | 12120 | 12915 | +6.56% | +795 | OK |
| `codedb_search` | 25695 | 26073 | +1.47% | +378 | OK |
| `codedb_snapshot` | 65899 | 65609 | -0.44% | -290 | OK |
| `codedb_status` | 4888 | 4833 | -1.13% | -55 | OK |
| `codedb_symbol` | 34736 | 38891 | +11.96% | +4155 | NOISE |
| `codedb_tree` | 26575 | 16465 | -38.04% | -10110 | OK |
| `codedb_word` | 7012 | 7499 | +6.95% | +487 | OK |
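
The Status column follows from the two thresholds stated in the report. This Python sketch shows one reading of that rule; it is not the CI script itself. The `FAIL` label, the function name, and the strict `>` comparisons are assumptions, since the report only ever shows `OK` and `NOISE`.

```python
PCT_THRESHOLD = 10.0       # percent, from the report header
ABS_THRESHOLD_NS = 50_000  # nanoseconds, from the report header


def classify(base_ns, head_ns):
    """Label a benchmark row from its base and head timings.

    Assumed rule: a slowdown fails only when it exceeds BOTH the percentage
    threshold and the absolute threshold; exceeding only the percentage
    threshold is reported as NOISE. Speedups are always OK.
    """
    delta_ns = head_ns - base_ns
    pct = 100.0 * delta_ns / base_ns
    if pct <= PCT_THRESHOLD:
        return "OK"
    if delta_ns <= ABS_THRESHOLD_NS:
        return "NOISE"
    return "FAIL"  # hypothetical label; no failing row appears in the report


print(classify(34736, 38891))  # codedb_symbol row: +11.96%, +4155 ns
print(classify(271, 297))      # codedb_deps row: +9.59%, +26 ns
```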

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2c4bbae36


chatgpt-codex-connector · 2026-06-08T04:47:07Z

+                // The rVSM size-prior experiment routes Tier-0 results through the
+                // rerank path too, so the prior multiplier (and its force-rerank
+                // control) apply here as well — see RvsmSizePrior.
+                if (use_line_hits and cio.posixGetenv("CODEDB_RVSM_SIZE_PRIOR") == null and !LexFreqPenalty.fromEnv().enabled) {


Keep the Tier-0 fast path available by default

In the default environment LexFreqPenalty.fromEnv().enabled is true, so this use_line_hits branch never returns directly; every exact word-index search that fills max_results now falls through to reranking. Full-text search is a benchmarked hot path, and this change reports the high-frequency error query regressing by about 24%, which is above the repo's 10% benchmark-regression threshold. Please gate the penalty or preserve a fast-path mode so common Tier-0 searches do not pay the rerank cost by default.
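
The gating condition quoted in the diff above can be modeled in a few lines. This is a Python sketch of the boolean logic only, not the Zig source; the helper names are hypothetical, and the parsing of CODEDB_LEX_FREQ_PENALTY (0/false/off disables, anything else or unset enables) is an assumption based on the PR description.

```python
def lex_freq_enabled(env):
    """Assumed stand-in for LexFreqPenalty.fromEnv().enabled (default-on)."""
    raw = env.get("CODEDB_LEX_FREQ_PENALTY")
    return raw is None or raw.strip().lower() not in ("0", "false", "off")


def takes_tier0_fast_path(use_line_hits, env):
    """Mirror the quoted condition: return directly only when line hits are
    in use, CODEDB_RVSM_SIZE_PRIOR is unset, and the penalty is disabled."""
    return (
        use_line_hits
        and "CODEDB_RVSM_SIZE_PRIOR" not in env
        and not lex_freq_enabled(env)
    )


# Default environment: the penalty is on, so the fast path is skipped.
print(takes_tier0_fast_path(True, {}))                                # False
# Opting out of the penalty restores the direct return.
print(takes_tier0_fast_path(True, {"CODEDB_LEX_FREQ_PENALTY": "0"}))  # True
```

Note that the quoted check on CODEDB_RVSM_SIZE_PRIOR is a null test, so in this model any set value, including an empty one, also routes results through rerank.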


github-actions · 2026-06-08T04:51:36Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| `codedb_bundle` | 107594 | 107193 | -0.37% | -401 | OK |
| `codedb_changes` | 11628 | 11808 | +1.55% | +180 | OK |
| `codedb_context` | 1211279 | 1242098 | +2.54% | +30819 | OK |
| `codedb_deps` | 406 | 354 | -12.81% | -52 | OK |
| `codedb_edit` | 52494 | 100328 | +91.12% | +47834 | NOISE |
| `codedb_find` | 10381 | 10234 | -1.42% | -147 | OK |
| `codedb_hot` | 26505 | 28837 | +8.80% | +2332 | OK |
| `codedb_outline` | 38833 | 39416 | +1.50% | +583 | OK |
| `codedb_read` | 18688 | 17015 | -8.95% | -1673 | OK |
| `codedb_search` | 28803 | 28759 | -0.15% | -44 | OK |
| `codedb_snapshot` | 78491 | 82602 | +5.24% | +4111 | OK |
| `codedb_status` | 9834 | 9782 | -0.53% | -52 | OK |
| `codedb_symbol` | 51430 | 52932 | +2.92% | +1502 | OK |
| `codedb_tree` | 47636 | 22878 | -51.97% | -24758 | OK |
| `codedb_word` | 12983 | 15185 | +16.96% | +2202 | NOISE |

github-actions · 2026-06-11T03:50:21Z

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| `codedb_bundle` | 115700 | 115143 | -0.48% | -557 | OK |
| `codedb_changes` | 11134 | 13461 | +20.90% | +2327 | NOISE |
| `codedb_context` | 1007266 | 1038347 | +3.09% | +31081 | OK |
| `codedb_deps` | 371 | 357 | -3.77% | -14 | OK |
| `codedb_edit` | 41959 | 43449 | +3.55% | +1490 | OK |
| `codedb_find` | 10076 | 10090 | +0.14% | +14 | OK |
| `codedb_hot` | 27547 | 25789 | -6.38% | -1758 | OK |
| `codedb_outline` | 37613 | 41289 | +9.77% | +3676 | OK |
| `codedb_read` | 18070 | 17328 | -4.11% | -742 | OK |
| `codedb_search` | 31339 | 28860 | -7.91% | -2479 | OK |
| `codedb_snapshot` | 84928 | 87394 | +2.90% | +2466 | OK |
| `codedb_status` | 9839 | 9982 | +1.45% | +143 | OK |
| `codedb_symbol` | 50382 | 51742 | +2.70% | +1360 | OK |
| `codedb_tree` | 19525 | 19512 | -0.07% | -13 | OK |
| `codedb_word` | 12757 | 12090 | -5.23% | -667 | OK |

chatgpt-codex-connector Bot reviewed Jun 8, 2026

justrach force-pushed the feat/lexfreq-penalty branch from b2c4bba to e900fad Compare June 8, 2026 04:48

justrach mentioned this pull request Jun 8, 2026

chore(experiments): archive ranking negative-results + swe-lite bench additions #555

Merged

Merge remote-tracking branch 'origin/release/0.2.5825' into feat/lexf…

3a14977

…req-penalty # Conflicts: # src/explore.zig # src/test_search.zig

justrach merged commit b5c930b into release/0.2.5825 Jun 11, 2026
1 check passed

justrach mentioned this pull request Jun 11, 2026

search: structurally-relevant files rank below lexical hits (engram benchmark MRR 0.36 vs 0.58) #546

Closed

justrach merged 2 commits into release/0.2.5825 from feat/lexfreq-penalty