feat(rank): default-on negative-lexical file-frequency penalty (engram port) #554
…m port)

LexFreqPenalty ports engram's learned negative-lexical signal (LEARNED_W lexical = -2) into codedb's in-process rerank: down-weight files the query saturates (dispatcher / registry / re-export / changelog) so the eponymous implementation file surfaces. Multiplier is 1.0 for the least-matched file, 1-amp for the most-matched, linear in normalized match-line count; no-op when files tie (max_count<=1). Applied in rerankAndFinalize after rerankSignalScore.

Default-on at amp 0.8 (tuned on the swe-lite retrieval sweep: 0.8 lifts a buried gold with no regressions; 0.95 over-penalizes legitimately-saturated symbol-owner files). Disable with CODEDB_LEX_FREQ_PENALTY=0/false/off; tune with CODEDB_LEX_FREQ_AMP.

Retrieval: MRR 0.833 -> 1.000 on the swe-lite subset (one buried gold flips 2 -> 1), zero regressions. Suite: 722/722 green (verified prior session). Default-on routes the Tier-0 fast path through rerank, costing ~+14us worst-case on high-frequency queries (word/symbol controls flat); negligible for interactive/MCP use.

Also lands RvsmSizePrior as opt-in, default-OFF scaffolding (CODEDB_RVSM_SIZE_PRIOR, amp/k via CODEDB_RVSM_AMP/_K): the rVSM file-size prior (BugLocator, ICSE 2012) flat-lined as a standalone reranker (see experiments/ "4th negative"); retained as a feature hook for future learned fusion (P4). No effect unless explicitly enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
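As a sanity check on the retrieval figures above: the subset size is not stated in the commit message, but the numbers are consistent with three queries. That count is an inference, not something the PR reports. Since the final MRR is 1.000 and only one gold moved, every other gold was already at rank 1, so for a subset of N queries:

```latex
\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i}
\qquad
\mathrm{MRR}_{\text{before}} = \frac{(N-1)\cdot 1 + \tfrac{1}{2}}{N} = 1 - \frac{1}{2N}

1 - \frac{1}{2N} = 0.833 \;\Rightarrow\; N = 3,
\qquad
\mathrm{MRR}_{\text{before}} = \frac{1 + 1 + \tfrac{1}{2}}{3} \approx 0.833,
\quad
\mathrm{MRR}_{\text{after}} = \frac{1 + 1 + 1}{3} = 1.000
```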
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2c4bbae36
// The rVSM size-prior experiment routes Tier-0 results through the
// rerank path too, so the prior multiplier (and its force-rerank
// control) apply here as well — see RvsmSizePrior.
if (use_line_hits and cio.posixGetenv("CODEDB_RVSM_SIZE_PRIOR") == null and !LexFreqPenalty.fromEnv().enabled) {
Keep the Tier-0 fast path available by default
In the default environment LexFreqPenalty.fromEnv().enabled is true, so this use_line_hits branch never returns directly; every exact word-index search that fills max_results now falls through to reranking. Full-text search is a benchmarked hot path, and this change reports the high-frequency error query regressing by about 24%, which is above the repo's 10% benchmark-regression threshold. Please gate the penalty or preserve a fast-path mode so common Tier-0 searches do not pay the rerank cost by default.
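To make the default-environment behaviour concrete, here is a minimal sketch of how the flag and the fast-path condition interact. `LexFreqSettings`, `settingsFromValues`, and `tier0FastPathAllowed` are hypothetical names, not codedb's actual code; the exact-match parsing and the clamp of amp to [0, 1] are assumptions, since the PR only names the accepted values `0`/`false`/`off` and the default of 0.8.

```zig
const std = @import("std");

/// Hypothetical stand-in for the settings the PR describes;
/// codedb's real LexFreqPenalty type may look different.
const LexFreqSettings = struct {
    enabled: bool,
    amp: f32,
};

/// Builds settings from raw environment values (null means "unset").
/// Assumption: only the exact strings "0", "false", "off" disable the penalty.
fn settingsFromValues(penalty: ?[]const u8, amp_raw: ?[]const u8) LexFreqSettings {
    // Defaults taken from the PR text: on, amp 0.8.
    var s = LexFreqSettings{ .enabled = true, .amp = 0.8 };
    if (penalty) |v| {
        if (std.mem.eql(u8, v, "0") or
            std.mem.eql(u8, v, "false") or
            std.mem.eql(u8, v, "off"))
        {
            s.enabled = false;
        }
    }
    if (amp_raw) |v| {
        if (std.fmt.parseFloat(f32, v)) |parsed| {
            // Assumption: keep amp in [0, 1] so the multiplier stays non-negative.
            s.amp = std.math.clamp(parsed, 0.0, 1.0);
        } else |_| {
            // Unparseable value: keep the default.
        }
    }
    return s;
}

/// Mirrors the shape of the condition in the diff: the Tier-0 early return
/// is taken only when neither reranker is active.
fn tier0FastPathAllowed(use_line_hits: bool, rvsm_env_set: bool, s: LexFreqSettings) bool {
    return use_line_hits and !rvsm_env_set and !s.enabled;
}
```

With both variables unset, `settingsFromValues(null, null).enabled` is true, so `tier0FastPathAllowed` is false for every query. That is the behaviour this comment objects to.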
b2c4bba to e900fad
…req-penalty
# Conflicts:
#	src/explore.zig
#	src/test_search.zig
What
Ports engram's learned negative-lexical signal into codedb's in-process rerank as LexFreqPenalty, down-weighting files the query saturates (dispatcher / registry / re-export / changelog) so the eponymously-named implementation file surfaces.
- Default-on at amp 0.8. Disable with CODEDB_LEX_FREQ_PENALTY=0 (/false/off); tune with CODEDB_LEX_FREQ_AMP.
- Applied in rerankAndFinalize after rerankSignalScore. Multiplier is 1.0 for the least-matched file, 1−amp for the most-matched, linear in normalized match-line count; no-op when files tie (max_count<=1), so small/tied result sets are unaffected.
Why 0.8
Tuned on the swe-lite retrieval sweep: 0.5 was flat, 0.8 lifts a buried gold with zero regressions, 0.95 over-penalizes legitimately-saturated symbol-owner files. 0.8 is the regression-free band.
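The multiplier described in this PR can be sketched as follows. This is an illustration under stated assumptions, not the code in src/explore.zig: `lexFreqMultiplier` is a hypothetical name, and "normalized match-line count" is read here as min-max normalization over the result set, which is one way to get exactly 1.0 for the least-matched file. The extra guard for `max_count <= min_count` is likewise an assumption to avoid dividing by zero when all files tie.

```zig
const std = @import("std");

/// Hypothetical helper: score multiplier for one file, given how many lines
/// in it matched the query and the min/max match-line counts in the result set.
fn lexFreqMultiplier(count: u32, min_count: u32, max_count: u32, amp: f32) f32 {
    // No-op when there is nothing to separate (the PR's max_count<=1 case,
    // plus the all-files-tie case, which is an assumption of this sketch).
    if (max_count <= 1 or max_count <= min_count) return 1.0;

    // Keep out-of-range counts from producing a multiplier outside [1-amp, 1].
    const clamped = std.math.clamp(count, min_count, max_count);
    const span: f32 = @floatFromInt(max_count - min_count);
    const offset: f32 = @floatFromInt(clamped - min_count);

    // Linear from 1.0 (least-matched) down to 1 - amp (most-matched).
    return 1.0 - amp * (offset / span);
}
```

At amp 0.8 with counts ranging from 2 to 12, the 2-line file keeps its full score, a 7-line file is scaled by 0.6, and the 12-line file by 0.2. At amp 0.95 the most-matched file would keep only 5% of its score, which fits the over-penalization reported for that setting.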
Results
- zig build test → 23/23 steps, 727/727 tests passed (rebased onto the current release/0.2.5825 tip; coexists cleanly with #547's word-index-for-search change in the same file; #547 is "codedb search returns no results for identifier terms that word indexes").
- Perf: the rerank cost shows up on high-frequency queries (error +24%); word/symbol controls flat (±0.7%), most searches flat incl. the heaviest (authentication, +0.7%). Negligible for interactive/MCP use. This is a ranking change, not a speed change: the rerank cost is the only perf delta.
Also in this diff
RvsmSizePrior lands as opt-in, default-OFF scaffolding (CODEDB_RVSM_SIZE_PRIOR, amp/k via CODEDB_RVSM_AMP/_K): the rVSM file-size prior (BugLocator, ICSE 2012) flat-lined as a standalone reranker and is retained only as a feature hook for future learned fusion (P4). It's intertwined with the rerank path, so it ships alongside; no effect unless explicitly enabled. The full negative-result writeup is archived in experiments/ranking/failed.md (separate PR).
Test plan
- zig build test: 23/23 steps, 727/727 tests passed
- lex-freq-penalty: CODEDB_LEX_FREQ_PENALTY demotes files the query saturates (off → dispatcher leads; default-on → focused handler leads)
🤖 Generated with Claude Code