fix(#598): pre-boost occurrence cap stops mention-dense tooling files saturating the path prior #599
The x0.5 tooling prior is multiplicative against the raw per-line count, so a bench script repeating a term six times (6.0x0.5=3.0) still beat the implementation's 2.0. Cap the occurrence base at 2.0 for tooling paths BEFORE the stem/symbol boosts: density cannot dominate, while an eponymous lookup (query 'install' -> install/install.sh) keeps its +15 stem boost and still ranks first. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e8c65a20c
    const is_tooling_path = pathHasSegment(r.path, "bench") or pathHasSegment(r.path, "benchmarks") or
        pathHasSegment(r.path, "scripts") or pathHasSegment(r.path, "website") or
        pathHasSegment(r.path, "install");
    if (is_tooling_path) score = @min(score, 2.0);
Apply the cap to the MCP fast path too
This cap only affects results that flow through searchContentAuto/rerankSignalScore; the MCP handler first tries renderPlainSearch for single-token searches with no glob/compact options (src/mcp.zig:1777-1782), and that renderer has its own Tier 0 ordering based on raw hit counts without this tooling cap. In that common codedb_search path, a dense bench/... or benchmarks/... file still sorts ahead of an implementation file for queries like capture, so the issue this change is meant to fix remains visible to MCP clients unless the same cap/prior is applied there or the fast path is bypassed.
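One way to read this suggestion is as a shared sort key that the fast path could use in place of the raw hit count. The following is a minimal sketch under stated assumptions, not the code in `src/mcp.zig`: `fastPathKey` and `isToolingPath` are hypothetical names, the `pathHasSegment` body is a stand-in for the repo's helper of the same name, and `src/capture.zig` is an invented example path. Applying the ×0.5 prior as well as the cap is also an assumption; the fast path might warrant the cap alone.

```zig
const std = @import("std");

/// Stand-in for the repo's `pathHasSegment` (assumed behavior): true when
/// `segment` equals one whole '/'-separated component of `path`.
fn pathHasSegment(path: []const u8, segment: []const u8) bool {
    var start: usize = 0;
    var i: usize = 0;
    while (i <= path.len) : (i += 1) {
        if (i == path.len or path[i] == '/') {
            if (std.mem.eql(u8, path[start..i], segment)) return true;
            start = i + 1;
        }
    }
    return false;
}

/// Hypothetical helper: the same segment list as the reviewed diff.
fn isToolingPath(path: []const u8) bool {
    const segments = [_][]const u8{ "bench", "benchmarks", "scripts", "website", "install" };
    for (segments) |seg| {
        if (pathHasSegment(path, seg)) return true;
    }
    return false;
}

/// Hypothetical sort key for a raw-hit-count ordering: tooling paths are
/// capped at 2.0 and then halved, all other paths keep their raw count.
fn fastPathKey(path: []const u8, raw_hits: f32) f32 {
    var key = raw_hits;
    if (isToolingPath(path)) key = @min(key, 2.0) * 0.5;
    return key;
}

test "capped key ranks a 2-hit source file above a 6-hit bench file" {
    const bench = fastPathKey("benchmarks/search-shootout/shootout.py", 6.0);
    const impl = fastPathKey("src/capture.zig", 2.0);
    try std.testing.expect(impl > bench);
}
```

Sorting fast-path results by this key, descending, would give both search paths the same treatment of tooling files. Bypassing the fast path for tooling-heavy result sets is the alternative the comment mentions.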
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Fixes #598 — the last evidenced ranking failure mode from the audit rounds.
The ×0.5 tooling prior (#557) is multiplicative against the raw per-line occurrence count, so density shrugs it off: live, `codedb search capture` returned `benchmarks/search-shootout/shootout.py` in every top-8 slot. A naive total-score cap (the doc-penalty approach) would destroy eponymy: `codedb search install` must keep ranking `install/install.sh` first.

Fix: cap the occurrence BASE at 2.0 for tooling paths before the stem/symbol boosts. Density can't dominate (6 mentions → 2.0×0.5=1.0, below any 2-mention source line), while the +15 stem boost applies after the cap so eponymous lookups still win (cap 2 + 15 = 17 → ×0.5 = 8.5).
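The arithmetic above can be checked with a small model. This is a simplified sketch, not the actual `rerankSignalScore` code: `modelScore` and the three constants are hypothetical names, and the order of operations (cap, then stem boost, then prior) is taken from the numbers quoted in this description.

```zig
const std = @import("std");

// Hypothetical constants mirroring the numbers quoted in the description.
const tooling_cap: f32 = 2.0; // occurrence cap for tooling paths
const stem_boost: f32 = 15.0; // eponymous stem boost
const tooling_prior: f32 = 0.5; // multiplicative tooling prior (#557)

/// Simplified model of the ordering: cap, then boost, then prior.
/// `modelScore` is an illustrative name, not a function from the codebase.
fn modelScore(occurrences: f32, is_tooling_path: bool, stem_match: bool) f32 {
    var score = occurrences;
    if (is_tooling_path) score = @min(score, tooling_cap); // cap BEFORE boosts
    if (stem_match) score += stem_boost; // boost applies after the cap
    if (is_tooling_path) score *= tooling_prior; // prior applies last
    return score;
}

test "dense tooling file loses, eponymous lookup still wins" {
    const dense_bench = modelScore(6.0, true, false); // min(6, 2) * 0.5 = 1.0
    const impl_line = modelScore(2.0, false, false); // 2.0: no cap, no prior
    const eponymous = modelScore(6.0, true, true); // (2 + 15) * 0.5 = 8.5

    try std.testing.expectEqual(@as(f32, 1.0), dense_bench);
    try std.testing.expectEqual(@as(f32, 2.0), impl_line);
    try std.testing.expectEqual(@as(f32, 8.5), eponymous);
    try std.testing.expect(dense_bench < impl_line);
    try std.testing.expect(eponymous > impl_line);
}
```

Saved to a file, the block runs with `zig test`. Without the cap, the same 6-mention tooling line would score 6.0 × 0.5 = 3.0 in this model and outrank the 2.0 implementation line, which is the failure described in #598.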
Failing test committed red (dense bench file ranked first; eponymy case pinned in the same test). Full suite green.
🤖 Generated with Claude Code