fix(#546): CLI multi-word ranked search end-to-end (incl. first-cold-run) + audit bundle: callee comment stripping, codedb_query UAF, secret filters #572
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c79484ac6d
    i += 1; // land on '/' of '*/'; the loop's i += 1 then moves past it
    continue;
}
if (c == '"' or c == '\'') {
Avoid skipping Rust lifetimes as char literals
When analyzing Rust functions with lifetime annotations such as fn parse<'a>(...) { helper(); }, this branch treats the lifetime apostrophe as the start of a quoted literal and, if there is no matching ', advances to the end of the function body. That means real calls after the annotation are never extracted, dropping call graph edges/centrality/callpath for common Rust generic functions; the quote skipping needs to be language-aware or only apply to actual character literals.
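One possible shape for the language-agnostic half of that suggestion is to skip an apostrophe only when it demonstrably opens a character literal. The sketch below is an assumption, not the code in this PR: `charLiteralLen` is a hypothetical helper, and it only recognises single-byte characters and backslash escapes.

```zig
const std = @import("std");

/// Returns the byte length of a character literal starting at `src[i]`,
/// or null when the apostrophe does not open one (for example a Rust
/// lifetime such as `'a`). Hypothetical helper; single-byte chars and
/// backslash escapes only.
fn charLiteralLen(src: []const u8, i: usize) ?usize {
    if (i >= src.len or src[i] != '\'') return null;
    // Plain form: 'x'
    if (i + 2 < src.len and src[i + 1] != '\\' and src[i + 1] != '\'' and src[i + 2] == '\'') {
        return 3;
    }
    // Escaped form: '\n', '\'', '\x41'. Look for the closing quote nearby.
    if (i + 1 < src.len and src[i + 1] == '\\') {
        var j = i + 3; // skip the backslash and the escaped byte
        const limit = @min(src.len, i + 12);
        while (j < limit) : (j += 1) {
            if (src[j] == '\'') return j - i + 1;
        }
    }
    return null;
}
```

A caller would advance past the literal only when a length comes back, and otherwise treat the apostrophe as ordinary code so that calls after a lifetime annotation are still scanned.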
c79484a to b4952f7
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
9d0f05d to 94fc30b
…ty, -22 LOC)

Pure style: brace-wrapped single-statement for/while loops in 11 bool helpers collapsed to single-line form. No behavior change; test parity preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in searchContentRanked

The engram MRR headline (0.36 vs 0.58) is a benchmark artifact (it models codedb as lexical-only), but it masked two real defects:

1. CLI `search` (runQuery, shared by the cold path and the warm cli-daemon) always called the UNRANKED searchContent, so multi-word queries came back in recall order; only the MCP handler routed multi-word queries to the BM25+centrality ranker. New Explorer.searchContentAuto centralises the query-shape decision (multi-word -> searchContentRanked, single token -> searchContent for exact-identifier lookups); runQuery and MCP handleSearch both call it, so CLI and MCP rank identically.
2. searchContentRanked was the only word-index reader that never lazily rebuilt an incomplete (mmap/snapshot-loaded) index, unlike searchContent, searchWord and renderWord. So even wired up, cold-CLI/daemon multi-word ranked search returned NOTHING: the index served `word` but BM25's N collapsed to 0. Added the same lazy rebuild (before the shared lock) so ranked search works for every caller.

Before: `codedb . search "parse token"` -> no results.
After: -> BM25+centrality-ranked hits (`search "search content"` ranks explore.zig #1).

Tests (test_search.zig): searchContentAuto routes multi-word to the ranker; and searchContentRanked rebuilds an incomplete word index instead of returning empty (verified red by neutering the rebuild).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
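The query-shape decision described in this commit can be sketched roughly as follows. This is an illustration only: `SearchMode` and `chooseSearchMode` are hypothetical names, and the actual Explorer.searchContentAuto may split tokens differently.

```zig
const std = @import("std");

const SearchMode = enum { literal, ranked };

/// Picks the search mode from the query shape: two or more
/// whitespace-separated tokens go to the ranker, anything else stays literal.
fn chooseSearchMode(query: []const u8) SearchMode {
    var it = std.mem.tokenizeAny(u8, query, " \t\r\n");
    var count: usize = 0;
    while (it.next()) |_| {
        count += 1;
        if (count >= 2) return .ranked;
    }
    return .literal;
}
```

Keeping the decision in one function is what lets the CLI and MCP paths agree: both callers ask the same question and cannot drift apart.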
…xtraction

extractCallees paired every ( with the preceding identifier with no comment or string stripping, so a name mentioned only in a // line comment, a /* block */ comment, or a string/char literal surfaced as a resolved callee in codedb_context. Strip those spans before call detection.

Audit (2026-06-09) latent-issue sweep; failing tests added first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
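A simplified way to picture the stripping step is to blank out non-code spans before pairing `(` with identifiers. The sketch below is not the PR's implementation: `blankNonCode` is a hypothetical name, and it deliberately ignores char literals, raw strings, and nested comments.

```zig
const std = @import("std");

/// Copies `src` into `out`, replacing // comments, /* */ comments and
/// "double-quoted" strings with spaces so byte offsets stay stable.
/// Simplified sketch: no char literals, raw strings, or nested comments.
fn blankNonCode(src: []const u8, out: []u8) []u8 {
    std.debug.assert(out.len >= src.len);
    const dst = out[0..src.len];
    @memcpy(dst, src);
    var i: usize = 0;
    while (i < src.len) {
        const start = i;
        if (std.mem.startsWith(u8, src[i..], "//")) {
            while (i < src.len and src[i] != '\n') i += 1;
        } else if (std.mem.startsWith(u8, src[i..], "/*")) {
            i = if (std.mem.indexOfPos(u8, src, i + 2, "*/")) |end| end + 2 else src.len;
        } else if (src[i] == '"') {
            i += 1;
            while (i < src.len and src[i] != '"') : (i += 1) {
                if (src[i] == '\\' and i + 1 < src.len) i += 1; // skip the escaped byte
            }
            if (i < src.len) i += 1; // consume the closing quote
        } else {
            i += 1;
            continue;
        }
        @memset(dst[start..i], ' ');
    }
    return dst;
}
```

Replacing spans with spaces rather than deleting them keeps line and column positions valid for whatever reports the call sites afterwards.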
…e + leak)

The deps op appended dependency strings into file_set that were freed each iteration alongside the per-file deps_result (use-after-free), and leaked the standalone seed-path dupe. Own both in a scoped arena freed at pipeline end.

Audit (2026-06-09) latent-issue sweep; testing.allocator catches the poisoned read and the leak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
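The ownership pattern behind this fix can be illustrated with a small stand-alone example. Everything here is hypothetical (`collectDeps`, the file names, the fixed-size `set`); it only shows why duplicating into an arena avoids both the dangling read and the leak.

```zig
const std = @import("std");

/// Collects dependency paths from short-lived per-file results into `set`.
/// Each string is duplicated into `arena`, so it outlives the per-file buffer.
fn collectDeps(arena: std.mem.Allocator, scratch: std.mem.Allocator, set: [][]const u8) !usize {
    const files = [_][]const u8{ "a.zig", "b.zig" };
    var n: usize = 0;
    for (files) |f| {
        if (n == set.len) break;
        // Stand-in for a per-file deps result that is freed every iteration.
        const deps_result = try std.fmt.allocPrint(scratch, "dep_of_{s}", .{f});
        defer scratch.free(deps_result);
        // Storing `deps_result` itself would dangle once the defer runs.
        set[n] = try arena.dupe(u8, deps_result);
        n += 1;
    }
    return n;
}
```

Because the arena is released once by the caller, no per-string free is needed, which is also what removes the leaked duplicate.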
…fast-restore import cap

Security: isSensitivePath (watcher.zig and snapshot.zig, parity-tested) missed *.env-suffix files such as production.env and staging.env, plus .git-credentials, so those secrets were indexed and read. Add the suffix match and exact name to both copies.

Fast-restore: loadOutlineStateMap read OUTLINE_STATE imports with a 4096-byte cap while the writer allows u16 max, so a longer import silently disabled the borrow-path fast-restore. Widen the read cap to maxInt(u16).

Audit (2026-06-09) latent-issue sweep; failing tests added first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
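The two rules added to the secret filter amount to a basename check. A minimal sketch, assuming a hypothetical `isSensitiveName` that covers only the patterns named in this commit (the real isSensitivePath presumably handles more):

```zig
const std = @import("std");

/// Sketch of a basename-based secret filter. Only the rules named in the
/// commit message are shown here.
fn isSensitiveName(path: []const u8) bool {
    const base = std.fs.path.basename(path);
    if (std.mem.eql(u8, base, ".git-credentials")) return true;
    // Matches ".env" itself as well as suffix forms such as "production.env".
    return std.mem.endsWith(u8, base, ".env");
}
```

Since the commit describes two copies of the function kept in step by a parity test, a check like this would need to be applied identically in both places.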
Six latent issues in explore.zig:

- parseDelimitedImport: strip a trailing `as <alias>` so Kotlin/Swift import X as Y keys on X, not the aliased form.
- getImportedBy: skip the basename fallback when two or more indexed files share a basename, so a bare import conf is not attributed to every conf.py.
- parsePythonLine: record one dep per comma-separated module in import a, b (was only the first); factor out appendPythonModuleDep.
- searchContent tier0: drop the use_line_hits early-return that skipped rerankAndFinalize, so the canonical basename match outranks a high-count non-canonical file (the #537/#448 inversion at small max_results).
- tier0 fast-path sort: add the path-prior (basename-stem/segment match plus test/vendor demotion) so renderPlainSearch and codedb_search no longer rank raw hit-count over the canonical file.
- readContentForSearch: raise the disk-read cap 512KB to 64MB to match the indexer, so a word-indexed file larger than 512KB evicted from the content cache stays searchable.

Audit (2026-06-09) latent-issue sweep; failing tests added first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
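Two of the import-parsing items above (one dependency per comma-separated module, and dropping an `as <alias>` suffix) can be combined in one small illustration. This is a sketch under assumptions: `pythonImportModules` is a hypothetical function applied to a Python `import` line, not the PR's parsePythonLine or parseDelimitedImport.

```zig
const std = @import("std");

/// Writes each module named by a Python `import a, b as c` line into `out`
/// and returns how many were found. An ` as <alias>` suffix is dropped so
/// the dependency keys on the module, not the alias.
fn pythonImportModules(line: []const u8, out: [][]const u8) usize {
    const prefix = "import ";
    const trimmed = std.mem.trim(u8, line, " \t");
    if (!std.mem.startsWith(u8, trimmed, prefix)) return 0;
    var n: usize = 0;
    var it = std.mem.splitScalar(u8, trimmed[prefix.len..], ',');
    while (it.next()) |part| {
        var name = std.mem.trim(u8, part, " \t");
        if (std.mem.indexOf(u8, name, " as ")) |idx| {
            name = std.mem.trim(u8, name[0..idx], " \t");
        }
        if (name.len == 0 or n == out.len) continue;
        out[n] = name;
        n += 1;
    }
    return n;
}
```

For `import a, b as c` this yields `a` and `b`, which is the behaviour the commit describes: every listed module is recorded, and the alias never becomes the key.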
…mpty-but-complete word index; multi-word ranked search returns nothing

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two cold-start defects made multi-word `codedb search` return nothing:

1. main.zig disables the word index for non-index commands, but word_index_complete defaults to true. Committing files through a disabled index left an EMPTY index that claimed completeness, so searchContentRanked trusted the flag, skipped its lazy rebuild, and collapsed to N == 0. commitParsedFileOwnedOutline now clears word_index_complete when the disabled index can't absorb a file.
2. The first-ever cold `search` scan (no trigram on disk) used the trigram-extract-only fast path, committing neither outlines nor contents, so BM25 had nothing to rank even after a rebuild. Multi-word non-regex searches now route through the full single-pass scan; single-token and --regex searches keep the trigram-only fast path.

Live matrix on this repo: first-cold multi-word 0 -> 50 results (scan 722ms -> 1.7s, one-time; trigram persists so later runs stay fast); cold single-word/regex fast paths unchanged; warm paths unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
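The first defect is an interaction between a default-true flag and a disabled index. A toy model can make that concrete; the field name `word_index_complete` mirrors the commit message, while `ToyIndex` and its methods are hypothetical and much simpler than the real structures.

```zig
const std = @import("std");

/// Toy model of the flag interaction described above.
const ToyIndex = struct {
    enabled: bool,
    word_index_complete: bool = true,
    indexed_files: usize = 0,
    committed_files: usize = 0,

    fn commitFile(self: *ToyIndex) void {
        self.committed_files += 1;
        if (self.enabled) {
            self.indexed_files += 1;
        } else {
            // A disabled index cannot absorb the file, so stop claiming completeness.
            self.word_index_complete = false;
        }
    }

    /// Returns the document count N that a ranked search would see.
    fn rankedDocCount(self: *ToyIndex) usize {
        if (!self.word_index_complete) {
            // Lazy rebuild: index everything that was committed.
            self.indexed_files = self.committed_files;
            self.word_index_complete = true;
        }
        return self.indexed_files;
    }
};
```

Without the `else` branch in `commitFile`, the flag would stay true, the rebuild would be skipped, and `rankedDocCount` would return 0, which is the `N == 0` collapse the commit describes.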
…es)' summary

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…er-token fallback

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Empty lists printed a body line ' (none)' and no summary, so machine consumers that parse the list body saw one entry (engram counted in-degree 1 for every zero-importer file, degenerating its centrality features). Print '(0 files)' alongside '(none)' at all three render sites (the imported-by fast path, the general results path for forward/transitive, and handleDepsPathOnly) so the summary line is always present.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
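The rendering rule is easy to state in code: the summary line is written regardless of whether the list is empty. The sketch below assumes a hypothetical `renderDeps` and an illustrative output layout; the exact indentation and wording of the real renderers may differ.

```zig
const std = @import("std");

/// Renders a deps listing into `buf`. The "(N files)" summary is written
/// unconditionally, so an empty list yields "(none)" plus "(0 files)".
fn renderDeps(buf: []u8, files: []const []const u8) ![]const u8 {
    var len: usize = 0;
    if (files.len == 0) {
        len += (try std.fmt.bufPrint(buf[len..], "  (none)\n", .{})).len;
    }
    for (files) |f| {
        len += (try std.fmt.bufPrint(buf[len..], "  {s}\n", .{f})).len;
    }
    len += (try std.fmt.bufPrint(buf[len..], "({d} files)\n", .{files.len})).len;
    return buf[0..len];
}
```

A consumer that reads the count from the summary line no longer has to guess whether a single body line is an entry or a placeholder.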
'word' looked up the literal phrase in the inverted index, so an agent-shaped query like "gateway websocket reconnect" returned zero hits even when every token had plentiful hits. A whitespace-bearing query now routes through searchWordTokensLocked: each distinct (normalized, deduped) token is looked up and files are ranked by distinct-token coverage, then total hits, then path, returning one representative hit per file. Shared by the CLI (searchWord) and MCP codedb_word (renderWord, which labels the mode '(tokenized)').

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
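The ranking order named here (coverage, then total hits, then path) maps directly onto a three-key comparator. This is a sketch with hypothetical names (`FileScore`, `betterMatch`, `rankFiles`); it shows only the ordering, not the token lookup itself.

```zig
const std = @import("std");

const FileScore = struct {
    path: []const u8,
    coverage: usize, // distinct query tokens found in the file
    total_hits: usize,
};

/// Orders by coverage (descending), then total hits (descending),
/// then path (ascending) as a deterministic tie-break.
fn betterMatch(_: void, a: FileScore, b: FileScore) bool {
    if (a.coverage != b.coverage) return a.coverage > b.coverage;
    if (a.total_hits != b.total_hits) return a.total_hits > b.total_hits;
    return std.mem.order(u8, a.path, b.path) == .lt;
}

fn rankFiles(scores: []FileScore) void {
    std.mem.sort(FileScore, scores, {}, betterMatch);
}
```

Putting coverage first means a file that mentions all three query tokens once outranks a file that mentions one token a hundred times, which matches the "files covering more tokens first" behaviour reported for `codedb word`.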
94fc30b to 7d6c537
…nd table

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…empty symbol name not an error

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…biguity, mirroring getImportedBy

The justrach#572 bundle added the same-basename ambiguity guard to getImportedBy (query-pipeline deps op) but renderImportedBy, the MCP codedb_deps reverse listing, kept the unconditional fallback, so a bare 'import conf' was still attributed to every indexed conf.py through the primary tool. Count same-basename outlines and skip the fallback when ambiguous, exactly like getImportedBy.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
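The ambiguity guard reduces to counting how many indexed paths share the target's basename. A minimal sketch, assuming a hypothetical `basenameFallbackAllowed` over a plain list of paths rather than the real outline map:

```zig
const std = @import("std");

/// Returns true when the basename fallback is safe to use: fewer than two
/// indexed paths share the basename of `target`.
fn basenameFallbackAllowed(indexed: []const []const u8, target: []const u8) bool {
    const want = std.fs.path.basename(target);
    var matches: usize = 0;
    for (indexed) |p| {
        if (std.mem.eql(u8, std.fs.path.basename(p), want)) matches += 1;
    }
    return matches < 2;
}
```

With `docs/conf.py` and `pkg/conf.py` both indexed, the fallback is skipped for either file, so a bare `import conf` is no longer credited to both.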
Problem
Two threads, rebased onto `release/0.2.5825` and overlap-resolved against #557/#559/#561/#563/#565 (and re-rebased over #567/#571):

1. #546, the CLI half. `runQuery` (cold CLI + warm cli-daemon) always called the unranked `searchContent`, so multi-word queries came back in recall order (usually nothing, since the literal phrase rarely appears); only the MCP handler ranked. Complements #557, which fixed the single-word rerank priors.

While validating live, two further cold-start defects surfaced: multi-word `codedb search` returned nothing on any cold start even with the routing fixed.
- `word_index_complete` defaults to `true`, so the scan left an empty index claiming completeness; `searchContentRanked` trusted the flag, skipped its lazy rebuild, and collapsed to `N == 0`.
- The first-ever cold `search` scan (no trigram on disk) used the trigram-extract-only fast path, committing neither outlines nor contents, so BM25 had nothing to rank even after a rebuild.

2. Audit (2026-06-09) latent-issue sweep, with failing tests committed first for each:
- `codegraph.extractCallees` paired every `(` with the preceding identifier with no comment/string stripping, so names mentioned in `//`, `/* */`, or string literals surfaced as resolved callees in `codedb_context` (complements #563, "fix(#562): codedb_callers excludes full-line comment mentions from call sites"; same class in `handleCallers`)
- `codedb_query` deps op appended freed dependency strings into `file_set` (use-after-free) and leaked the seed string
- secret filter (`isSensitivePath`) missed `*.env` suffixes (`prod.env`) and `.git-credentials`; fast-restore import read cap (4096) was narrower than the write cap (65535), silently disabling the borrow path
- tier-0 `use_line_hits` early-return skipped `rerankAndFinalize` (the #537/#448 inversion, reintroduced for small max_results; #537 is "search: structurally-relevant files rank below lexical hits (engram benchmark MRR 0.40 vs 0.57)", #448 is "explore: rerank misses canonical files for basename-stem and case variants"); `renderPlainSearch` basename prior; >512KB evicted-file read cap

3. Appended: #568 + #569 (engram-bridge findings, red→green pairs):
- Empty `deps` lists printed a body line `(none)` and no `(N files)` summary, so machine consumers parsing the list body saw one entry (engram counted in-degree 1 for every zero-importer file, degenerating its centrality features to all-1.0 ties).
- Multi-word `word` queries returned zero hits (literal-phrase lookup only; the multi-word analogue of #547, "codedb `search` returns no results for identifier terms that `word` indexes", which `word` didn't catch either). `search` multi-word is fixed by thread 1; this closes the `word` half.

Fix
- `Explorer.searchContentAuto`: query-shape routing (multi-word → BM25+centrality, single token → literal) shared by `runQuery` and the MCP handler
- `commitParsedFileOwnedOutline` clears `word_index_complete` when the disabled word index can't absorb a file, so ranked/BM25 readers lazy-rebuild instead of trusting a lying flag
- Multi-word non-regex searches route through the full single-pass scan; single-token and `--regex` keep the trigram-only fast path
- Comment/string stripping in `extractCallees`; `codedb_query` deps-op strings arena-owned; `isSensitivePath` extended + parity-kept; tier-0 results flow through rerank
- `(0 files)` printed alongside `(none)` at all three deps render sites (imported-by fast path, general forward/transitive path, `handleDepsPathOnly`); the summary line is now unconditional
- `searchWordTokensLocked`: whitespace-bearing queries tokenize (normalized, deduped), look up each token, and rank files by distinct-token coverage → total hits → path, one representative hit per file; shared by CLI `searchWord` and MCP `renderWord` (which labels the mode `(tokenized)`)

Rebase notes
Replayed onto the release tip twice (51c3c39, then 43228cc after #567/#571 merged). Conflicts resolved keeping both sides: `searchContentAuto` threaded through #561's path_glob escalation loop (both fetch sites), all EOF tests preserved pairwise, #557's tooling-path prior verified intact in the merged `rerankSignalScore` chain (tests 0.6 → bench/scripts/website/install 0.5 → vendor 0.4 → doc cap).

Validation
- `zig build test`: 749/749 pass (23/23 steps)
- Live matrix (`CODEDB_NO_CLI_DAEMON=1`):
  - first-cold multi-word `search`: 0 → 50 results (`word index rebuild`, `snapshot load`); one-time scan cost 722ms → 1.7s, trigram persists so later runs stay fast
  - cold single-word (`rebuildWordIndex`): 9 results, fast path unchanged; cold `--regex`: 2 results, fast path unchanged
  - `codedb word "word index rebuild"`: 0 → 176 hits, files covering more tokens first
  - `codedb deps src/tests.zig` (zero importers): now prints `(none)` + `(0 files)`
- `codedb_insights` re-run post-fix: codedb MRR 0.398 vs engram re-rank 0.288 on the latest 25-commit window (the historical gap that spawned #546, "search: structurally-relevant files rank below lexical hits (engram benchmark MRR 0.36 vs 0.58)", was 0.30 vs 0.60 the other way)

🤖 Generated with Claude Code