fix: memory/RSS + daemon-lifecycle bug hunt — 8 issues (#583 #584 #586 #587 #589 #592 #593 #594)#585
Conversation
…s; ContentCache strands overflow inserts issue-583: readFromDisk/mmapFromDisk indexes run skip_file_words=true, so removeFile no-ops: post-load re-index appends postings while stale ones survive (ghost hits, stale lines, unbounded growth across saves), deletes leave all postings live, and pure-mmap removeFile never promotes. issue-584: ContentCache overflow eviction uses a global CLOCK hand, landing the new entry outside its 4-slot probe window where get() can never find it (stranded bytes + permanent miss); remove() holes early-break lookups for in-window entries and let put() insert duplicate copies of a key. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ndow Three probe-window violations in one shape: the overflow path evicted via a global CLOCK hand and wrote the new entry wherever the hand stopped — outside the key's probe window, where get() can never find it (stranded bytes, permanent misses, disk re-reads); remove() holes early-broke get() for in-window entries stored past them; and put() took the first hole before scanning the rest of the window, duplicating a key it already held. putImpl now scans the full window (update-in-place wins over first-empty), evicts in-window with second-chance ref bits when the window is truly full, and get/remove probe all slots. The global clockEvict/hand machinery is dead and removed. PROBE_LIMIT goes 4 -> 8: eviction is now real (the old code never freed on overflow, it stranded), so the window needs enough associativity that a full window at typical occupancy is vanishingly rare. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
readFromDisk/mmapFromDisk/mergeShard leave file_words empty, so removeFile early-returned for every file from a bulk path: post-load re-indexes appended postings while the stale ones survived (ghost terms at stale lines, doubled BM25 tf), deletes left a file's postings live forever, and a long-running watch session grew the index on every save. removeFile gains a slow path for tracked-but-listless files: sweep every posting list for the doc_id, prune empty buckets, fix doc_lengths / total_tokens, and free the id_to_path slot string (the owner in skip_file_words mode). Pure-mmap mode promotes first when the path is tracked — a remove is a write. indexFile now always records the per-file word list (its own key dupe in skip mode, since id_to_path owns stable_path there), so the sweep runs at most once per file before the fast path takes over; writeToDisk builds its file table from id_to_path — always complete — instead of preferring a file_words map that only covers incrementally indexed files once bulk-loaded docs exist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
One more datapoint: the longstanding flake in 🤖 Generated with Claude Code |
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: abd58fc2bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const fw_key = if (self.skip_file_words) | ||
| try self.allocator.dupe(u8, stable_path) | ||
| else |
There was a problem hiding this comment.
Honor skip_file_words for bulk indexing
When callers build a serial bulk word index with skip_file_words=true (for example, persistWordIndexFromSource sets it before calling indexFile for every outline), this new branch still allocates a file_words key and compact word slice and then stores them. That defeats the memory-saving path that is used for bulk rebuilds and can reintroduce the large RSS growth that skip_file_words is meant to avoid on big repositories; keep file_words empty for those bulk builders or separate the incremental cleanup mode from the skip flag.
Useful? React with 👍 / 👎.
…emoveFile leaves skip_trigram_files entry issue-586: symbol_index is keyed by sym.name slices owned by the file's outline; re-indexing (or deleting) the file that first inserted a shared name frees the key bytes while the entry survives via other files' locations — every later probe reads freed memory and the name silently drops out of the O(1) index. issue-587: Explorer.removeFile never removes the skip_trigram_files entry, whose key aliases the outlines key it frees — the tier-3 search scan then iterates a dangling path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… and removal #586: symbol_index now owns its keys — duped on first insert, freed when an emptied entry is pruned and at deinit. Keys previously aliased sym.name slices owned by the inserting file's outline; re-indexing or deleting that file freed the bytes the map hashes while shared-name entries (init/deinit/main in nearly every Zig file) survived, so every later probe read freed memory and the name silently fell out of the O(1) index, degrading lookups to the outline scans. #587: Explorer.removeFile also removes the skip_trigram_files entry, whose key aliases the outlines key being freed — the tier-3 scan no longer iterates a dangling path. Removal only, no free: the outlines loop owns that allocation, matching the re-index path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sk FIDO key names ssh-keygen's other default basenames dodge the exact-name list in both the watcher and snapshot copies: deploy/id_ecdsa is indexed and readable while deploy/id_rsa is blocked. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…SensitivePath copies All six ssh-keygen default private-key basenames are now in the exact list (they start with 'i', so the fast path already routes them there); the #528 parity test keeps the watcher and snapshot copies aligned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
… freed prior_content restore issue-593: AnyTrigramIndex.removeFile no-ops in pure-mmap mode and the overlay never masks base entries — deleted files stay 'contained' with live candidates, edited files answer from both old and new content. issue-594: a fail-once allocator sweep over a re-index aborts the whole test binary — parseContentForIndexing builds a throwaway full Explorer per file (4096-slot cache alloc+memset each call) whose ContentCache.init PANICS on OOM; once that is an error instead, the sweep shows the trigram-failure errdefer restoring the word index from prior_content that contents.put already freed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…deinit, no poisoned entries The fail-once allocator sweep (issue-594 test) surfaced a family of failure-path defects in the indexFile flow; all are fixed here: - parseContentForIndexing built a throwaway full Explorer per file — a 4096-slot ContentCache alloc+memset per call — and ContentCache.init PANICS on OOM, so an allocation hiccup mid-indexing killed the daemon. Explorer.init now wraps a fallible initFallible, and the parse shell uses `try initFallible(allocator, 1)`. - commitParsedFileOwnedOutline stacked `errdefer owned_outline.deinit()` with `defer owned_outline.deinit()` — any post-clone error deinit'd the parsed outline twice (double free of every symbol name). - prior_content was read out of the cache and freed by contents.put two lines later while the trigram-failure errdefer still re-indexed it. contents.put is now the LAST fallible step, and one function-scoped errdefer restores the word index from the still-valid prior_content on any failure (including inside word_index.indexFile itself, whose internal removeFile has already wiped the old postings). - removeSymbolIndexFor deinit'd the in-map value (poisoning it) before the prune-list append; if that append failed, the map kept an entry whose value crashed the next deinit/iteration. Deinit now happens only at actual fetchRemove time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ndex drops orphaned getOrPut entries on dupe failure #593: MmapOverlay gains a masked path set (owned keys). indexFile and removeFile mask the path (removeFile on pure-mmap promotes first — a remove is a write, but only when the base actually tracks the path); containsFile and the candidates/candidatesRegex merges filter base answers through the mask (shared mergeOverlayCandidates helper), and fileCount subtracts a maintained masked-in-base counter. #594 (same sweep): indexOneToken and mergeShard did getOrPut, then duped the key — a failed dupe propagated out leaving an entry whose key points at the tokenizer's stack buffer (or the shard's freed arena) with an undefined value; the next insert that landed on it dereferenced garbage. A failed dupe now removes the fresh entry before returning the error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Concurrent cold CLI calls forked a cli-daemon EACH: no mutual exclusion on spawn, every duplicate paid the full index scan before noticing the socket was taken — and the stale-path unlink in cliDaemonListen let late arrivals steal the winner's live socket, orphaning it mid-scan (observed: 9 daemons for one root, 10-23% CPU each, 0.45s benchmark -> 22s). The daemon arm now takes <data_dir>/cli-daemon.lock (exclusive, non-blocking flock) BEFORE the expensive load: losers exit immediately; the fd is held for the process lifetime and the kernel releases it on any exit, so crashes never leave a stale lock. The CLI auto-spawn path probes the same lock and skips the fork while a daemon is alive or starting — racing callers cold-serve their one call and the next call hits the winner's socket. Verified live: 8 concurrent cold calls against one root -> exactly 1 daemon (was: several, each rescanning). The contract test ships in the same commit: the lock primitive did not exist before the fix, so there is no red state to commit first (the issue's own evidence is the measured stampede). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
justrach
left a comment
There was a problem hiding this comment.
Automated review (Sonnet pass + verification), defects-only scope:
Verdict: safe to merge. All 8 filed bugs verified genuinely closed; no new memory-safety defects, double-frees, or UAFs found in the bundle.
Two low-severity, non-blocking observations:
-
#592 spawn-race fix is a window-shrink, not full TOCTOU elimination —
daemonLockAvailablereleases the probe lock before the fork, so under a true stampede N callers can still fork N daemons that racedaemonLockTryAcquire; losers exit before the expensive load. Outcome (no duplicate running daemons, no rescan churn) matches the issue, just with N−1 wasted forks in the worst case. The comment's 'losers simply cold-serve' slightly undersells that they spawn-and-exit. -
MmapOverlay.mask()swallows OOM silently — if thedupe/putfails, the base entry stays unmasked andcontainsFilekeeps answering true for stale content for the duration of memory pressure (same degraded behavior as pre-fix, so not a regression). Worth a follow-up: log the swallow or propagate soindexFilecan retry.
Cleared after explicit verification (so nobody re-checks): #594 prior_content UAF ordering and owned_outline error-path deinit; #587 skip_trigram_files key lifetime (remove-before-free, value-equality lookup); #586 symbol_index key dup/free symmetry; #583 skip_file_words separate allocations (no double-free); #584 ContentCache sentinel removal correctness for the 8-probe window; mergeOverlayCandidates frees on all paths; isSensitivePath updated in both copies (snapshot+watcher; id_rsa_sk omission is correct — no such OpenSSH name); parseContentForIndexing capacity=1 harmless; word_index_generation bump on failed trigram = extra-but-valid disk write.
🤖 Generated with Claude Code
…index-stale-postings
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Memory/RSS-focused bug hunt grown across three rounds, all red→green where a red state exists. Suite 761/761.
Round 1 — index growth & cache integrity
readFromDisk/mmapFromDisk/mergeShard— the warm-daemon config) leftremoveFilea no-op → ghost/stale postings, doubled BM25 tf, unbounded growth per save. Fix: doc_id sweep slow path, hybrid per-file word lists,writeToDiskfile table fromid_to_path, mmap promote-on-remove.Round 2 — dangling references & secrets
symbol_indexkeyed by outline-ownedsym.name: re-index/delete of the first inserter freed the key while shared-name entries survived (init/deinit/mainare shared by nearly every Zig file) → UAF on every probe, names silently dropping to slow scans. Fix: map owns its keys.Explorer.removeFileleft a dangling path inskip_trigram_files(key aliases the freed outlines key) → tier-3 scan iterated freed memory. Fix: remove alongside the other structures.isSensitivePathmissedid_ecdsa/id_dsa/*_skSSH keys:deploy/id_ecdsaindexed whiledeploy/id_rsablocked. Both copies fixed.Round 3 — mmap trigram ghosts + the indexFile OOM family + daemon stampede
AnyTrigramIndex.removeFileno-op'd in pure-mmap mode and the overlay never masked base entries: deleted files stayed "contained" with live candidates; edited files answered from both old and new content. Fix:MmapOverlay.maskedpath set (owned keys), filtered throughcontainsFile/candidates/candidatesRegex(shared merge helper),fileCountcorrection, promote-on-remove.ContentCache.initpanics on OOM (now a fallible 1-slot parser shell);errdefer+deferdouble-deinit of the parsed outline; theprior_contentUAF (cacheputnow last fallible step + one function-scoped word-index restore); restore gap for failures insideword_index.indexFile; orphanedgetOrPutentries on key-dupe failure inindexOneToken/mergeShard(stack-buffer keys, undefined values → segfault);removeSymbolIndexForpoisoning the in-map value before a fallible append.Also filed (enhancements, per #550/#564 precedent)
Notes
codedb(outline/deps/context/status) against its own tree plus targeted reads and a fail-once allocator sweep.commitParsedFileOwnedOutlinemetadata flags drift on failure paths (cosmetic); serve-vs-cli-daemon socket handover quirk (pre-existing, proxy-disabled fallback).test_mcpissue-148 EOF test and the threetest_indexperf-threshold tests flake when their binaries re-run under concurrent load; all pass consistently on green suites.Closes #583. Closes #584. Closes #586. Closes #587. Closes #589. Closes #592. Closes #593. Closes #594.
🤖 Generated with Claude Code