fix: memory/RSS + daemon-lifecycle bug hunt — 8 issues (#583 #584 #586 #587 #589 #592 #593 #594) #585

Merged
justrach merged 12 commits into release/0.2.5825 from fix/word-index-stale-postings
Jun 10, 2026

@justrach justrach commented Jun 10, 2026

Owner

Memory/RSS-focused bug hunt grown across three rounds, all red→green where a red state exists. Suite 761/761.

Round 1 — index growth & cache integrity

Round 2 — dangling references & secrets

Round 3 — mmap trigram ghosts + the indexFile OOM family + daemon stampede

Also filed (enhancements, per #550/#564 precedent)

Notes

  • Found by dogfooding codedb (outline/deps/context/status) against its own tree plus targeted reads and a fail-once allocator sweep.
  • Known + parked: commitParsedFileOwnedOutline metadata flags drift on failure paths (cosmetic); serve-vs-cli-daemon socket handover quirk (pre-existing, proxy-disabled fallback).
  • The test_mcp issue-148 EOF test and the three test_index perf-threshold tests flake when their binaries re-run under concurrent load; all pass consistently on green suites.

Closes #583. Closes #584. Closes #586. Closes #587. Closes #589. Closes #592. Closes #593. Closes #594.

🤖 Generated with Claude Code

justrach and others added 3 commits June 10, 2026 15:32
…s; ContentCache strands overflow inserts

issue-583: readFromDisk/mmapFromDisk indexes run skip_file_words=true, so
removeFile no-ops: post-load re-index appends postings while stale ones
survive (ghost hits, stale lines, unbounded growth across saves), deletes
leave all postings live, and pure-mmap removeFile never promotes.

issue-584: ContentCache overflow eviction uses a global CLOCK hand, landing
the new entry outside its 4-slot probe window where get() can never find
it (stranded bytes + permanent miss); remove() holes early-break lookups
for in-window entries and let put() insert duplicate copies of a key.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ndow

Three probe-window violations in one shape: the overflow path evicted via
a global CLOCK hand and wrote the new entry wherever the hand stopped —
outside the key's probe window, where get() can never find it (stranded
bytes, permanent misses, disk re-reads); remove() holes early-broke get()
for in-window entries stored past them; and put() took the first hole
before scanning the rest of the window, duplicating a key it already held.

putImpl now scans the full window (update-in-place wins over first-empty),
evicts in-window with second-chance ref bits when the window is truly
full, and get/remove probe all slots. The global clockEvict/hand machinery
is dead and removed. PROBE_LIMIT goes 4 -> 8: eviction is now real (the
old code never freed on overflow, it stranded), so the window needs
enough associativity that a full window at typical occupancy is
vanishingly rare.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
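The probe-window rules described in this commit can be sketched as a small Python model. This is not the project's Zig code: the class name `WindowedCache`, the list-based slot layout, and the use of Python's built-in `hash` are all assumptions made for illustration. Only the window width of 8 and the three behaviours (full-window scan on insert, in-window second-chance eviction, lookups that do not stop at holes) come from the commit message.

```python
PROBE_LIMIT = 8  # window width, matching the commit's new value


class WindowedCache:
    """Toy fixed-capacity cache: a key may only live inside its probe window."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Each slot is None (empty) or a [key, value, ref_bit] list.
        self.slots = [None] * capacity

    def _window(self, key):
        start = hash(key) % self.capacity
        return [(start + i) % self.capacity for i in range(PROBE_LIMIT)]

    def get(self, key):
        # Probe every slot in the window; a hole must not end the search early.
        for i in self._window(key):
            slot = self.slots[i]
            if slot is not None and slot[0] == key:
                slot[2] = True  # mark as recently used
                return slot[1]
        return None

    def put(self, key, value) -> None:
        window = self._window(key)
        first_empty = None
        # Scan the whole window first so an existing copy is updated in place.
        for i in window:
            slot = self.slots[i]
            if slot is None:
                if first_empty is None:
                    first_empty = i
            elif slot[0] == key:
                slot[1], slot[2] = value, True
                return
        if first_empty is not None:
            self.slots[first_empty] = [key, value, True]
            return
        # Window is full: second-chance eviction restricted to the window.
        for i in window:
            if self.slots[i][2]:
                self.slots[i][2] = False  # spend the second chance
            else:
                self.slots[i] = [key, value, True]
                return
        # Every ref bit was set and is now clear, so evict the first slot.
        self.slots[window[0]] = [key, value, True]

    def remove(self, key) -> bool:
        for i in self._window(key):
            slot = self.slots[i]
            if slot is not None and slot[0] == key:
                self.slots[i] = None
                return True
        return False
```

In this model a new entry can never land outside its own window, so `get` can always reach whatever `put` stored. The price is that a full window evicts one of its own entries even when other parts of the table are empty.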
readFromDisk/mmapFromDisk/mergeShard leave file_words empty, so removeFile
early-returned for every file from a bulk path: post-load re-indexes
appended postings while the stale ones survived (ghost terms at stale
lines, doubled BM25 tf), deletes left a file's postings live forever, and
a long-running watch session grew the index on every save.

removeFile gains a slow path for tracked-but-listless files: sweep every
posting list for the doc_id, prune empty buckets, fix doc_lengths /
total_tokens, and free the id_to_path slot string (the owner in
skip_file_words mode). Pure-mmap mode promotes first when the path is
tracked — a remove is a write. indexFile now always records the per-file
word list (its own key dupe in skip mode, since id_to_path owns
stable_path there), so the sweep runs at most once per file before the
fast path takes over; writeToDisk builds its file table from id_to_path —
always complete — instead of preferring a file_words map that only covers
incrementally indexed files once bulk-loaded docs exist.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
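The fast-path and slow-path split in `removeFile` can be illustrated with a toy inverted index. The sketch below is a Python model under stated assumptions, not the Zig implementation: `WordIndex`, `bulk_load`, and the `(doc_id, line)` posting shape are hypothetical, and the bookkeeping for doc_lengths and total_tokens is left out.

```python
from collections import defaultdict


class WordIndex:
    """Toy inverted index: word -> list of (doc_id, line) postings."""

    def __init__(self) -> None:
        self.postings = defaultdict(list)
        self.path_to_id = {}
        self.file_words = {}  # path -> set of words; empty after a bulk load
        self.next_id = 0

    def bulk_load(self, path, words_by_line) -> None:
        """Mimics a bulk path: postings are stored, file_words is not."""
        doc_id = self._id_for(path)
        for line, word in words_by_line:
            self.postings[word].append((doc_id, line))

    def index_file(self, path, words_by_line) -> None:
        self.remove_file(path)  # drop stale postings before re-adding
        doc_id = self._id_for(path)
        for line, word in words_by_line:
            self.postings[word].append((doc_id, line))
        # Always record the word list so the next removal takes the fast path.
        self.file_words[path] = {w for _, w in words_by_line}

    def remove_file(self, path) -> None:
        doc_id = self.path_to_id.get(path)
        if doc_id is None:
            return
        words = self.file_words.pop(path, None)
        if words is None:
            # Slow path: tracked but listless, so sweep every posting list.
            words = list(self.postings)
        for word in words:
            kept = [p for p in self.postings[word] if p[0] != doc_id]
            if kept:
                self.postings[word] = kept
            else:
                del self.postings[word]  # prune the empty bucket
        del self.path_to_id[path]

    def _id_for(self, path):
        if path not in self.path_to_id:
            self.path_to_id[path] = self.next_id
            self.next_id += 1
        return self.path_to_id[path]
```

Because `index_file` always records the word list, the full sweep runs at most once per bulk-loaded file in this model; every later removal of that file touches only its own words.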
@justrach

Owner Author

One more datapoint: the longstanding flake in test_snapshot "parallel freshness load re-indexes changed files, restores the rest" (fails ~1 in 6 runs with error.MissingContent) is a live manifestation of #584. The freshness loader keys the ContentCache by absolute paths, so the tmpdir prefix re-rolls the FNV hashes every run. With 1002 keys in 16384 slots and a 4-slot probe window, the expected number of full windows per run is ~0.2. On the runs where one filled, the old overflow path stranded the entry where get() could never find it, so contents.get(abs_path) → MissingContent. With in-window eviction and the 8-way window, the expected number of full windows at that occupancy is ~10⁻⁶ per run. The flake should be gone with this PR.
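The two estimates can be reproduced with a back-of-the-envelope calculation. The sketch below assumes each slot is occupied independently with probability keys/slots, which is an approximation for a well-mixed hash; the function name is hypothetical.

```python
def expected_full_windows(keys: int, slots: int, window: int) -> float:
    """Expected count of probe windows whose slots are all occupied.

    Assumes independent slot occupancy with probability keys / slots.
    """
    occupancy = keys / slots
    return slots * occupancy ** window


four_way = expected_full_windows(1002, 16384, 4)
eight_way = expected_full_windows(1002, 16384, 8)
print(f"4-slot window: {four_way:.3f} full windows per run")
print(f"8-slot window: {eight_way:.2e} full windows per run")
```

Under that assumption the 4-slot figure comes out a little above 0.2 and the 8-slot figure at a few times 10⁻⁶, in line with the numbers quoted in the comment and with a flake rate on the order of one run in five or six.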

🤖 Generated with Claude Code

@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 70656 | 73633 | +4.21% | +2977 | OK |
| codedb_changes | 7300 | 7404 | +1.42% | +104 | OK |
| codedb_context | 830335 | 839218 | +1.07% | +8883 | OK |
| codedb_deps | 233 | 286 | +22.75% | +53 | NOISE |
| codedb_edit | 65292 | 42063 | -35.58% | -23229 | OK |
| codedb_find | 7052 | 7123 | +1.01% | +71 | OK |
| codedb_hot | 18355 | 21543 | +17.37% | +3188 | NOISE |
| codedb_outline | 20978 | 25721 | +22.61% | +4743 | NOISE |
| codedb_read | 11268 | 11114 | -1.37% | -154 | OK |
| codedb_search | 25349 | 25392 | +0.17% | +43 | OK |
| codedb_snapshot | 42560 | 44029 | +3.45% | +1469 | OK |
| codedb_status | 6762 | 6625 | -2.03% | -137 | OK |
| codedb_symbol | 31914 | 36443 | +14.19% | +4529 | NOISE |
| codedb_tree | 29812 | 31116 | +4.37% | +1304 | OK |
| codedb_word | 8197 | 7872 | -3.96% | -325 | OK |
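The Status column appears to follow a two-threshold rule. The sketch below is an inference from the report's header text and rows, not the CI script itself: the strict comparisons and the `FAIL` label are assumptions, since no failing row appears anywhere in this thread.

```python
PCT_THRESHOLD = 10.0       # percent, from the report header
ABS_THRESHOLD_NS = 50_000  # nanoseconds, from the report header


def classify(base_ns: int, head_ns: int) -> str:
    """Return OK, NOISE, or FAIL for one benchmark row."""
    delta_ns = head_ns - base_ns
    delta_pct = 100.0 * delta_ns / base_ns
    if delta_pct <= PCT_THRESHOLD:
        return "OK"     # includes every speed-up
    if delta_ns <= ABS_THRESHOLD_NS:
        return "NOISE"  # large percentage, tiny absolute change
    return "FAIL"       # assumed label; no failing row appears in the reports
```

For example, `classify(233, 286)` gives `NOISE` for the codedb_deps row: a 22.75% slowdown that amounts to only 53 ns.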

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: abd58fc2bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/index.zig
Comment on lines +325 to +327

```zig
const fw_key = if (self.skip_file_words)
    try self.allocator.dupe(u8, stable_path)
else
```

P2: Honor skip_file_words for bulk indexing

When callers build a serial bulk word index with skip_file_words=true (for example, persistWordIndexFromSource sets it before calling indexFile for every outline), this new branch still allocates a file_words key and compact word slice and then stores them. That defeats the memory-saving path that is used for bulk rebuilds and can reintroduce the large RSS growth that skip_file_words is meant to avoid on big repositories; keep file_words empty for those bulk builders or separate the incremental cleanup mode from the skip flag.

Useful? React with 👍 / 👎.

justrach and others added 2 commits June 10, 2026 16:20
…emoveFile leaves skip_trigram_files entry

issue-586: symbol_index is keyed by sym.name slices owned by the file's
outline; re-indexing (or deleting) the file that first inserted a shared
name frees the key bytes while the entry survives via other files'
locations — every later probe reads freed memory and the name silently
drops out of the O(1) index.

issue-587: Explorer.removeFile never removes the skip_trigram_files
entry, whose key aliases the outlines key it frees — the tier-3 search
scan then iterates a dangling path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… and removal

#586: symbol_index now owns its keys — duped on first insert, freed when
an emptied entry is pruned and at deinit. Keys previously aliased
sym.name slices owned by the inserting file's outline; re-indexing or
deleting that file freed the bytes the map hashes while shared-name
entries (init/deinit/main in nearly every Zig file) survived, so every
later probe read freed memory and the name silently fell out of the O(1)
index, degrading lookups to the outline scans.

#587: Explorer.removeFile also removes the skip_trigram_files entry,
whose key aliases the outlines key being freed — the tier-3 scan no
longer iterates a dangling path. Removal only, no free: the outlines
loop owns that allocation, matching the re-index path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

justrach and others added 2 commits June 10, 2026 16:22
…sk FIDO key names

ssh-keygen's other default basenames dodge the exact-name list in both
the watcher and snapshot copies: deploy/id_ecdsa is indexed and readable
while deploy/id_rsa is blocked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…SensitivePath copies

All six ssh-keygen default private-key basenames are now in the exact
list (they start with 'i', so the fast path already routes them there);
the #528 parity test keeps the watcher and snapshot copies aligned.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
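A minimal sketch of the exact-basename check, written in Python for illustration. The six names are the default private-key file names documented for OpenSSH's ssh-keygen; the function below is a hypothetical stand-in for the project's `isSensitivePath`, which also covers patterns beyond SSH keys that are not modeled here.

```python
import posixpath

# OpenSSH's default private-key basenames; the "_sk" variants are FIDO-backed.
SSH_PRIVATE_KEY_NAMES = frozenset({
    "id_rsa", "id_dsa", "id_ecdsa", "id_ecdsa_sk", "id_ed25519", "id_ed25519_sk",
})


def is_sensitive_path(path: str) -> bool:
    """Exact-basename check only; the real filter covers more than SSH keys."""
    return posixpath.basename(path) in SSH_PRIVATE_KEY_NAMES
```

With this list, `deploy/id_ecdsa` and `deploy/id_rsa` are both blocked, which is the parity the earlier commit's failing case called for.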
@justrach justrach changed the title from "fix(#583,#584): word index stale postings for bulk-loaded files; ContentCache probe-window violations" to "fix(#583,#584,#586,#587,#589): memory-bug hunt — stale word-index postings, cache probe-window violations, symbol_index UAF, dangling skip_trigram path, missing SSH key filters" Jun 10, 2026
@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 115069 | 109727 | -4.64% | -5342 | OK |
| codedb_changes | 11017 | 11179 | +1.47% | +162 | OK |
| codedb_context | 1295506 | 1276506 | -1.47% | -19000 | OK |
| codedb_deps | 325 | 329 | +1.23% | +4 | OK |
| codedb_edit | 48771 | 59904 | +22.83% | +11133 | NOISE |
| codedb_find | 10539 | 11192 | +6.20% | +653 | OK |
| codedb_hot | 28535 | 30025 | +5.22% | +1490 | OK |
| codedb_outline | 36140 | 35465 | -1.87% | -675 | OK |
| codedb_read | 17575 | 20283 | +15.41% | +2708 | NOISE |
| codedb_search | 28995 | 31360 | +8.16% | +2365 | OK |
| codedb_snapshot | 71649 | 72146 | +0.69% | +497 | OK |
| codedb_status | 9801 | 10137 | +3.43% | +336 | OK |
| codedb_symbol | 51709 | 50589 | -2.17% | -1120 | OK |
| codedb_tree | 43435 | 44975 | +3.55% | +1540 | OK |
| codedb_word | 12501 | 12884 | +3.06% | +383 | OK |

@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 114367 | 109096 | -4.61% | -5271 | OK |
| codedb_changes | 11406 | 10835 | -5.01% | -571 | OK |
| codedb_context | 1308617 | 1217147 | -6.99% | -91470 | OK |
| codedb_deps | 330 | 333 | +0.91% | +3 | OK |
| codedb_edit | 55258 | 50054 | -9.42% | -5204 | OK |
| codedb_find | 10462 | 11139 | +6.47% | +677 | OK |
| codedb_hot | 27255 | 27505 | +0.92% | +250 | OK |
| codedb_outline | 38347 | 38395 | +0.13% | +48 | OK |
| codedb_read | 18416 | 19999 | +8.60% | +1583 | OK |
| codedb_search | 30212 | 29238 | -3.22% | -974 | OK |
| codedb_snapshot | 76367 | 77139 | +1.01% | +772 | OK |
| codedb_status | 10074 | 9765 | -3.07% | -309 | OK |
| codedb_symbol | 53439 | 53037 | -0.75% | -402 | OK |
| codedb_tree | 43701 | 45051 | +3.09% | +1350 | OK |
| codedb_word | 12614 | 12019 | -4.72% | -595 | OK |

justrach and others added 3 commits June 10, 2026 16:55
… freed prior_content restore

issue-593: AnyTrigramIndex.removeFile no-ops in pure-mmap mode and the
overlay never masks base entries — deleted files stay 'contained' with
live candidates, edited files answer from both old and new content.

issue-594: a fail-once allocator sweep over a re-index aborts the whole
test binary — parseContentForIndexing builds a throwaway full Explorer
per file (4096-slot cache alloc+memset each call) whose ContentCache.init
PANICS on OOM; once that is an error instead, the sweep shows the
trigram-failure errdefer restoring the word index from prior_content that
contents.put already freed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…deinit, no poisoned entries

The fail-once allocator sweep (issue-594 test) surfaced a family of
failure-path defects in the indexFile flow; all are fixed here:

- parseContentForIndexing built a throwaway full Explorer per file — a
  4096-slot ContentCache alloc+memset per call — and ContentCache.init
  PANICS on OOM, so an allocation hiccup mid-indexing killed the daemon.
  Explorer.init now wraps a fallible initFallible, and the parse shell
  uses `try initFallible(allocator, 1)`.
- commitParsedFileOwnedOutline stacked `errdefer owned_outline.deinit()`
  with `defer owned_outline.deinit()` — any post-clone error deinit'd the
  parsed outline twice (double free of every symbol name).
- prior_content was read out of the cache and freed by contents.put two
  lines later while the trigram-failure errdefer still re-indexed it.
  contents.put is now the LAST fallible step, and one function-scoped
  errdefer restores the word index from the still-valid prior_content on
  any failure (including inside word_index.indexFile itself, whose
  internal removeFile has already wiped the old postings).
- removeSymbolIndexFor deinit'd the in-map value (poisoning it) before
  the prune-list append; if that append failed, the map kept an entry
  whose value crashed the next deinit/iteration. Deinit now happens only
  at actual fetchRemove time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
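The ordering fix for prior_content can be modeled in a few lines of Python. This is only a sketch of the control flow: Python's garbage collector hides the use-after-free that the Zig code hit, and `FileStore`, `trigram_step`, and the whitespace tokenizer are hypothetical names introduced for illustration.

```python
class FileStore:
    """Toy store: a content cache plus a word index derived from it."""

    def __init__(self) -> None:
        self.contents = {}    # path -> text
        self.word_index = {}  # path -> set of words

    def index_file(self, path: str, new_text: str, trigram_step) -> None:
        prior = self.contents.get(path)  # stays valid until the final put
        try:
            self.word_index[path] = set(new_text.split())  # old postings gone
            trigram_step(path, new_text)                    # may raise
        except Exception:
            # One restore point for every failure after the word-index update.
            if prior is None:
                self.word_index.pop(path, None)
            else:
                self.word_index[path] = set(prior.split())
            raise
        # Replacing the cached content is the last step, so nothing after it
        # can fail and need the text it just discarded.
        self.contents[path] = new_text
```

Passing a `trigram_step` that raises leaves both the cached text and the word index as they were before the call, which is the property the fail-once allocator sweep checks for.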
…ndex drops orphaned getOrPut entries on dupe failure

#593: MmapOverlay gains a masked path set (owned keys). indexFile and
removeFile mask the path (removeFile on pure-mmap promotes first — a
remove is a write, but only when the base actually tracks the path);
containsFile and the candidates/candidatesRegex merges filter base
answers through the mask (shared mergeOverlayCandidates helper), and
fileCount subtracts a maintained masked-in-base counter.

#594 (same sweep): indexOneToken and mergeShard did getOrPut, then duped
the key — a failed dupe propagated out leaving an entry whose key points
at the tokenizer's stack buffer (or the shard's freed arena) with an
undefined value; the next insert that landed on it dereferenced garbage.
A failed dupe now removes the fresh entry before returning the error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
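The masking scheme for #593 can be pictured as a read-only base, a mutable overlay, and a set of paths whose base answers must be ignored. The Python model below is a sketch under stated assumptions: the class and method names are hypothetical, trigram extraction is simplified, and promotion out of pure-mmap mode is not modeled.

```python
class MaskedOverlayIndex:
    """Toy read-only base index plus a mutable overlay and a mask set."""

    def __init__(self, base: dict) -> None:
        self.base = base         # path -> set of trigrams; never mutated
        self.overlay = {}        # path -> set of trigrams for re-indexed files
        self.masked = set()      # paths whose base entry must be ignored
        self.masked_in_base = 0  # maintained counter used by file_count

    def _mask(self, path: str) -> None:
        if path not in self.masked:
            self.masked.add(path)
            if path in self.base:
                self.masked_in_base += 1

    def index_file(self, path: str, text: str) -> None:
        self._mask(path)
        self.overlay[path] = {text[i:i + 3] for i in range(len(text) - 2)}

    def remove_file(self, path: str) -> None:
        self._mask(path)
        self.overlay.pop(path, None)

    def contains_file(self, path: str) -> bool:
        if path in self.overlay:
            return True
        return path in self.base and path not in self.masked

    def candidates(self, trigram: str) -> set:
        # Base answers pass through the mask before merging with the overlay.
        hits = {p for p, t in self.base.items()
                if trigram in t and p not in self.masked}
        hits |= {p for p, t in self.overlay.items() if trigram in t}
        return hits

    def file_count(self) -> int:
        return len(self.base) - self.masked_in_base + len(self.overlay)
```

In this model an edited file answers only from its overlay entry and a deleted file answers from neither layer, while the base mapping itself is never written to.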
@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 73429 | 76982 | +4.84% | +3553 | OK |
| codedb_changes | 6042 | 6840 | +13.21% | +798 | NOISE |
| codedb_context | 1111675 | 1135141 | +2.11% | +23466 | OK |
| codedb_deps | 291 | 341 | +17.18% | +50 | NOISE |
| codedb_edit | 40333 | 30394 | -24.64% | -9939 | OK |
| codedb_find | 5122 | 5070 | -1.02% | -52 | OK |
| codedb_hot | 14210 | 13358 | -6.00% | -852 | OK |
| codedb_outline | 26051 | 26155 | +0.40% | +104 | OK |
| codedb_read | 11931 | 11707 | -1.88% | -224 | OK |
| codedb_search | 26412 | 26215 | -0.75% | -197 | OK |
| codedb_snapshot | 68862 | 70547 | +2.45% | +1685 | OK |
| codedb_status | 4776 | 5094 | +6.66% | +318 | OK |
| codedb_symbol | 40115 | 42044 | +4.81% | +1929 | OK |
| codedb_tree | 24930 | 27133 | +8.84% | +2203 | OK |
| codedb_word | 6556 | 6635 | +1.21% | +79 | OK |

Concurrent cold CLI calls forked a cli-daemon EACH: no mutual exclusion
on spawn, every duplicate paid the full index scan before noticing the
socket was taken — and the stale-path unlink in cliDaemonListen let late
arrivals steal the winner's live socket, orphaning it mid-scan (observed:
9 daemons for one root, 10-23% CPU each, 0.45s benchmark -> 22s).

The daemon arm now takes <data_dir>/cli-daemon.lock (exclusive,
non-blocking flock) BEFORE the expensive load: losers exit immediately;
the fd is held for the process lifetime and the kernel releases it on any
exit, so crashes never leave a stale lock. The CLI auto-spawn path probes
the same lock and skips the fork while a daemon is alive or starting —
racing callers cold-serve their one call and the next call hits the
winner's socket.

Verified live: 8 concurrent cold calls against one root -> exactly 1
daemon (was: several, each rescanning).

The contract test ships in the same commit: the lock primitive did not
exist before the fix, so there is no red state to commit first (the
issue's own evidence is the measured stampede).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
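The lock primitive described above maps onto a standard non-blocking `flock`. The sketch below shows the pattern in Python on a POSIX system; the function name is hypothetical and the project's own implementation is in Zig, so treat this as an illustration of the mechanism rather than the actual code. The lock file name `cli-daemon.lock` is taken from the commit message.

```python
import fcntl
import os


def try_acquire_daemon_lock(data_dir: str):
    """Return an open fd holding the lock, or None if another holder exists."""
    path = os.path.join(data_dir, "cli-daemon.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Exclusive and non-blocking: fail at once instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd  # keep open for the process lifetime; the kernel unlocks on exit
```

A daemon would call this before its expensive load and exit immediately on `None`. Because the lock is tied to the open file description and not to the file's existence, a crashed holder leaves nothing to clean up.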
@justrach justrach changed the title from "fix(#583,#584,#586,#587,#589): memory-bug hunt — stale word-index postings, cache probe-window violations, symbol_index UAF, dangling skip_trigram path, missing SSH key filters" to "fix: memory/RSS + daemon-lifecycle bug hunt — 8 issues (#583 #584 #586 #587 #589 #592 #593 #594)" Jun 10, 2026
@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 106458 | 108384 | +1.81% | +1926 | OK |
| codedb_changes | 12093 | 10696 | -11.55% | -1397 | OK |
| codedb_context | 1227728 | 1269907 | +3.44% | +42179 | OK |
| codedb_deps | 332 | 307 | -7.53% | -25 | OK |
| codedb_edit | 53582 | 34607 | -35.41% | -18975 | OK |
| codedb_find | 9279 | 10216 | +10.10% | +937 | NOISE |
| codedb_hot | 25733 | 25605 | -0.50% | -128 | OK |
| codedb_outline | 33499 | 38117 | +13.79% | +4618 | NOISE |
| codedb_read | 15810 | 15783 | -0.17% | -27 | OK |
| codedb_search | 27379 | 27300 | -0.29% | -79 | OK |
| codedb_snapshot | 72793 | 65661 | -9.80% | -7132 | OK |
| codedb_status | 9049 | 9615 | +6.25% | +566 | OK |
| codedb_symbol | 49591 | 50704 | +2.24% | +1113 | OK |
| codedb_tree | 40298 | 47307 | +17.39% | +7009 | NOISE |
| codedb_word | 10726 | 10757 | +0.29% | +31 | OK |

@justrach justrach left a comment

Owner Author

Automated review (Sonnet pass + verification), defects-only scope:

Verdict: safe to merge. All 8 filed bugs verified genuinely closed; no new memory-safety defects, double-frees, or UAFs found in the bundle.

Two low-severity, non-blocking observations:

  1. #592 spawn-race fix is a window-shrink, not full TOCTOU elimination: daemonLockAvailable releases the probe lock before the fork, so under a true stampede N callers can still fork N daemons that race daemonLockTryAcquire; losers exit before the expensive load. Outcome (no duplicate running daemons, no rescan churn) matches the issue, just with N−1 wasted forks in the worst case. The comment's 'losers simply cold-serve' slightly undersells that they spawn-and-exit.

  2. MmapOverlay.mask() swallows OOM silently — if the dupe/put fails, the base entry stays unmasked and containsFile keeps answering true for stale content for the duration of memory pressure (same degraded behavior as pre-fix, so not a regression). Worth a follow-up: log the swallow or propagate so indexFile can retry.

Cleared after explicit verification (so nobody re-checks): #594 prior_content UAF ordering and owned_outline error-path deinit; #587 skip_trigram_files key lifetime (remove-before-free, value-equality lookup); #586 symbol_index key dup/free symmetry; #583 skip_file_words separate allocations (no double-free); #584 ContentCache sentinel removal correctness for the 8-probe window; mergeOverlayCandidates frees on all paths; isSensitivePath updated in both copies (snapshot+watcher; id_rsa_sk omission is correct — no such OpenSSH name); parseContentForIndexing capacity=1 harmless; word_index_generation bump on failed trigram = extra-but-valid disk write.

🤖 Generated with Claude Code

@justrach justrach merged commit 287821f into release/0.2.5825 Jun 10, 2026
1 check passed
@justrach justrach deleted the fix/word-index-stale-postings branch June 10, 2026 10:25
@github-actions


Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 114250 | 114426 | +0.15% | +176 | OK |
| codedb_changes | 14312 | 11638 | -18.68% | -2674 | OK |
| codedb_context | 1206854 | 1214754 | +0.65% | +7900 | OK |
| codedb_deps | 364 | 349 | -4.12% | -15 | OK |
| codedb_edit | 37726 | 39525 | +4.77% | +1799 | OK |
| codedb_find | 11440 | 10176 | -11.05% | -1264 | OK |
| codedb_hot | 29231 | 30631 | +4.79% | +1400 | OK |
| codedb_outline | 37387 | 39587 | +5.88% | +2200 | OK |
| codedb_read | 20367 | 19549 | -4.02% | -818 | OK |
| codedb_search | 32369 | 31236 | -3.50% | -1133 | OK |
| codedb_snapshot | 74251 | 72558 | -2.28% | -1693 | OK |
| codedb_status | 9776 | 9793 | +0.17% | +17 | OK |
| codedb_symbol | 51673 | 51640 | -0.06% | -33 | OK |
| codedb_tree | 54501 | 52253 | -4.12% | -2248 | OK |
| codedb_word | 13974 | 12142 | -13.11% | -1832 | OK |
