Search index: replace ComicFTS bulk_update with delete+bulk_create#683
Merged

ajslater merged 1 commit into v1.11-performance on May 2, 2026
Conversation
FTS5 makes ``bulk_update`` expensive in two stacked ways: SQLite's CASE-WHEN parser cost grows in (rows × cols), and every UPDATE on an FTS5 virtual table is internally a delete+reinsert at the segment level anyway. A multi-row INSERT is parsed once and writes straight to the segment, so swapping ``bulk_update`` for ``delete + bulk_create`` is a clear win.

Synthetic benchmark ([tests/perf/bench_fts_sync.py](tests/perf/bench_fts_sync.py)) on 1k / 10k / 50k row tables, 100-10k affected rows, 1 / 3 fields:

| table | affected | fields | bulk_update | delete+create |
|---|---|---|---|---|
| 10,000 | 1,000 | 1 | 207ms | 135ms (1.5x) |
| 10,000 | 1,000 | 3 | 371ms | 125ms (3.0x) |
| 10,000 | 10,000 | 1 | 1,929ms | 1,038ms (1.9x) |
| 10,000 | 10,000 | 3 | 3,290ms | 991ms (3.3x) |

Validated on a real codex DB (10,036 ComicFTS rows from ~/Milliways/Comics/full): same shape, B wins every combo, gap widens with field count (1.1-2.1x).

Two call sites converted:

* ``SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates`` - the cron-path m2m sync. Restructured so each affected comic is fetched once even when several m2m fields changed for it (the previous loop appended one ``ComicFTS`` instance per (comic, field) pair, which ``bulk_update`` then resolved unpredictably when duplicate pks landed in the same call).
* ``SearchIndexCreateUpdateImporter._update_search_index_create_or_update`` - the importer's per-chunk update branch. Inherits ``_delete_then_create_comicfts`` for the atomic swap.

Both wrap the swap in ``transaction.atomic`` so an interrupted run leaves the original row in place rather than a comic with no FTS row. Chunked DELETE keeps the IN-clause under SQLite's 32k host parameter limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
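The swap's shape can be sketched with the stdlib ``sqlite3`` module against a plain table (the real target is the FTS5-backed ``ComicFTS`` model, and ``delete_then_create`` / the batch size here are illustrative stand-ins for ``_delete_then_create_comicfts`` / ``_FTS_BATCH_SIZE``):

```python
import sqlite3

# _FTS_BATCH_SIZE in the PR is 500; the value only needs to keep each
# IN-clause under SQLite's host-parameter limit (32,766 by default).
BATCH_SIZE = 500


def delete_then_create(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Chunked DELETE followed by one multi-row insert, in one transaction.

    Sketch of the delete+create pattern, not the codex implementation.
    """
    pks = [pk for pk, _ in rows]
    with conn:  # one transaction: an interrupted run rolls back, old rows survive
        for i in range(0, len(pks), BATCH_SIZE):
            chunk = pks[i : i + BATCH_SIZE]
            qmarks = ",".join("?" * len(chunk))
            conn.execute(f"DELETE FROM comicfts WHERE comic_id IN ({qmarks})", chunk)
        # Multi-row INSERT: parsed once, unlike a per-row CASE-WHEN UPDATE.
        conn.executemany("INSERT INTO comicfts (comic_id, body) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comicfts (comic_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO comicfts VALUES (?, ?)", [(1, "old"), (2, "old")])
delete_then_create(conn, [(1, "new one"), (2, "new two")])
replaced = conn.execute(
    "SELECT comic_id, body FROM comicfts ORDER BY comic_id"
).fetchall()
```

The ``with conn`` block plays the role of ``transaction.atomic``: either both the DELETE chunks and the INSERT commit, or neither does.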
ajslater added a commit that referenced this pull request on May 2, 2026
Fix KeyError surfacing as ``KeyError: 'fts_updated_m2ms'`` during
``full_text_search`` when an import only created (or only updated)
comics without producing any m2m link deletions/insertions in the
link phase. ``FTS_UPDATED_M2MS`` is only populated when m2m rows
actually change, but ``sync_fts_for_m2m_updates`` runs
unconditionally as part of the post-phases and indexed the dict
directly. Use ``.get(FTS_UPDATED_M2MS, {})`` so the early-return
path fires cleanly when nothing's there.
Regression from #683 — the original ``.get(...)`` guard didn't
survive the refactor that hoisted the field-name iteration into
``_gather_m2m_field_values``.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
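Both the restored guard and the hoisted gathering step reduce to plain dict handling; a minimal sketch in which the input record format and both function bodies are invented for illustration (only the key name and the ``{pk: {field: value}}`` shape come from the commit):

```python
from collections import defaultdict

FTS_UPDATED_M2MS = "fts_updated_m2ms"  # key named in the traceback


def gather_m2m_field_values(link_results: dict) -> dict:
    """Collapse per-(comic, field) change records to {pk: {field: value}}."""
    # .get with a default restores the guard: create-only imports never
    # populate the key, so an empty dict must fall through cleanly.
    changes = link_results.get(FTS_UPDATED_M2MS, {})
    by_pk: dict = defaultdict(dict)
    for pk, fields in changes.items():
        for field, value in fields:
            by_pk[pk][field] = value
    return dict(by_pk)


def sync_fts_for_m2m_updates(link_results: dict) -> int:
    """Return how many comics need their FTS row rebuilt; 0 when none."""
    gathered = gather_m2m_field_values(link_results)
    if not gathered:
        return 0  # the early-return path that used to raise KeyError
    return len(gathered)


# Create-only import: the key was never set; must not raise.
none_synced = sync_fts_for_m2m_updates({})
# One comic with two changed m2m fields collapses to a single entry.
one_synced = sync_fts_for_m2m_updates(
    {FTS_UPDATED_M2MS: {1: [("characters", "Spider-Man"), ("teams", "Avengers")]}}
)
```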
ajslater added a commit that referenced this pull request on May 2, 2026
The JSD nightly janitor was added in #681 to clean up duplicate ``codex_comicfts`` rows produced by the (now-fixed) sync.py iteration bug. With the source bugs fixed (#681 watermark walk; #683/#684 delete+create swap), no new duplicates can land, so a permanent recurring task isn't earning its keep — it would just mask any future regression instead of surfacing it.

Move the cleanup to the migration boundary instead:

* ``codex/migrations/0039_…`` gains a ``RunSQL`` step that runs the same ``DELETE … WHERE rowid NOT IN (SELECT MIN(rowid) … GROUP BY comic_id)`` ahead of the existing FTS DROP+CREATE. For v1.10 -> v1.11 fresh upgrades the DROP makes the dedupe a no-op; the step keeps the migration idempotent if a future change preserves data instead of dropping.
* Drop the ``JSD`` entry from ``_LIBRARIAN_STATUS_CHOICES`` since the task it referenced no longer exists.
* Remove ``JanitorFTSDedupeTask``, ``JanitorDBFTSDedupeStatus``, the ``fts_dedupe`` function and method, and all wiring in ``janitor.py`` (``_NIGHTLY_TASK_CLASSES``, ``_JANITOR_METHOD_MAP``, ``_JANITOR_STATII``).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
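The dedupe statement the migration inherits can be exercised with stdlib ``sqlite3``; a plain table stands in for the FTS5 virtual table here (both expose ``rowid``, which is all the statement relies on), and the duplicate rows are fabricated for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codex_comicfts (comic_id INTEGER, body TEXT)")
# Simulate the duplicates the old sync.py iteration bug could leave behind.
conn.executemany(
    "INSERT INTO codex_comicfts VALUES (?, ?)",
    [(1, "a"), (1, "a dup"), (2, "b"), (2, "b dup"), (3, "c")],
)

# Same statement shape as the migration's RunSQL step: keep the earliest
# rowid per comic_id, delete the rest. Running it twice is a no-op, which
# is what keeps the migration idempotent.
conn.execute(
    "DELETE FROM codex_comicfts WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM codex_comicfts GROUP BY comic_id)"
)

remaining = conn.execute(
    "SELECT comic_id, body FROM codex_comicfts ORDER BY comic_id"
).fetchall()
```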
Summary
Swap `ComicFTS.objects.bulk_update(...)` for `delete + bulk_create` (wrapped in `transaction.atomic`) at the two production update sites.

FTS5 makes `bulk_update` expensive in two stacked ways:

1. `CASE WHEN id THEN val …` parser cost grows in (rows × cols).
2. Every UPDATE on an FTS5 virtual table is internally a delete+reinsert at the segment level.

A multi-row `INSERT` is parsed once and writes straight to the segment.

Benchmarks
tests/perf/bench_fts_sync.py — synthetic seeded snapshot, 5 runs per combo, median ms.
Synthetic (1k / 10k tables)
| table | affected | fields | bulk_update | delete+create |
|---|---|---|---|---|
| 10,000 | 1,000 | 1 | 207ms | 135ms (1.5x) |
| 10,000 | 1,000 | 3 | 371ms | 125ms (3.0x) |
| 10,000 | 10,000 | 1 | 1,929ms | 1,038ms (1.9x) |
| 10,000 | 10,000 | 3 | 3,290ms | 991ms (3.3x) |

Real DB (Milliways, 10,036 ComicFTS rows): bulk_update vs. delete+create

Pattern holds across both: B wins every combo, gap widens with field count. The real-DB win is more modest at 1 field (~10–20%) but doubles at 3 fields, and the m2m sync case routinely touches multiple fields per affected comic.
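The 5-runs-per-combo, median-ms methodology is simple to reproduce; a minimal harness sketch (the workload lambda stands in for one benchmark combo, and none of these names come from `bench_fts_sync.py` itself):

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], object], runs: int = 5) -> float:
    """Median wall-clock milliseconds over `runs` calls of `fn`.

    Mirrors the 5-runs-per-combo, median-ms reporting used above;
    perf_counter avoids clock drift that time.time() can show.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Toy workload standing in for one bulk_update / delete+create combo.
ms = median_ms(lambda: sum(range(100_000)))
```

Median rather than mean keeps a single cold-cache or GC-hit run from skewing a 5-sample combo.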
Changes
* `SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates` — restructured so each affected comic is fetched exactly once even when several m2m fields changed for it. The previous loop appended one `ComicFTS` instance per (comic, field) pair, which `bulk_update` then resolved unpredictably when duplicate pks landed in the same call. New `_gather_m2m_field_values` collapses to `{pk: {field: value}}`, then `_build_replacement_objs` does one `filter(pk__in=...)` and applies all field updates in memory before the swap.
* `SearchIndexCreateUpdateImporter._update_search_index_create_or_update` — update branch now calls the inherited `_delete_then_create_comicfts` instead of `bulk_update`.
* `_delete_then_create_comicfts` (new, on the m2m sync importer) — does the chunked DELETE + `bulk_create` inside `transaction.atomic` so an interrupted run leaves the original row in place rather than a comic with no FTS row at all. `_FTS_BATCH_SIZE = 500` keeps the `pk__in` clause under SQLite's 32k host parameter limit.

Test plan
* `make test-python T=tests/importer/` — green.
* `ruff check`, `basedpyright` — clean.
* `~/Milliways/Comics/full` (10,036 ComicFTS rows, deduped via the janitor from Importer search: handle duplicate FTS5 rows / missing FTS_UPDATE entries gracefully #681). Same pattern.

🤖 Generated with Claude Code