
Search index: replace ComicFTS bulk_update with delete+bulk_create #683

Merged

ajslater merged 1 commit into v1.11-performance from claude/fts-sync-delete-create on May 2, 2026

Conversation

ajslater (Owner) commented on May 2, 2026

Summary

Swap ComicFTS.objects.bulk_update(...) for delete + bulk_create (wrapped in transaction.atomic) at the two production update sites. FTS5 makes bulk_update expensive in two stacked ways:

  1. SQLite's CASE WHEN id THEN val … parser cost grows with rows × columns.
  2. Every UPDATE on an FTS5 virtual table is internally a delete-then-reinsert at the segment level anyway, so the CASE WHEN overhead is purely waste.

A multi-row INSERT is parsed once and writes straight to the segment.
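The shape difference can be seen directly in raw SQL. A minimal sketch, assuming a plain table standing in for the FTS5 virtual table (the statement shapes, not FTS5 itself, are the point here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comicfts (comic_id INTEGER PRIMARY KEY, characters TEXT)")
con.executemany(
    "INSERT INTO comicfts VALUES (?, ?)", [(i, "old") for i in range(1, 4)]
)

# A) What bulk_update compiles to: one CASE WHEN arm per (row, column),
# so the statement text -- and SQLite's parse cost -- grows with rows x cols.
con.execute(
    "UPDATE comicfts SET characters = CASE comic_id "
    "WHEN 1 THEN 'alpha' WHEN 2 THEN 'beta' WHEN 3 THEN 'gamma' END "
    "WHERE comic_id IN (1, 2, 3)"
)

# B) delete + multi-row insert: parsed once, values bound as parameters,
# reaching the same end state.
with con:  # transaction: either both statements land or neither does
    con.execute("DELETE FROM comicfts WHERE comic_id IN (1, 2, 3)")
    con.executemany(
        "INSERT INTO comicfts VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma")],
    )

rows = con.execute("SELECT characters FROM comicfts ORDER BY comic_id").fetchall()
```

On a real FTS5 table, variant A additionally pays the internal delete-then-reinsert per row, which is why the gap widens with field count.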

Benchmarks

tests/perf/bench_fts_sync.py — synthetic seeded snapshot, 5 runs per combo, median ms.

Synthetic (1k / 10k tables)

| Table | Affected | Fields | A bulk_update | B delete+create | B vs A |
|------:|---------:|-------:|--------------:|----------------:|-------:|
| 1,000 | 1,000 | 1 | 199ms | 105ms | 1.9× |
| 1,000 | 1,000 | 3 | 361ms | 104ms | 3.5× |
| 10,000 | 1,000 | 1 | 207ms | 135ms | 1.5× |
| 10,000 | 1,000 | 3 | 371ms | 125ms | 3.0× |
| 10,000 | 10,000 | 1 | 1,929ms | 1,038ms | 1.9× |
| 10,000 | 10,000 | 3 | 3,290ms | 991ms | 3.3× |

Real DB (Milliways, 10,036 ComicFTS rows)

| Affected | Fields | A bulk_update | B delete+create | B vs A |
|---------:|-------:|--------------:|----------------:|-------:|
| 100 | 1 | 33ms | 29ms | 1.1× |
| 100 | 3 | 46ms | 29ms | 1.6× |
| 1,000 | 1 | 219ms | 193ms | 1.1× |
| 1,000 | 3 | 381ms | 184ms | 2.1× |
| 5,000 | 1 | 1,013ms | 854ms | 1.2× |
| 5,000 | 3 | 1,726ms | 863ms | 2.0× |

Pattern holds across both: B wins every combo, gap widens with field count. The real-DB win is more modest at 1 field (~10–20%) but doubles at 3 fields, and the m2m sync case routinely touches multiple fields per affected comic.

Changes

  • SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates — restructured so each affected comic is fetched exactly once even when several m2m fields changed for it. The previous loop appended one ComicFTS instance per (comic, field) pair, which bulk_update then resolved unpredictably when duplicate pks landed in the same call. New _gather_m2m_field_values collapses to {pk: {field: value}}, then _build_replacement_objs does one filter(pk__in=...) and applies all field updates in memory before the swap.
  • SearchIndexCreateUpdateImporter._update_search_index_create_or_update — update branch now calls the inherited _delete_then_create_comicfts instead of bulk_update.
  • _delete_then_create_comicfts (new, on the m2m sync importer) — does the chunked DELETE + bulk_create inside transaction.atomic so an interrupted run leaves the original row in place rather than a comic with no FTS row at all. _FTS_BATCH_SIZE = 500 keeps the pk__in clause under SQLite's 32k host parameter limit.
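The collapse and the chunking can be sketched in plain Python. This is a hypothetical stand-in for `_gather_m2m_field_values` and the `_FTS_BATCH_SIZE` chunking, not the actual importer code:

```python
from itertools import islice

FTS_BATCH_SIZE = 500  # keeps each pk__in / IN clause well under SQLite's
                      # host-parameter limit


def gather_m2m_field_values(updates):
    """Collapse (pk, field, value) triples to {pk: {field: value}}.

    Each affected comic then appears exactly once, even when several m2m
    fields changed for it -- avoiding the duplicate-pk ambiguity that
    bulk_update resolved unpredictably.
    """
    collapsed = {}
    for pk, field, value in updates:
        collapsed.setdefault(pk, {})[field] = value
    return collapsed


def batched(iterable, n):
    """Yield successive chunks of at most n items (for chunked DELETEs)."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk
```

With the collapsed dict in hand, one `filter(pk__in=...)` per batch fetches the originals, the field updates are applied in memory, and the delete+create swap runs per chunk inside `transaction.atomic`.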

Test plan

🤖 Generated with Claude Code

FTS5 makes ``bulk_update`` expensive in two stacked ways: SQLite's
CASE-WHEN parser cost grows in (rows x cols), and every UPDATE on
an FTS5 virtual table is internally a delete+reinsert at the
segment level anyway. A multi-row INSERT is parsed once and writes
straight to the segment, so swapping ``bulk_update`` for
``delete + bulk_create`` is a clear win.

Synthetic benchmark
([tests/perf/bench_fts_sync.py](tests/perf/bench_fts_sync.py)) on
1k / 10k / 50k row tables, 100-10k affected rows, 1 / 3 fields:

  table    affected fields   bulk_update   delete+create
  10,000      1,000      1         207ms         135ms (1.5x)
  10,000      1,000      3         371ms         125ms (3.0x)
  10,000     10,000      1       1,929ms       1,038ms (1.9x)
  10,000     10,000      3       3,290ms         991ms (3.3x)

Validated on a real codex DB (10,036 ComicFTS rows from
~/Milliways/Comics/full): same shape, B wins every combo, gap
widens with field count (1.1-2.1x).

Two call sites converted:

* ``SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates`` -
  the cron-path m2m sync. Restructured so each affected comic is
  fetched once even when several m2m fields changed for it (the
  previous loop appended one ``ComicFTS`` instance per
  (comic, field) pair, which ``bulk_update`` then resolved
  unpredictably when duplicate pks landed in the same call).

* ``SearchIndexCreateUpdateImporter._update_search_index_create_or_update``
  - the importer's per-chunk update branch. Inherits
  ``_delete_then_create_comicfts`` for the atomic swap.

Both wrap the swap in ``transaction.atomic`` so an interrupted run
leaves the original row in place rather than a comic with no FTS
row. Chunked DELETE keeps the IN-clause under SQLite's 32k host
parameter limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajslater ajslater merged commit 0a6d0b2 into v1.11-performance May 2, 2026
1 check failed
ajslater added a commit that referenced this pull request May 2, 2026
Fix a ``KeyError: 'fts_updated_m2ms'`` raised during
``full_text_search`` when an import only created (or only updated)
comics without producing any m2m link deletions/insertions in the
link phase. ``FTS_UPDATED_M2MS`` is only populated when m2m rows
actually change, but ``sync_fts_for_m2m_updates`` runs
unconditionally as part of the post-phases and indexed the dict
directly. Use ``.get(FTS_UPDATED_M2MS, {})`` so the early-return
path fires cleanly when nothing's there.
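The guard described above amounts to the following sketch (function body is hypothetical; only the `.get` guard mirrors the fix):

```python
FTS_UPDATED_M2MS = "fts_updated_m2ms"


def sync_fts_for_m2m_updates(metadata):
    # The key is only populated when m2m rows actually changed, but this
    # sync runs unconditionally -- so default to an empty dict instead of
    # indexing the metadata directly.
    updated = metadata.get(FTS_UPDATED_M2MS, {})
    if not updated:
        return 0  # early-return path fires cleanly when nothing's there
    # Real code would rebuild the FTS rows here; this sketch just reports
    # how many comics need the delete+create swap.
    return len(updated)
```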

Regression from #683 — the original ``.get(...)`` guard didn't
survive the refactor that hoisted the field-name iteration into
``_gather_m2m_field_values``.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ajslater added a commit that referenced this pull request May 2, 2026
The JSD nightly janitor was added in #681 to clean up duplicate
``codex_comicfts`` rows produced by the (now-fixed) sync.py
iteration bug. With the source bugs fixed (#681 watermark walk;
#683/#684 delete+create swap), no new duplicates can land, so a
permanent recurring task isn't earning its keep — it would just
mask any future regression instead of surfacing it.

Move the cleanup to the migration boundary instead:

* ``codex/migrations/0039_…`` gains a ``RunSQL`` step that runs the
  same ``DELETE … WHERE rowid NOT IN (SELECT MIN(rowid) … GROUP BY
  comic_id)`` ahead of the existing FTS DROP+CREATE. For v1.10 ->
  v1.11 fresh upgrades the DROP makes the dedupe a no-op; the
  step keeps the migration idempotent if a future change preserves
  data instead of dropping.
* Drop the ``JSD`` entry from ``_LIBRARIAN_STATUS_CHOICES`` since
  the task it referenced no longer exists.
* Remove ``JanitorFTSDedupeTask``, ``JanitorDBFTSDedupeStatus``,
  the ``fts_dedupe`` function and method, and all wiring in
  ``janitor.py`` (``_NIGHTLY_TASK_CLASSES``, ``_JANITOR_METHOD_MAP``,
  ``_JANITOR_STATII``).
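The dedupe statement the migration runs can be exercised against a plain table. A sketch, assuming the elided SQL fills in as a straightforward keep-the-earliest-rowid dedupe (the real ``RunSQL`` step targets the FTS5 table ahead of the DROP+CREATE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE codex_comicfts (comic_id INTEGER, body TEXT)")
con.executemany(
    "INSERT INTO codex_comicfts VALUES (?, ?)",
    [(1, "a"), (1, "a-dup"), (2, "b"), (2, "b-dup"), (3, "c")],
)

# Keep the earliest rowid per comic_id, drop the rest. Running it a
# second time deletes nothing, so the migration step stays idempotent.
DEDUPE_SQL = (
    "DELETE FROM codex_comicfts WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM codex_comicfts GROUP BY comic_id)"
)
con.execute(DEDUPE_SQL)
rows = con.execute(
    "SELECT comic_id, body FROM codex_comicfts ORDER BY comic_id"
).fetchall()
```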

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>