
Search index: replace ComicFTS bulk_update with delete+bulk_create #683

Merged

ajslater merged 1 commit into v1.11-performance from claude/fts-sync-delete-create on May 2, 2026

Conversation

ajslater (Owner) commented on May 2, 2026

Summary

Swap ComicFTS.objects.bulk_update(...) for delete + bulk_create (wrapped in transaction.atomic) at the two production update sites. FTS5 makes bulk_update expensive in two stacked ways:

  1. SQLite's CASE WHEN id THEN val … parser cost grows with rows × columns.
  2. Every UPDATE on an FTS5 virtual table is internally a delete-then-reinsert at the segment level anyway, so the CASE WHEN overhead is purely waste.

A multi-row INSERT is parsed once and writes straight to the segment.
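The shape difference can be seen directly in raw SQL. A minimal sketch, assuming a plain table standing in for the FTS5 virtual table (the statement shapes, not FTS5 itself, are the point here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comicfts (comic_id INTEGER PRIMARY KEY, characters TEXT)")
con.executemany(
    "INSERT INTO comicfts VALUES (?, ?)", [(i, "old") for i in range(1, 4)]
)

# A) What bulk_update compiles to: one CASE WHEN arm per (row, column),
# so the statement text -- and SQLite's parse cost -- grows with rows x cols.
con.execute(
    "UPDATE comicfts SET characters = CASE comic_id "
    "WHEN 1 THEN 'alpha' WHEN 2 THEN 'beta' WHEN 3 THEN 'gamma' END "
    "WHERE comic_id IN (1, 2, 3)"
)

# B) delete + multi-row insert: parsed once, values bound as parameters,
# reaching the same end state.
with con:  # transaction: either both statements land or neither does
    con.execute("DELETE FROM comicfts WHERE comic_id IN (1, 2, 3)")
    con.executemany(
        "INSERT INTO comicfts VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma")],
    )

rows = con.execute("SELECT characters FROM comicfts ORDER BY comic_id").fetchall()
```

On a real FTS5 table, variant A additionally pays the internal delete-then-reinsert per row, which is why the gap widens with field count.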

Benchmarks

tests/perf/bench_fts_sync.py — synthetic seeded snapshot, 5 runs per combo, median ms.

Synthetic (1k / 10k tables)

| Table | Affected | Fields | A bulk_update | B delete+create | B vs A |
|------:|---------:|-------:|--------------:|----------------:|-------:|
| 1,000 | 1,000 | 1 | 199ms | 105ms | 1.9× |
| 1,000 | 1,000 | 3 | 361ms | 104ms | 3.5× |
| 10,000 | 1,000 | 1 | 207ms | 135ms | 1.5× |
| 10,000 | 1,000 | 3 | 371ms | 125ms | 3.0× |
| 10,000 | 10,000 | 1 | 1,929ms | 1,038ms | 1.9× |
| 10,000 | 10,000 | 3 | 3,290ms | 991ms | 3.3× |

Real DB (Milliways, 10,036 ComicFTS rows)

| Affected | Fields | A bulk_update | B delete+create | B vs A |
|---------:|-------:|--------------:|----------------:|-------:|
| 100 | 1 | 33ms | 29ms | 1.1× |
| 100 | 3 | 46ms | 29ms | 1.6× |
| 1,000 | 1 | 219ms | 193ms | 1.1× |
| 1,000 | 3 | 381ms | 184ms | 2.1× |
| 5,000 | 1 | 1,013ms | 854ms | 1.2× |
| 5,000 | 3 | 1,726ms | 863ms | 2.0× |

Pattern holds across both: B wins every combo, gap widens with field count. The real-DB win is more modest at 1 field (~10–20%) but doubles at 3 fields, and the m2m sync case routinely touches multiple fields per affected comic.

Changes

  • SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates — restructured so each affected comic is fetched exactly once even when several m2m fields changed for it. The previous loop appended one ComicFTS instance per (comic, field) pair, which bulk_update then resolved unpredictably when duplicate pks landed in the same call. New _gather_m2m_field_values collapses to {pk: {field: value}}, then _build_replacement_objs does one filter(pk__in=...) and applies all field updates in memory before the swap.
  • SearchIndexCreateUpdateImporter._update_search_index_create_or_update — update branch now calls the inherited _delete_then_create_comicfts instead of bulk_update.
  • _delete_then_create_comicfts (new, on the m2m sync importer) — does the chunked DELETE + bulk_create inside transaction.atomic so an interrupted run leaves the original row in place rather than a comic with no FTS row at all. _FTS_BATCH_SIZE = 500 keeps the pk__in clause under SQLite's 32k host parameter limit.
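The collapse and the chunking can be sketched in plain Python. This is a hypothetical stand-in for `_gather_m2m_field_values` and the `_FTS_BATCH_SIZE` chunking, not the actual importer code:

```python
from itertools import islice

FTS_BATCH_SIZE = 500  # keeps each pk__in / IN clause well under SQLite's
                      # host-parameter limit


def gather_m2m_field_values(updates):
    """Collapse (pk, field, value) triples to {pk: {field: value}}.

    Each affected comic then appears exactly once, even when several m2m
    fields changed for it -- avoiding the duplicate-pk ambiguity that
    bulk_update resolved unpredictably.
    """
    collapsed = {}
    for pk, field, value in updates:
        collapsed.setdefault(pk, {})[field] = value
    return collapsed


def batched(iterable, n):
    """Yield successive chunks of at most n items (for chunked DELETEs)."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk
```

With the collapsed dict in hand, one `filter(pk__in=...)` per batch fetches the originals, the field updates are applied in memory, and the delete+create swap runs per chunk inside `transaction.atomic`.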

Test plan

🤖 Generated with Claude Code

FTS5 makes ``bulk_update`` expensive in two stacked ways: SQLite's
CASE-WHEN parser cost grows in (rows x cols), and every UPDATE on
an FTS5 virtual table is internally a delete+reinsert at the
segment level anyway. A multi-row INSERT is parsed once and writes
straight to the segment, so swapping ``bulk_update`` for
``delete + bulk_create`` is a clear win.

Synthetic benchmark
([tests/perf/bench_fts_sync.py](tests/perf/bench_fts_sync.py)) on
1k / 10k / 50k row tables, 100-10k affected rows, 1 / 3 fields:

  table    affected fields   bulk_update   delete+create
  10,000      1,000      1         207ms         135ms (1.5x)
  10,000      1,000      3         371ms         125ms (3.0x)
  10,000     10,000      1       1,929ms       1,038ms (1.9x)
  10,000     10,000      3       3,290ms         991ms (3.3x)

Validated on a real codex DB (10,036 ComicFTS rows from
~/Milliways/Comics/full): same shape, B wins every combo, gap
widens with field count (1.1-2.1x).

Two call sites converted:

* ``SearchIndexSyncManyToManyImporter.sync_fts_for_m2m_updates`` -
  the cron-path m2m sync. Restructured so each affected comic is
  fetched once even when several m2m fields changed for it (the
  previous loop appended one ``ComicFTS`` instance per
  (comic, field) pair, which ``bulk_update`` then resolved
  unpredictably when duplicate pks landed in the same call).

* ``SearchIndexCreateUpdateImporter._update_search_index_create_or_update``
  - the importer's per-chunk update branch. Inherits
  ``_delete_then_create_comicfts`` for the atomic swap.

Both wrap the swap in ``transaction.atomic`` so an interrupted run
leaves the original row in place rather than a comic with no FTS
row. Chunked DELETE keeps the IN-clause under SQLite's 32k host
parameter limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajslater ajslater merged commit 0a6d0b2 into v1.11-performance May 2, 2026
1 check failed
ajslater added a commit that referenced this pull request May 2, 2026
Fix a ``KeyError: 'fts_updated_m2ms'`` raised during
``full_text_search`` when an import only created (or only updated)
comics without producing any m2m link deletions/insertions in the
link phase. ``FTS_UPDATED_M2MS`` is only populated when m2m rows
actually change, but ``sync_fts_for_m2m_updates`` runs
unconditionally as part of the post-phases and indexed the dict
directly. Use ``.get(FTS_UPDATED_M2MS, {})`` so the early-return
path fires cleanly when nothing's there.
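The guard described above amounts to the following sketch (function body is hypothetical; only the `.get` guard mirrors the fix):

```python
FTS_UPDATED_M2MS = "fts_updated_m2ms"


def sync_fts_for_m2m_updates(metadata):
    # The key is only populated when m2m rows actually changed, but this
    # sync runs unconditionally -- so default to an empty dict instead of
    # indexing the metadata directly.
    updated = metadata.get(FTS_UPDATED_M2MS, {})
    if not updated:
        return 0  # early-return path fires cleanly when nothing's there
    # Real code would rebuild the FTS rows here; this sketch just reports
    # how many comics need the delete+create swap.
    return len(updated)
```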

Regression from #683 — the original ``.get(...)`` guard didn't
survive the refactor that hoisted the field-name iteration into
``_gather_m2m_field_values``.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ajslater added a commit that referenced this pull request May 2, 2026
The JSD nightly janitor was added in #681 to clean up duplicate
``codex_comicfts`` rows produced by the (now-fixed) sync.py
iteration bug. With the source bugs fixed (#681 watermark walk;
#683/#684 delete+create swap), no new duplicates can land, so a
permanent recurring task isn't earning its keep — it would just
mask any future regression instead of surfacing it.

Move the cleanup to the migration boundary instead:

* ``codex/migrations/0039_…`` gains a ``RunSQL`` step that runs the
  same ``DELETE … WHERE rowid NOT IN (SELECT MIN(rowid) … GROUP BY
  comic_id)`` ahead of the existing FTS DROP+CREATE. For v1.10 ->
  v1.11 fresh upgrades the DROP makes the dedupe a no-op; the
  step keeps the migration idempotent if a future change preserves
  data instead of dropping.
* Drop the ``JSD`` entry from ``_LIBRARIAN_STATUS_CHOICES`` since
  the task it referenced no longer exists.
* Remove ``JanitorFTSDedupeTask``, ``JanitorDBFTSDedupeStatus``,
  the ``fts_dedupe`` function and method, and all wiring in
  ``janitor.py`` (``_NIGHTLY_TASK_CLASSES``, ``_JANITOR_METHOD_MAP``,
  ``_JANITOR_STATII``).
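The dedupe statement the migration runs can be exercised against a plain table. A sketch, assuming the elided SQL fills in as a straightforward keep-the-earliest-rowid dedupe (the real ``RunSQL`` step targets the FTS5 table ahead of the DROP+CREATE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE codex_comicfts (comic_id INTEGER, body TEXT)")
con.executemany(
    "INSERT INTO codex_comicfts VALUES (?, ?)",
    [(1, "a"), (1, "a-dup"), (2, "b"), (2, "b-dup"), (3, "c")],
)

# Keep the earliest rowid per comic_id, drop the rest. Running it a
# second time deletes nothing, so the migration step stays idempotent.
DEDUPE_SQL = (
    "DELETE FROM codex_comicfts WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM codex_comicfts GROUP BY comic_id)"
)
con.execute(DEDUPE_SQL)
rows = con.execute(
    "SELECT comic_id, body FROM codex_comicfts ORDER BY comic_id"
).fetchall()
```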

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>