fix: Add paginated merge and load-vocab-source command #13

Merged: gkennos merged 11 commits into main from 12-reduce-mem on Jun 15, 2026

@nicoloesch

Collaborator

Summary

Fixes #12

  • Adds paginated merge operations to orm-loader so that large staging-to-target merges commit in bounded batches rather than in a single transaction.
  • Adds a new load-vocab-source command to omop-alchemy with bulk mode, progress feedback, and crash-resilient retry.

orm-loader: paginated merge via _rownum

Staging tables now get a _rownum BIGINT GENERATED ALWAYS AS IDENTITY column at creation time. merge_insert, merge_replace, and merge_upsert all accept a merge_batch_size parameter (default: 1 million rows). For tables larger than one batch, an index on _rownum is built on the staging table and rows are processed in range-keyed batches, each committed independently. This bounds WAL accumulation to one batch per transaction instead of the full table. Small tables (below merge_batch_size) fall through to the original single-statement path.

The COPY statement was updated to include an explicit column list so the identity column is excluded from input.
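The range-keyed batching described above can be sketched as follows. This is a minimal illustration, not the orm-loader implementation: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, an AUTOINCREMENT key as a stand-in for the identity column, and hypothetical table and function names (staging, target, rownum_ranges, paginated_insert).

```python
import sqlite3
from typing import Iterator, Tuple


def rownum_ranges(lo: int, hi: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) _rownum ranges that cover lo..hi."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = lo
    while start <= hi:
        end = min(start + batch_size - 1, hi)
        yield start, end
        start = end + 1


def paginated_insert(conn: sqlite3.Connection, batch_size: int) -> int:
    """Copy staging -> target one _rownum range at a time; return batch count."""
    lo, hi = conn.execute(
        "SELECT MIN(_rownum), MAX(_rownum) FROM staging"
    ).fetchone()
    if lo is None:  # staging table is empty
        return 0
    batches = 0
    for start, end in rownum_ranges(lo, hi, batch_size):
        conn.execute(
            "INSERT INTO target (id, name) "
            "SELECT id, name FROM staging WHERE _rownum BETWEEN ? AND ?",
            (start, end),
        )
        conn.commit()  # each batch is committed as its own transaction
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
# Stand-in for "_rownum BIGINT GENERATED ALWAYS AS IDENTITY" in PostgreSQL.
conn.execute(
    "CREATE TABLE staging "
    "(_rownum INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER, name TEXT)"
)
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
# Explicit column list, so the generated _rownum column is never supplied.
conn.executemany(
    "INSERT INTO staging (id, name) VALUES (?, ?)",
    [(i, f"concept_{i}") for i in range(1, 11)],
)
conn.commit()

n_batches = paginated_insert(conn, batch_size=4)
n_rows = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Because every batch is selected by a closed _rownum range, a failed batch leaves earlier batches committed and later ones untouched, which is what keeps per-transaction work bounded.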

omop-alchemy: load-vocab-source command

New cli_vocab.py implementing a load-vocab-source command with:

  • --bulk-mode: disables FK triggers and drops indexes before loading, then rebuilds after. Substantially faster than per-table management for a full vocabulary reload.
  • --merge-strategy: replace, upsert, or insert_if_empty.
  • --merge-batch-size: passed through to the orm-loader paginated merge.
  • Progress bar with per-phase descriptions including the post-load index rebuild phase, which can take 15+ minutes on concept_ancestor.
  • Crash-resilient retry: if a retryable connection error occurs mid-merge and the strategy is insert_if_empty, the partially loaded table is truncated before retrying. This is safe because FK triggers are disabled via ALTER TABLE ... DISABLE TRIGGER ALL, and that state persists across a crash and recovery.
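The retry behaviour in the last bullet might look roughly like the sketch below. All names here (RetryableConnectionError, load_with_retry, the merge and truncate callables, max_attempts) are hypothetical and chosen for illustration; the actual logic in cli_vocab.py may be structured differently.

```python
from typing import Callable


class RetryableConnectionError(Exception):
    """Hypothetical stand-in for a driver-level connection failure."""


def load_with_retry(
    merge: Callable[[], None],
    truncate: Callable[[], None],
    strategy: str,
    max_attempts: int = 3,
) -> int:
    """Run merge(); on a retryable error, clean up and try again.

    Returns the number of the attempt that succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            merge()
            return attempt
        except RetryableConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            if strategy == "insert_if_empty":
                # A partial load would make the table look non-empty on
                # the next attempt, so clear it before retrying.
                truncate()
    raise AssertionError("unreachable")


# Simulated run: the first merge attempt fails, the second succeeds.
calls = {"merge": 0, "truncate": 0}


def flaky_merge() -> None:
    calls["merge"] += 1
    if calls["merge"] == 1:
        raise RetryableConnectionError("connection dropped mid-merge")


def fake_truncate() -> None:
    calls["truncate"] += 1


attempts = load_with_retry(flaky_merge, fake_truncate, "insert_if_empty")
```

Only insert_if_empty triggers the truncate in this sketch, since that is the only strategy for which the PR description mentions truncation.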

@nicoloesch nicoloesch requested a review from gkennos June 3, 2026 04:38

@gkennos gkennos left a comment

Member

Probably worth populating the changelog.md file for this PR.

If nothing else, the 'create index' double-up needs changing before release.

Comment thread src/orm_loader/tables/loadable_table.py
Comment thread src/orm_loader/loaders/loading_helpers.py
Comment thread tests/backends/test_postgres_backend.py
@nicoloesch

Collaborator Author

@gkennos Addressed all PR comments. Also updated changelog.md to include the recent changes and bumped the version to 0.5.0, since this is probably more a new feature than a fix.

@gkennos gkennos self-requested a review June 15, 2026 22:28
@gkennos gkennos merged commit ab7c4da into main Jun 15, 2026
3 checks passed
@gkennos gkennos deleted the 12-reduce-mem branch June 15, 2026 22:29

Development

Successfully merging this pull request may close these issues.

Vocabulary load is slow and unstable on large tables

2 participants