Undo #21 (Draft)

gburd wants to merge 36 commits into master from undo

Conversation

@gburd (Owner) commented Mar 26, 2026

No description provided.
@github-actions force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6 on March 30, 2026 at 18:18
gburd added 30 commits April 24, 2026 15:59
Implement deferred file deletion (BDB: __fop_remove). Deletion is
scheduled for transaction commit or abort, not executed immediately.

API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion
driven by XACT commit/abort records).
On Windows: uses pgunlink() with retry on EACCES.
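The deferral pattern described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the names PendingDelete and AtEOXact_FileOps, and the in-memory list, are assumptions; the real code also WAL-logs the intent so the queue survives a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* One queued deletion; 'at_commit' selects commit-time vs abort-time. */
typedef struct PendingDelete
{
    struct PendingDelete *next;
    bool        at_commit;      /* delete on commit (true) or on abort (false) */
    char        path[1024];
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/* Schedule a deletion instead of unlinking immediately. */
static void
FileOpsDelete(const char *path, bool at_commit)
{
    PendingDelete *pd = malloc(sizeof(PendingDelete));

    pd->at_commit = at_commit;
    snprintf(pd->path, sizeof(pd->path), "%s", path);
    pd->next = pending_deletes;
    pending_deletes = pd;
}

/* Called once at transaction end; 'committed' selects which entries fire. */
static void
AtEOXact_FileOps(bool committed)
{
    PendingDelete *pd = pending_deletes;

    while (pd)
    {
        PendingDelete *next = pd->next;

        if (pd->at_commit == committed)
            (void) unlink(pd->path);    /* best effort at this point */
        free(pd);
        pd = next;
    }
    pending_deletes = NULL;
}
```

The key property this preserves is that the file remains on disk for the whole transaction and disappears only once the outcome is decided.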

Implement deferred file rename (BDB: __fop_rename). The rename is
scheduled for commit time using durable_rename() which handles fsync
ordering on Unix and MoveFileEx with retry on Windows.

API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).
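The fsync ordering that durable_rename() handles on Unix can be sketched like this (an assumed shape, loosely following PostgreSQL's durable_rename; error handling is abbreviated): sync the file before the rename is visible, then sync the file under its new name and the containing directory so the new directory entry itself survives a crash.

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a crash-safe rename: returns 0 on success, -1 on error. */
static int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    int     fd;
    char    dirbuf[1024];

    /* Make the file's contents durable before the rename becomes visible. */
    fd = open(oldpath, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(oldpath, newpath) != 0)
        return -1;

    /* Persist the new directory entry by fsyncing the parent directory.
     * dirname() may modify its argument, so work on a copy. */
    snprintf(dirbuf, sizeof(dirbuf), "%s", newpath);
    fd = open(dirname(dirbuf), O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

Without the directory fsync, a crash can leave the rename unapplied even though the file data itself was durable.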

Implement WAL-logged file write at offset (BDB: __fop_write). Data is
written immediately using pwrite() and fsynced for durability.

API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.
On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
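The POSIX side of the write path can be sketched as below. This is illustrative only (the real FileOpsWrite also emits the XLOG_FILEOPS_WRITE record first); the point is the short-write loop and the fsync for durability.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write 'len' bytes at 'offset', handling short writes and EINTR,
 * then fsync so the data is durable before the caller proceeds. */
static int
write_at_offset(const char *path, off_t offset, const void *data, size_t len)
{
    int         fd = open(path, O_WRONLY | O_CREAT, 0644);
    const char *p = data;

    if (fd < 0)
        return -1;
    while (len > 0)
    {
        ssize_t n = pwrite(fd, p, len, offset);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same chunk */
            close(fd);
            return -1;
        }
        p += n;                 /* advance past the partial write */
        offset += n;
        len -= n;
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```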

Implement WAL-logged file truncation. Executed immediately with
XLogFlush before the irreversible operation (following SMGR_TRUNCATE
pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.

API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.

Implement WAL-logged file metadata operations.

CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits
(only _S_IREAD/_S_IWRITE; no group/other support).

CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses
ACLs for ownership, not uid/gid).

Both execute immediately and are WAL-logged for crash recovery.

MKDIR: Immediate execution using MakePGDirectory(). Registers
rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir()
(no mode parameter, permissions inherited from parent).

RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on all
platforms, _rmdir() on Windows.

SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink()
(NTFS junction points) on Windows. Registers delete-on-abort.

LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA()
on Windows (NTFS only). Registers delete-on-abort.

Both create links idempotently during redo (unlink first if exists).

Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.

FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:

  - Linux: <sys/xattr.h> setxattr/removexattr
  - macOS: <sys/xattr.h> with extra options parameter
  - FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
  - Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
  - Fallback: returns ENOTSUP (the record is still emitted to WAL and
    replays as a no-op on unsupported platforms, keeping the WAL stream
    portable)

Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
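The platform-dispatch structure might look roughly like this (a sketch under assumptions: pg_setxattr_sketch is an illustrative name, and the real pg_xattr.c also covers the ADS path on Windows and the remove operations):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#if defined(__linux__) || defined(__APPLE__)
#include <sys/xattr.h>
#elif defined(__FreeBSD__)
#include <sys/extattr.h>
#endif

/* Set an extended attribute on 'path', papering over platform
 * differences; unsupported platforms report ENOTSUP. */
static int
pg_setxattr_sketch(const char *path, const char *name,
                   const void *value, size_t size)
{
#if defined(__linux__)
    return setxattr(path, name, value, size, 0);
#elif defined(__APPLE__)
    /* macOS adds position and options arguments. */
    return setxattr(path, name, value, size, 0, 0);
#elif defined(__FreeBSD__)
    /* FreeBSD selects the namespace by id rather than a name prefix. */
    return extattr_set_file(path, EXTATTR_NAMESPACE_USER,
                            name, value, size) < 0 ? -1 : 0;
#else
    errno = ENOTSUP;            /* no-op fallback for WAL portability */
    return -1;
#endif
}
```

Compiler macros keep the dispatch entirely in the source file, which is what lets the commit avoid touching meson.build or configure.ac.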

Add regression tests for all FILEOPS operations (CREATE, DELETE,
RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK,
SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay.

Update the transactional fileops example script with the expanded
operation set following the Berkeley DB fileops.src model.

Introduce the IndexPrune framework that allows index access methods to
register callbacks for proactively pruning dead index entries when UNDO
records are discarded. This avoids accumulating dead tuples that would
otherwise require VACUUM to clean up.

Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard on UNDO discard

Individual index AM implementations follow in subsequent commits.
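The registry-and-dispatcher shape described above can be sketched as follows; the IndexPruneCallbacks struct and the callback signature here are assumptions for illustration, not the actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Callbacks an index AM registers to be told about UNDO discards. */
typedef struct IndexPruneCallbacks
{
    void        (*prune_on_discard) (unsigned int relid,
                                     unsigned long discarded_upto);
} IndexPruneCallbacks;

#define MAX_INDEX_AMS 16

static const IndexPruneCallbacks *prune_registry[MAX_INDEX_AMS];
static int  prune_registry_len = 0;

/* Each index AM registers once, typically at handler load time. */
static void
IndexPruneRegister(const IndexPruneCallbacks *cb)
{
    if (prune_registry_len < MAX_INDEX_AMS)
        prune_registry[prune_registry_len++] = cb;
}

/* Called from the UNDO discard path: fan out to every registered AM
 * so dead entries can be pruned before VACUUM would ever see them. */
static void
IndexPruneNotifyDiscard(unsigned int relid, unsigned long discarded_upto)
{
    for (int i = 0; i < prune_registry_len; i++)
        prune_registry[i]->prune_on_discard(relid, discarded_upto);
}

/* Example AM callback, purely for illustration: counts notifications. */
static int  demo_prune_calls = 0;

static void
demo_prune_on_discard(unsigned int relid, unsigned long discarded_upto)
{
    (void) relid;
    (void) discarded_upto;
    demo_prune_calls++;
}

static const IndexPruneCallbacks demo_callbacks = {demo_prune_on_discard};
```

Each per-AM commit below then amounts to supplying one such callback that knows how to find and remove that AM's dead entries.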

Placeholder for index pruning design documentation.
To be populated when design notes are split by subsystem.

Register IndexPrune callbacks in the B-tree access method handler.
nbtprune.c implements dead-entry detection and removal using UNDO
discard notifications, allowing proactive cleanup without full VACUUM.

Register IndexPrune callbacks in the hash access method handler.
hashprune.c implements dead-entry detection and removal using UNDO
discard notifications for hash indexes.

Register IndexPrune callbacks in the GIN access method handler.
ginprune.c implements dead-entry detection and removal using UNDO
discard notifications for GIN indexes.

Register IndexPrune callbacks in the GiST access method handler.
gistprune.c implements dead-entry detection and removal using UNDO
discard notifications for GiST indexes.

Register IndexPrune callbacks in the SP-GiST access method handler.
spgprune.c implements dead-entry detection and removal using UNDO
discard notifications for SP-GiST indexes.

Add VACUUM statistics tracking for UNDO-pruned index entries and verbose
output. Include comprehensive test suite exercising index pruning across
all supported index access methods via test_undo_tam.

Add opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.

How heap uses UNDO:

INSERT operations:
  - Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
  - Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
  - On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan

UPDATE operations:
  - Write UNDO record with complete old tuple version before update
  - On abort: UndoReplay() restores old tuple version from UNDO

DELETE operations:
  - Write UNDO record with complete deleted tuple data
  - On abort: UndoReplay() resurrects tuple from UNDO record

MVCC visibility:
  - Tuples reference UNDO chain via xmin/xmax
  - HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
  - Enables reconstructing tuple state as of any snapshot

Configuration:
  CREATE TABLE t (...) WITH (enable_undo=on);

The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
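The per-operation replay behavior listed above can be condensed into a sketch; the record layout and names here are illustrative, not the actual on-disk UNDO format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum UndoOp
{
    UNDO_INSERT,
    UNDO_UPDATE,
    UNDO_DELETE
} UndoOp;

/* One UNDO record as described above: who did it, which tuple,
 * and the prior version (NULL for INSERT, which had no prior). */
typedef struct UndoRecord
{
    UndoOp      op;
    unsigned    xid;            /* transaction that wrote the record */
    unsigned    tid;            /* tuple identifier within the heap */
    const char *old_tuple;      /* prior version; NULL for INSERT */
} UndoRecord;

/* Applying a record during abort reverses the original operation:
 * returns the tuple image that should be visible afterwards, or
 * NULL meaning the slot becomes LP_UNUSED. */
static const char *
undo_replay_sketch(const UndoRecord *rec)
{
    switch (rec->op)
    {
        case UNDO_INSERT:
            return NULL;            /* tuple vanishes: mark LP_UNUSED */
        case UNDO_UPDATE:
        case UNDO_DELETE:
            return rec->old_tuple;  /* restore the saved prior version */
    }
    return NULL;
}
```

Walking such records newest-to-oldest is also what the MVCC path does conceptually: each step reconstructs one older version until the version visible to the snapshot is reached.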

Value proposition:

1. Faster rollback: No heap scan required, UNDO chains are sequential
   - Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
   - UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)

2. Cleaner abort handling: UNDO records are self-contained
   - No need to track which heap pages were modified
   - Works across crashes (UNDO is WAL-logged)

3. Foundation for future features:
   - Multi-version concurrency control without bloat
   - Faster VACUUM (can discard entire UNDO segments)
   - Point-in-time recovery improvements

Trade-offs:

Costs:
  - Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
  - UNDO log space: Requires space for UNDO records until no longer visible
  - Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed

Benefits:
  - Primarily valuable for workloads with:
    - Frequent aborts (e.g., speculative execution, deadlocks)
    - Long-running transactions needing old snapshots
    - Hot UPDATE workloads benefiting from cleaner rollback

Not recommended for:
  - Bulk load workloads (COPY: 2x write amplification without abort benefit)
  - Append-only tables (rare aborts mean cost without benefit)
  - Space-constrained systems (UNDO retention increases storage)

When beneficial:
  - OLTP with high abort rates (>5%)
  - Systems with aggressive pruning needs (frequent VACUUM)
  - Workloads requiring historical visibility (audit, time-travel queries)

Integration points:
  - heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
  - Heap pruning respects undo_retention to avoid discarding needed UNDO
  - pg_upgrade compatibility: UNDO disabled for upgraded tables

Background workers:
  - Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
  - Rollback itself is synchronous (via UndoReplay() during transaction abort)
  - Workers periodically trim UNDO logs based on undo_retention and snapshot visibility

This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.

Implement UNDO resource manager for B-tree indexes and regression test.
When a transaction aborts, provisionally inserted index entries are marked
LP_DEAD. Includes a zero_vacuum test verifying that aborted inserts leave
no dead tuples, and a consistency check via bt_index_check().

Document the cluster-wide UNDO architecture, including the UNDO log design,
record format, transaction integration, and heap AM integration details.

Add diagnostic timing to CreateCheckPoint() that breaks down the
previously unmeasured pre-write and post-sync phases (SyncPre,
DelayStart, DelayComplete, XLogFlush, ControlFile, SyncPost,
RemoveWAL, TruncSub). When log_checkpoints is on, a new LOG line
is emitted before the existing checkpoint-complete message, making
it straightforward to diagnose slow shutdown checkpoints.

Add CheckPointUndoLog() to persist UNDO log statistics at checkpoint
time. Called from CheckPointGuts() before the buffer write phase, it
scans active UNDO logs under shared locks and logs allocated/discarded/
retained byte counts when log_checkpoints is enabled.

Increase pg_ctl stop timeout for isolation tests to 180 seconds via
PGCTLTIMEOUT environment variable, preventing false test failures when
shutdown checkpoints take longer than the default 60-second timeout.
Also add env dict support for regress/isolation tests in the root
meson.build, matching the existing pattern for TAP tests.

Add a self-contained benchmark suite comparing three scenarios:
baseline (master), undo-compiled-but-off, and undo-enabled.
Covers insert/update/delete throughput, rollback cost, VACUUM
overhead, read stability under writes, storage footprint, and
pgbench TPS with a mixed OLTP workload including 10% rollbacks.

When running from within the undo branch checkout, git worktree add
fails because the branch is already checked out. Fall back to a
detached-HEAD worktree, then to a symlink as a last resort. Also fix
cleanup to skip symlinked source directories.

ENODATA is Linux-specific. FreeBSD uses ENOATTR for "attribute not
found" from extattr operations. Define PG_ENOATTR in pg_xattr.h
that maps to the correct platform errno, and use it in fileops.c.
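The mapping might look like the following sketch (the fallback branch is an assumption for platforms that define neither constant):

```c
#include <assert.h>
#include <errno.h>

/* Pick the platform's "attribute not found" errno. */
#if defined(__FreeBSD__)
#define PG_ENOATTR ENOATTR      /* FreeBSD extattr operations */
#elif defined(ENODATA)
#define PG_ENOATTR ENODATA      /* Linux setxattr/getxattr */
#else
#define PG_ENOATTR ENOENT       /* conservative fallback (assumed) */
#endif
```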

psql does not expand :variables inside $$ string constants, so the
DO block loop limit was not being set. Use a temp function with an
integer parameter instead.

SQL rewrites (b2, b3, b4, b6, b7): replace repeated full-table
UPDATE/DELETE with single-row PK lookups, small batch operations,
cross-table updates, and 1-5% targeted mutations. This produces
realistic OLTP measurements and avoids multi-hour runtimes on
small systems.

Portability: add get_nproc() and get_dir_bytes() helpers for
Illumos (psrinfo, du -sk) and expand record_sysinfo() to handle
Illumos prtconf/psrinfo and non-GNU coreutils.

The Solaris/Illumos linker fails with an undefined ldap_start_tls_s
symbol when building postgres. Disable LDAP for benchmarks on SunOS.

PG19 changed "excluding connections establishing" to
"without initial connection time" in pgbench output. Match both formats.

Illumos date(1) does not support -Iseconds. Use the portable
strftime format '+%Y-%m-%dT%H:%M:%S%z' as a fallback.

Illumos sort(1) does not support -g (general numeric sort).
Use -n, which handles our decimal values correctly.