9355586 to 9cbf7e6
Implement deferred file deletion (BDB: __fop_remove). Deletion is scheduled for transaction commit or abort, not executed immediately.
API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion is driven by XACT commit/abort records).
On Windows: uses pgunlink() with retry on EACCES.
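A minimal sketch of the deferred-delete idea, assuming a simple linked list of pending requests. All names here (PendingDelete, sketch_schedule_delete, sketch_at_xact_end) are hypothetical and are not the patch's actual structures; WAL logging and the Windows retry logic are left out.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical pending-delete entry; not the patch's real structure. */
typedef struct PendingDelete
{
	char	   *path;
	bool		at_commit;	/* true: delete on commit; false: delete on abort */
	struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/* Record the request only; the filesystem is not touched here. */
int
sketch_schedule_delete(const char *path, bool at_commit)
{
	size_t		len = strlen(path) + 1;
	PendingDelete *p = malloc(sizeof(PendingDelete));

	if (p == NULL)
		return -1;
	p->path = malloc(len);
	if (p->path == NULL)
	{
		free(p);
		return -1;
	}
	memcpy(p->path, path, len);
	p->at_commit = at_commit;
	p->next = pending_deletes;
	pending_deletes = p;
	return 0;
}

/*
 * Called once at transaction end. Unlinks the entries whose at_commit flag
 * matches the outcome, drops the rest, and returns the number of files removed.
 */
int
sketch_at_xact_end(bool committed)
{
	int			removed = 0;

	while (pending_deletes != NULL)
	{
		PendingDelete *p = pending_deletes;

		pending_deletes = p->next;
		if (p->at_commit == committed && unlink(p->path) == 0)
			removed++;
		free(p->path);
		free(p);
	}
	return removed;
}
```

Because the list is consumed at transaction end regardless of outcome, a request registered with at_commit = false acts as cleanup-on-abort and is simply forgotten on commit.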
Implement deferred file rename (BDB: __fop_rename). The rename is scheduled for commit time and uses durable_rename(), which handles fsync ordering on Unix and MoveFileEx with retry on Windows.
API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).
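For readers unfamiliar with the fsync ordering mentioned above, here is a rough POSIX-only sketch of the usual sequence: flush the source, rename, flush the file under its new name, then flush the containing directory. The function names are hypothetical. This is not durable_rename() itself, which also fsyncs a pre-existing target and reports errors through ereport.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for fsync() and dirname() under strict -std= modes */
#endif

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open a path with the given flags, fsync it, and close it again. */
int
sketch_fsync_path(const char *path, int flags)
{
	int			fd = open(path, flags);
	int			rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}

/* Rename oldpath to newpath so that the result survives a crash. */
int
sketch_durable_rename(const char *oldpath, const char *newpath)
{
	char		dirbuf[1024];

	if (strlen(newpath) >= sizeof(dirbuf))
		return -1;

	/* 1. Flush the file contents before the new name can point at them. */
	if (sketch_fsync_path(oldpath, O_RDWR) != 0)
		return -1;

	/* 2. The rename itself is atomic with respect to other processes. */
	if (rename(oldpath, newpath) != 0)
		return -1;

	/* 3. Flush the file again under its new name. */
	if (sketch_fsync_path(newpath, O_RDWR) != 0)
		return -1;

	/* 4. Flush the parent directory so the new directory entry is durable. */
	strcpy(dirbuf, newpath);	/* dirname() may modify its argument */
	return sketch_fsync_path(dirname(dirbuf), O_RDONLY);
}
```

The sketch assumes both paths live in the same directory; a cross-directory rename would also need the old parent directory flushed.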
Implement WAL-logged file write at offset (BDB: __fop_write). Data is written immediately using pwrite() and fsynced for durability.
API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.
On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
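The write path can be illustrated with a small POSIX sketch: loop over pwrite() until every byte is written, then fsync. sketch_write_at is a hypothetical name, the target file is assumed to exist already, and the WAL record that the real FileOpsWrite() emits is omitted.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for pwrite() under strict -std= modes */
#endif

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes at offset and flush them; returns 0 on success, -1 on error. */
int
sketch_write_at(const char *path, off_t offset, const void *data, size_t len)
{
	const char *p = data;
	int			saved_errno;
	int			fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	while (len > 0)
	{
		ssize_t		n = pwrite(fd, p, len, offset);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted before writing anything: retry */
			goto fail;
		}
		if (n == 0)
		{
			errno = ENOSPC;		/* no progress and no errno: assume disk full */
			goto fail;
		}
		/* Short write: advance past the bytes that did make it out. */
		p += n;
		offset += n;
		len -= (size_t) n;
	}

	if (fsync(fd) != 0)
		goto fail;
	return close(fd);

fail:
	saved_errno = errno;		/* close() may clobber errno */
	close(fd);
	errno = saved_errno;
	return -1;
}
```

Since the write targets an explicit offset, replaying it a second time during redo produces the same file contents.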
Implement WAL-logged file truncation. Executed immediately, with XLogFlush before the irreversible operation (following the SMGR_TRUNCATE pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.
API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.
Implement WAL-logged file metadata operations.
CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits (only _S_IREAD/_S_IWRITE; no group/other support).
CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses ACLs for ownership, not uid/gid).
Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir() (no mode parameter; permissions are inherited from the parent).
RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on POSIX, _rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink() (NTFS junction points) on Windows. Registers delete-on-abort.
LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA() on Windows (NTFS only). Registers delete-on-abort.
Both create links idempotently during redo (unlink first if the link already exists).
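The "unlink first" redo rule can be sketched as follows for the POSIX symlink case. sketch_redo_symlink is a hypothetical name; the real redo routine and its error reporting are not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for symlink() under strict -std= modes */
#endif

#include <errno.h>
#include <unistd.h>

/*
 * Create linkpath -> target so that replaying the same record twice yields
 * the same result. A bare symlink() would fail with EEXIST on the second
 * replay, so any existing entry is removed first.
 */
int
sketch_redo_symlink(const char *target, const char *linkpath)
{
	/* A missing link is the normal first-replay case, not an error. */
	if (unlink(linkpath) != 0 && errno != ENOENT)
		return -1;

	return symlink(target, linkpath);
}
```

The same shape applies to hard links, with link() in place of symlink().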
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.
FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:
- Linux: <sys/xattr.h> setxattr/removexattr
- macOS: <sys/xattr.h> with extra options parameter
- FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
- Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
- Fallback: returns ENOTSUP (the operation is still WAL-logged but is a
no-op on unsupported platforms, which keeps the WAL stream portable)
Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
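A hedged sketch of what such compiler-macro dispatch might look like for the "set" operation. sketch_setxattr is a hypothetical name and not the actual pg_xattr.c interface. The Windows Alternate Data Streams branch is omitted here, so Windows falls through to the ENOTSUP fallback in this sketch. On FreeBSD the attribute name is assumed to be passed without a "user." prefix, since the namespace is a separate argument.

```c
#include <errno.h>
#include <stddef.h>

#if defined(__linux__) || defined(__APPLE__)
#include <sys/types.h>
#include <sys/xattr.h>
#elif defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/extattr.h>
#endif

/* Set one extended attribute; returns 0 on success, -1 with errno set. */
int
sketch_setxattr(const char *path, const char *name,
				const void *value, size_t size)
{
#if defined(__linux__)
	return setxattr(path, name, value, size, 0);
#elif defined(__APPLE__)
	/* macOS takes an extra position argument before the options. */
	return setxattr(path, name, value, size, 0, 0);
#elif defined(__FreeBSD__)
	/* extattr_set_file() returns the byte count written, or -1. */
	return extattr_set_file(path, EXTATTR_NAMESPACE_USER,
							name, value, size) < 0 ? -1 : 0;
#else
	(void) path;
	(void) name;
	(void) value;
	(void) size;
	errno = ENOTSUP;
	return -1;
#endif
}
```

Keeping the platform conditionals inside one function means callers see a single signature everywhere, which is the point of the portability layer described above.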
Add regression tests for all FILEOPS operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK, SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay. Update the transactional fileops example script with the expanded operation set following the Berkeley DB fileops.src model.
Introduce the IndexPrune framework that allows index access methods to register callbacks for proactively pruning dead index entries when UNDO records are discarded. This avoids accumulating dead tuples that would otherwise require VACUUM to clean up.
Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard on UNDO discard
Individual index AM implementations follow in subsequent commits.
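The register-then-dispatch pattern can be sketched with a fixed-size table of function pointers. Everything below (type names, the callback signature, the 16-entry limit, the discard-horizon argument) is an assumption for illustration and does not reflect the actual IndexPruneCallbacks layout.

```c
#include <stddef.h>

/* Hypothetical callback: prune entries of one relation up to a discard horizon. */
typedef void (*SketchPruneFn) (unsigned int relid,
							   unsigned long discard_upto,
							   void *arg);

typedef struct SketchPruneEntry
{
	unsigned int am_id;			/* which access method registered this */
	SketchPruneFn fn;
	void	   *arg;
} SketchPruneEntry;

#define SKETCH_MAX_PRUNE 16

static SketchPruneEntry sketch_registry[SKETCH_MAX_PRUNE];
static int	sketch_nregistered = 0;

/* Add a callback to the registry; returns 0, or -1 if the table is full. */
int
sketch_register_prune(unsigned int am_id, SketchPruneFn fn, void *arg)
{
	if (fn == NULL || sketch_nregistered >= SKETCH_MAX_PRUNE)
		return -1;
	sketch_registry[sketch_nregistered].am_id = am_id;
	sketch_registry[sketch_nregistered].fn = fn;
	sketch_registry[sketch_nregistered].arg = arg;
	sketch_nregistered++;
	return 0;
}

/* Dispatcher: invoke every registered callback; returns how many ran. */
int
sketch_notify_discard(unsigned int relid, unsigned long discard_upto)
{
	int			i;

	for (i = 0; i < sketch_nregistered; i++)
		sketch_registry[i].fn(relid, discard_upto, sketch_registry[i].arg);
	return sketch_nregistered;
}

/* Example callback standing in for an index AM: counts its invocations. */
void
sketch_count_cb(unsigned int relid, unsigned long discard_upto, void *arg)
{
	(void) relid;
	(void) discard_upto;
	(*(int *) arg)++;
}
```

The discard path only needs to know about the dispatcher, so adding a new index AM does not require touching the UNDO code.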
Placeholder for index pruning design documentation. To be populated when design notes are split by subsystem.
Register IndexPrune callbacks in the B-tree access method handler. nbtprune.c implements dead-entry detection and removal using UNDO discard notifications, allowing proactive cleanup without full VACUUM.
Register IndexPrune callbacks in the hash access method handler. hashprune.c implements dead-entry detection and removal using UNDO discard notifications for hash indexes.
Register IndexPrune callbacks in the GIN access method handler. ginprune.c implements dead-entry detection and removal using UNDO discard notifications for GIN indexes.
Register IndexPrune callbacks in the GiST access method handler. gistprune.c implements dead-entry detection and removal using UNDO discard notifications for GiST indexes.
Register IndexPrune callbacks in the SP-GiST access method handler. spgprune.c implements dead-entry detection and removal using UNDO discard notifications for SP-GiST indexes.
Add VACUUM statistics tracking for UNDO-pruned index entries and verbose output. Include comprehensive test suite exercising index pruning across all supported index access methods via test_undo_tam.
Adds opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.
How heap uses UNDO:
INSERT operations:
- Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
- Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
- On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan
UPDATE operations:
- Write UNDO record with complete old tuple version before update
- On abort: UndoReplay() restores old tuple version from UNDO
DELETE operations:
- Write UNDO record with complete deleted tuple data
- On abort: UndoReplay() resurrects tuple from UNDO record
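As a toy model of the three cases above, the following sketch applies UNDO records to an in-memory slot array. The record layout, the names, and the newest-first replay order are assumptions for illustration; the real UNDO record format and UndoReplay() operate on heap pages and are considerably more involved.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_NSLOTS	8
#define SKETCH_DATALEN	16

typedef enum
{
	SKETCH_UNDO_INSERT,
	SKETCH_UNDO_UPDATE,
	SKETCH_UNDO_DELETE
} SketchUndoType;

/* Stand-in for a heap line pointer plus its tuple. */
typedef struct SketchSlot
{
	bool		used;
	char		data[SKETCH_DATALEN];
} SketchSlot;

/* Hypothetical UNDO record: transaction ID, tuple location, old image. */
typedef struct SketchUndoRec
{
	SketchUndoType type;
	unsigned int xid;
	int			tid;			/* slot index, standing in for a TID */
	char		old_data[SKETCH_DATALEN];	/* unused for INSERT */
} SketchUndoRec;

/* Reverse a single operation using only the information in its record. */
void
sketch_undo_apply(SketchSlot *heap, const SketchUndoRec *rec)
{
	SketchSlot *slot = &heap[rec->tid];

	switch (rec->type)
	{
		case SKETCH_UNDO_INSERT:
			/* Undo an insert: free the slot (the LP_UNUSED analogue). */
			slot->used = false;
			memset(slot->data, 0, SKETCH_DATALEN);
			break;
		case SKETCH_UNDO_UPDATE:
		case SKETCH_UNDO_DELETE:
			/* Undo an update or delete: put the old tuple image back. */
			slot->used = true;
			memcpy(slot->data, rec->old_data, SKETCH_DATALEN);
			break;
	}
}

/* Replay a transaction's records newest-first, with no scan of the heap. */
void
sketch_undo_replay(SketchSlot *heap, const SketchUndoRec *recs, int nrecs)
{
	int			i;

	for (i = nrecs - 1; i >= 0; i--)
		sketch_undo_apply(heap, &recs[i]);
}
```

Each record names the slot it affects, so rollback touches only the tuples the transaction modified.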
MVCC visibility:
- Tuples reference UNDO chain via xmin/xmax
- HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
- Enables reconstructing tuple state as of any snapshot
Configuration:
CREATE TABLE t (...) WITH (enable_undo=on);
The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
Value proposition:
1. Faster rollback: No heap scan required, UNDO chains are sequential
   - Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
   - UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)
2. Cleaner abort handling: UNDO records are self-contained
   - No need to track which heap pages were modified
   - Works across crashes (UNDO is WAL-logged)
3. Foundation for future features:
   - Multi-version concurrency control without bloat
   - Faster VACUUM (can discard entire UNDO segments)
   - Point-in-time recovery improvements
Trade-offs:
Costs:
- Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
- UNDO log space: Requires space for UNDO records until no longer visible
- Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed
Benefits:
- Primarily valuable for workloads with:
   - Frequent aborts (e.g., speculative execution, deadlocks)
   - Long-running transactions needing old snapshots
   - Hot UPDATE workloads benefiting from cleaner rollback
Not recommended for:
- Bulk load workloads (COPY: 2x write amplification without abort benefit)
- Append-only tables (rare aborts mean cost without benefit)
- Space-constrained systems (UNDO retention increases storage)
When beneficial:
- OLTP with high abort rates (>5%)
- Systems with aggressive pruning needs (frequent VACUUM)
- Workloads requiring historical visibility (audit, time-travel queries)
Integration points:
- heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
- Heap pruning respects undo_retention to avoid discarding needed UNDO
- pg_upgrade compatibility: UNDO disabled for upgraded tables
Background workers:
- Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
- Rollback itself is synchronous (via UndoReplay() during transaction abort)
- Workers periodically trim UNDO logs based on undo_retention and snapshot visibility
This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Implement UNDO resource manager for B-tree indexes and regression test. When a transaction aborts, provisionally inserted index entries are marked LP_DEAD. Includes zero_vacuum test verifying aborted inserts leave no dead tuples and index consistency via bt_index_check().
Document the cluster-wide UNDO architecture including UNDO log design, record format, transaction integration, and heap AM integration details.
Add diagnostic timing to CreateCheckPoint() that breaks down the previously unmeasured pre-write and post-sync phases (SyncPre, DelayStart, DelayComplete, XLogFlush, ControlFile, SyncPost, RemoveWAL, TruncSub). When log_checkpoints is on, a new LOG line is emitted before the existing checkpoint-complete message, making it straightforward to diagnose slow shutdown checkpoints.
Add CheckPointUndoLog() to persist UNDO log statistics at checkpoint time. Called from CheckPointGuts() before the buffer write phase, it scans active UNDO logs under shared locks and logs allocated/discarded/retained byte counts when log_checkpoints is enabled.
Increase pg_ctl stop timeout for isolation tests to 180 seconds via the PGCTLTIMEOUT environment variable, preventing false test failures when shutdown checkpoints take longer than the default 60-second timeout. Also add env dict support for regress/isolation tests in the root meson.build, matching the existing pattern for TAP tests.
Add a self-contained benchmark suite comparing three scenarios: baseline (master), undo-compiled-but-off, and undo-enabled. Covers insert/update/delete throughput, rollback cost, VACUUM overhead, read stability under writes, storage footprint, and pgbench TPS with a mixed OLTP workload including 10% rollbacks.
When running from within the undo branch checkout, git worktree add fails because the branch is already in use. Fall back to detached HEAD worktree, then symlink as last resort. Also fix cleanup to skip symlinked source directories.
ENODATA is Linux-specific. FreeBSD uses ENOATTR for "attribute not found" from extattr operations. Define PG_ENOATTR in pg_xattr.h that maps to the correct platform errno, and use it in fileops.c.
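One way such a mapping could be written, shown here as a sketch that keys off which errno macros the platform headers define. SKETCH_ENOATTR and sketch_xattr_missing are hypothetical names; the actual PG_ENOATTR definition in pg_xattr.h may instead use platform macros.

```c
#include <errno.h>

/*
 * Pick the platform's "attribute not found" errno. Platforms that define
 * ENOATTR in <errno.h> (FreeBSD, macOS) get it directly; Linux, which
 * reports ENODATA, takes the second branch.
 */
#if defined(ENOATTR)
#define SKETCH_ENOATTR ENOATTR
#elif defined(ENODATA)
#define SKETCH_ENOATTR ENODATA
#else
#define SKETCH_ENOATTR ENOENT	/* arbitrary stand-in for unknown platforms */
#endif

/* True if err means "the attribute does not exist" on this platform. */
int
sketch_xattr_missing(int err)
{
	return err == SKETCH_ENOATTR;
}
```

Callers can then compare against the single macro rather than repeating the platform conditionals at each call site.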
psql does not expand :variables inside $$ string constants, so the DO block loop limit was not being set. Use a temp function with an integer parameter instead.
SQL rewrites (b2, b3, b4, b6, b7): replace repeated full-table UPDATE/DELETE with single-row PK lookups, small batch operations, cross-table updates, and 1-5% targeted mutations. This produces realistic OLTP measurements and avoids multi-hour runtimes on small systems.
Portability: add get_nproc() and get_dir_bytes() helpers for Illumos (psrinfo, du -sk) and expand record_sysinfo() to handle Illumos prtconf/psrinfo and non-GNU coreutils.
The Solaris/Illumos linker fails with undefined ldap_start_tls_s when building postgres. Disable LDAP for benchmarks on SunOS.
pgbench changed "excluding connections establishing" to "without initial connection time" in PG14. Match both formats.
Illumos date(1) does not support -Iseconds. Use portable strftime format '+%Y-%m-%dT%H:%M:%S%z' as fallback.
Illumos sort(1) does not support -g (general numeric sort). Use -n which handles our decimal values correctly.