Undo #21 (Draft)

gburd wants to merge 36 commits into master from undo

Conversation

@gburd (Owner) commented Mar 26, 2026

No description provided.
@github-actions force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6 on March 30, 2026 at 18:18
gburd added 30 commits April 24, 2026 15:59
Implement deferred file deletion (BDB: __fop_remove). Deletion is
scheduled for transaction commit or abort, not executed immediately.

API: FileOpsDelete(path, at_commit) -> void
WAL: XLOG_FILEOPS_DELETE (intentional no-op during redo; deletion
driven by XACT commit/abort records).
On Windows: uses pgunlink() with retry on EACCES.
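The deferral pattern described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the names PendingDelete and AtEOXact_FileOps, and the in-memory list, are assumptions; the real code also WAL-logs the intent so the queue survives a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* One queued deletion; 'at_commit' selects commit-time vs abort-time. */
typedef struct PendingDelete
{
    struct PendingDelete *next;
    bool        at_commit;      /* delete on commit (true) or on abort (false) */
    char        path[1024];
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/* Schedule a deletion instead of unlinking immediately. */
static void
FileOpsDelete(const char *path, bool at_commit)
{
    PendingDelete *pd = malloc(sizeof(PendingDelete));

    pd->at_commit = at_commit;
    snprintf(pd->path, sizeof(pd->path), "%s", path);
    pd->next = pending_deletes;
    pending_deletes = pd;
}

/* Called once at transaction end; 'committed' selects which entries fire. */
static void
AtEOXact_FileOps(bool committed)
{
    PendingDelete *pd = pending_deletes;

    while (pd)
    {
        PendingDelete *next = pd->next;

        if (pd->at_commit == committed)
            (void) unlink(pd->path);    /* best effort at this point */
        free(pd);
        pd = next;
    }
    pending_deletes = NULL;
}
```

The key property this preserves is that the file remains on disk for the whole transaction and disappears only once the outcome is decided.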

Implement deferred file rename (BDB: __fop_rename). The rename is
scheduled for commit time using durable_rename() which handles fsync
ordering on Unix and MoveFileEx with retry on Windows.

API: FileOpsRename(oldpath, newpath) -> int
WAL: XLOG_FILEOPS_RENAME (intentional no-op during redo).
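The fsync ordering that durable_rename() handles on Unix can be sketched like this (an assumed shape, loosely following PostgreSQL's durable_rename; error handling is abbreviated): sync the file before the rename is visible, then sync the file under its new name and the containing directory so the new directory entry itself survives a crash.

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a crash-safe rename: returns 0 on success, -1 on error. */
static int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    int     fd;
    char    dirbuf[1024];

    /* Make the file's contents durable before the rename becomes visible. */
    fd = open(oldpath, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(oldpath, newpath) != 0)
        return -1;

    /* Persist the new directory entry by fsyncing the parent directory.
     * dirname() may modify its argument, so work on a copy. */
    snprintf(dirbuf, sizeof(dirbuf), "%s", newpath);
    fd = open(dirname(dirbuf), O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

Without the directory fsync, a crash can leave the rename unapplied even though the file data itself was durable.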

Implement WAL-logged file write at offset (BDB: __fop_write). Data is
written immediately using pwrite() and fsynced for durability.

API: FileOpsWrite(path, offset, data, len) -> int
WAL: XLOG_FILEOPS_WRITE with redo that replays the write.
On Windows: uses SetFilePointerEx + WriteFile via pg_pwrite.
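The POSIX side of the write path can be sketched as below. This is illustrative only (the real FileOpsWrite also emits the XLOG_FILEOPS_WRITE record first); the point is the short-write loop and the fsync for durability.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write 'len' bytes at 'offset', handling short writes and EINTR,
 * then fsync so the data is durable before the caller proceeds. */
static int
write_at_offset(const char *path, off_t offset, const void *data, size_t len)
{
    int         fd = open(path, O_WRONLY | O_CREAT, 0644);
    const char *p = data;

    if (fd < 0)
        return -1;
    while (len > 0)
    {
        ssize_t n = pwrite(fd, p, len, offset);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same chunk */
            close(fd);
            return -1;
        }
        p += n;                 /* advance past the partial write */
        offset += n;
        len -= n;
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```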

Implement WAL-logged file truncation. Executed immediately with
XLogFlush before the irreversible operation (following SMGR_TRUNCATE
pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.

API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.

Implement WAL-logged file metadata operations.

CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits
(only _S_IREAD/_S_IWRITE; no group/other support).

CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses
ACLs for ownership, not uid/gid).

Both execute immediately and are WAL-logged for crash recovery.

MKDIR: Immediate execution using MakePGDirectory(). Registers
rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir()
(no mode parameter, permissions inherited from parent).

RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on all
platforms, _rmdir() on Windows.

SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink()
(NTFS junction points) on Windows. Registers delete-on-abort.

LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA()
on Windows (NTFS only). Registers delete-on-abort.

Both create links idempotently during redo (unlink first if exists).

Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.

FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:

  - Linux: <sys/xattr.h> setxattr/removexattr
  - macOS: <sys/xattr.h> with extra options parameter
  - FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
  - Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
  - Fallback: returns ENOTSUP (the record is still emitted to WAL and
    replays as a no-op on unsupported platforms, keeping the WAL stream
    portable)

Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
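The platform-dispatch structure might look roughly like this (a sketch under assumptions: pg_setxattr_sketch is an illustrative name, and the real pg_xattr.c also covers the ADS path on Windows and the remove operations):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#if defined(__linux__) || defined(__APPLE__)
#include <sys/xattr.h>
#elif defined(__FreeBSD__)
#include <sys/extattr.h>
#endif

/* Set an extended attribute on 'path', papering over platform
 * differences; unsupported platforms report ENOTSUP. */
static int
pg_setxattr_sketch(const char *path, const char *name,
                   const void *value, size_t size)
{
#if defined(__linux__)
    return setxattr(path, name, value, size, 0);
#elif defined(__APPLE__)
    /* macOS adds position and options arguments. */
    return setxattr(path, name, value, size, 0, 0);
#elif defined(__FreeBSD__)
    /* FreeBSD selects the namespace by id rather than a name prefix. */
    return extattr_set_file(path, EXTATTR_NAMESPACE_USER,
                            name, value, size) < 0 ? -1 : 0;
#else
    errno = ENOTSUP;            /* no-op fallback for WAL portability */
    return -1;
#endif
}
```

Compiler macros keep the dispatch entirely in the source file, which is what lets the commit avoid touching meson.build or configure.ac.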

Add regression tests for all FILEOPS operations (CREATE, DELETE,
RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK,
SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay.

Update the transactional fileops example script with the expanded
operation set following the Berkeley DB fileops.src model.

Introduce the IndexPrune framework that allows index access methods to
register callbacks for proactively pruning dead index entries when UNDO
records are discarded. This avoids accumulating dead tuples that would
otherwise require VACUUM to clean up.

Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard on UNDO discard

Individual index AM implementations follow in subsequent commits.
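The registry-and-dispatcher shape described above can be sketched as follows; the IndexPruneCallbacks struct and the callback signature here are assumptions for illustration, not the actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Callbacks an index AM registers to be told about UNDO discards. */
typedef struct IndexPruneCallbacks
{
    void        (*prune_on_discard) (unsigned int relid,
                                     unsigned long discarded_upto);
} IndexPruneCallbacks;

#define MAX_INDEX_AMS 16

static const IndexPruneCallbacks *prune_registry[MAX_INDEX_AMS];
static int  prune_registry_len = 0;

/* Each index AM registers once, typically at handler load time. */
static void
IndexPruneRegister(const IndexPruneCallbacks *cb)
{
    if (prune_registry_len < MAX_INDEX_AMS)
        prune_registry[prune_registry_len++] = cb;
}

/* Called from the UNDO discard path: fan out to every registered AM
 * so dead entries can be pruned before VACUUM would ever see them. */
static void
IndexPruneNotifyDiscard(unsigned int relid, unsigned long discarded_upto)
{
    for (int i = 0; i < prune_registry_len; i++)
        prune_registry[i]->prune_on_discard(relid, discarded_upto);
}

/* Example AM callback, purely for illustration: counts notifications. */
static int  demo_prune_calls = 0;

static void
demo_prune_on_discard(unsigned int relid, unsigned long discarded_upto)
{
    (void) relid;
    (void) discarded_upto;
    demo_prune_calls++;
}

static const IndexPruneCallbacks demo_callbacks = {demo_prune_on_discard};
```

Each per-AM commit below then amounts to supplying one such callback that knows how to find and remove that AM's dead entries.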

Placeholder for index pruning design documentation.
To be populated when design notes are split by subsystem.

Register IndexPrune callbacks in the B-tree access method handler.
nbtprune.c implements dead-entry detection and removal using UNDO
discard notifications, allowing proactive cleanup without full VACUUM.

Register IndexPrune callbacks in the hash access method handler.
hashprune.c implements dead-entry detection and removal using UNDO
discard notifications for hash indexes.

Register IndexPrune callbacks in the GIN access method handler.
ginprune.c implements dead-entry detection and removal using UNDO
discard notifications for GIN indexes.

Register IndexPrune callbacks in the GiST access method handler.
gistprune.c implements dead-entry detection and removal using UNDO
discard notifications for GiST indexes.

Register IndexPrune callbacks in the SP-GiST access method handler.
spgprune.c implements dead-entry detection and removal using UNDO
discard notifications for SP-GiST indexes.

Add VACUUM statistics tracking for UNDO-pruned index entries and verbose
output. Include comprehensive test suite exercising index pruning across
all supported index access methods via test_undo_tam.

Add opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.

How heap uses UNDO:

INSERT operations:
  - Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
  - Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
  - On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan

UPDATE operations:
  - Write UNDO record with complete old tuple version before update
  - On abort: UndoReplay() restores old tuple version from UNDO

DELETE operations:
  - Write UNDO record with complete deleted tuple data
  - On abort: UndoReplay() resurrects tuple from UNDO record

MVCC visibility:
  - Tuples reference UNDO chain via xmin/xmax
  - HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
  - Enables reconstructing tuple state as of any snapshot

Configuration:
  CREATE TABLE t (...) WITH (enable_undo=on);

The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
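The per-operation replay behavior listed above can be condensed into a sketch; the record layout and names here are illustrative, not the actual on-disk UNDO format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum UndoOp
{
    UNDO_INSERT,
    UNDO_UPDATE,
    UNDO_DELETE
} UndoOp;

/* One UNDO record as described above: who did it, which tuple,
 * and the prior version (NULL for INSERT, which had no prior). */
typedef struct UndoRecord
{
    UndoOp      op;
    unsigned    xid;            /* transaction that wrote the record */
    unsigned    tid;            /* tuple identifier within the heap */
    const char *old_tuple;      /* prior version; NULL for INSERT */
} UndoRecord;

/* Applying a record during abort reverses the original operation:
 * returns the tuple image that should be visible afterwards, or
 * NULL meaning the slot becomes LP_UNUSED. */
static const char *
undo_replay_sketch(const UndoRecord *rec)
{
    switch (rec->op)
    {
        case UNDO_INSERT:
            return NULL;            /* tuple vanishes: mark LP_UNUSED */
        case UNDO_UPDATE:
        case UNDO_DELETE:
            return rec->old_tuple;  /* restore the saved prior version */
    }
    return NULL;
}
```

Walking such records newest-to-oldest is also what the MVCC path does conceptually: each step reconstructs one older version until the version visible to the snapshot is reached.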

Value proposition:

1. Faster rollback: No heap scan required, UNDO chains are sequential
   - Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
   - UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)

2. Cleaner abort handling: UNDO records are self-contained
   - No need to track which heap pages were modified
   - Works across crashes (UNDO is WAL-logged)

3. Foundation for future features:
   - Multi-version concurrency control without bloat
   - Faster VACUUM (can discard entire UNDO segments)
   - Point-in-time recovery improvements

Trade-offs:

Costs:
  - Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
  - UNDO log space: Requires space for UNDO records until no longer visible
  - Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed

Benefits:
  - Primarily valuable for workloads with:
    - Frequent aborts (e.g., speculative execution, deadlocks)
    - Long-running transactions needing old snapshots
    - Hot UPDATE workloads benefiting from cleaner rollback

Not recommended for:
  - Bulk load workloads (COPY: 2x write amplification without abort benefit)
  - Append-only tables (rare aborts mean cost without benefit)
  - Space-constrained systems (UNDO retention increases storage)

When beneficial:
  - OLTP with high abort rates (>5%)
  - Systems with aggressive pruning needs (frequent VACUUM)
  - Workloads requiring historical visibility (audit, time-travel queries)

Integration points:
  - heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
  - Heap pruning respects undo_retention to avoid discarding needed UNDO
  - pg_upgrade compatibility: UNDO disabled for upgraded tables

Background workers:
  - Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
  - Rollback itself is synchronous (via UndoReplay() during transaction abort)
  - Workers periodically trim UNDO logs based on undo_retention and snapshot visibility

This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.

Implement UNDO resource manager for B-tree indexes and regression test.
When a transaction aborts, provisionally inserted index entries are marked
LP_DEAD. Includes a zero_vacuum test verifying that aborted inserts leave
no dead tuples, and a consistency check via bt_index_check().

Document the cluster-wide UNDO architecture, including the UNDO log design,
record format, transaction integration, and heap AM integration details.

Add diagnostic timing to CreateCheckPoint() that breaks down the
previously unmeasured pre-write and post-sync phases (SyncPre,
DelayStart, DelayComplete, XLogFlush, ControlFile, SyncPost,
RemoveWAL, TruncSub). When log_checkpoints is on, a new LOG line
is emitted before the existing checkpoint-complete message, making
it straightforward to diagnose slow shutdown checkpoints.

Add CheckPointUndoLog() to persist UNDO log statistics at checkpoint
time. Called from CheckPointGuts() before the buffer write phase, it
scans active UNDO logs under shared locks and logs allocated/discarded/
retained byte counts when log_checkpoints is enabled.

Increase pg_ctl stop timeout for isolation tests to 180 seconds via
PGCTLTIMEOUT environment variable, preventing false test failures when
shutdown checkpoints take longer than the default 60-second timeout.
Also add env dict support for regress/isolation tests in the root
meson.build, matching the existing pattern for TAP tests.

Add a self-contained benchmark suite comparing three scenarios:
baseline (master), undo-compiled-but-off, and undo-enabled.
Covers insert/update/delete throughput, rollback cost, VACUUM
overhead, read stability under writes, storage footprint, and
pgbench TPS with a mixed OLTP workload including 10% rollbacks.

When running from within the undo branch checkout, git worktree add
fails because the branch is already checked out. Fall back to a
detached-HEAD worktree, then to a symlink as a last resort. Also fix
cleanup to skip symlinked source directories.

ENODATA is Linux-specific. FreeBSD uses ENOATTR for "attribute not
found" from extattr operations. Define PG_ENOATTR in pg_xattr.h
that maps to the correct platform errno, and use it in fileops.c.
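The mapping might look like the following sketch (the fallback branch is an assumption for platforms that define neither constant):

```c
#include <assert.h>
#include <errno.h>

/* Pick the platform's "attribute not found" errno. */
#if defined(__FreeBSD__)
#define PG_ENOATTR ENOATTR      /* FreeBSD extattr operations */
#elif defined(ENODATA)
#define PG_ENOATTR ENODATA      /* Linux setxattr/getxattr */
#else
#define PG_ENOATTR ENOENT       /* conservative fallback (assumed) */
#endif
```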

psql does not expand :variables inside $$ string constants, so the
DO block loop limit was not being set. Use a temp function with an
integer parameter instead.

SQL rewrites (b2, b3, b4, b6, b7): replace repeated full-table
UPDATE/DELETE with single-row PK lookups, small batch operations,
cross-table updates, and 1-5% targeted mutations. This produces
realistic OLTP measurements and avoids multi-hour runtimes on
small systems.

Portability: add get_nproc() and get_dir_bytes() helpers for
Illumos (psrinfo, du -sk) and expand record_sysinfo() to handle
Illumos prtconf/psrinfo and non-GNU coreutils.

The Solaris/Illumos linker fails with an undefined ldap_start_tls_s
symbol when building postgres. Disable LDAP for benchmarks on SunOS.

PG19 changed "excluding connections establishing" to
"without initial connection time" in pgbench output. Match both formats.

Illumos date(1) does not support -Iseconds. Use the portable
strftime format '+%Y-%m-%dT%H:%M:%S%z' as a fallback.

Illumos sort(1) does not support -g (general numeric sort).
Use -n, which handles our decimal values correctly.