
tests: add catastrophic single-node failure recovery integration test #99

Merged
mason-sharp merged 2 commits into main from recovery-single-node-test
Apr 1, 2026
Conversation


@rasifr rasifr commented Apr 1, 2026

Summary

Adds TestCatastrophicSingleNodeFailure, an end-to-end integration test that exercises Scenario 1 from the Spock catastrophic-node-failure recovery documentation.

Real replication lag is induced by disabling n2's Spock subscription to n1 (spock.sub_disable) before writing on n1. n3 receives all transactions normally; n2 genuinely never processes those WAL records. n1 is then stopped before n2 can catch up, producing a real LSN gap rather than artificially deleted rows.

Covered cases:

  • Missing inserts: rows inserted on n1 while n2's subscription is paused
  • Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
  • Preserve-origin: repaired rows on n2 carry n1's origin ID and original commit timestamp to microsecond precision

Relates to SPOC-492.

Adds TestCatastrophicSingleNodeFailure which exercises Scenario 1 from
the Spock catastrophic-node-failure recovery documentation end-to-end.

Real replication lag is induced by disabling n2's Spock subscription to
n1 (spock.sub_disable) before inserting rows and an UPDATE on n1.  n3
receives all transactions normally; n2 genuinely never processes those
WAL records.  n1 is then stopped before n2 can catch up, producing a
real LSN gap rather than artificially deleted rows.

Covered cases:
- Missing inserts: rows inserted on n1 while n2's subscription is down
- Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
- Preserve-origin: repaired rows on n2 carry n1's origin ID and original
  commit timestamp to microsecond precision

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c06c581b-c3d1-4fbf-830b-ef7196e08960

📥 Commits

Reviewing files that changed from the base of the PR and between 670946b and 2901440.

📒 Files selected for processing (1)
  • tests/integration/recovery_single_node_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/recovery_single_node_test.go

📝 Walkthrough

Walkthrough

Adds a new integration test that simulates a single-node catastrophic failure in a three-node Spock replication setup, exercises diff and repair workflows with AgainstOrigin and Until fencing, and verifies post-repair convergence and preservation of Spock metadata.

Changes

File(s): tests/integration/recovery_single_node_test.go (integration test + helpers)
Summary: Adds TestCatastrophicSingleNodeFailure and two helpers (writeN2N3ClusterJSON, findSubscriptionName). The test creates schemas on three nodes, induces replication lag, fences the origin node, runs table-diff with AgainstOrigin/Until, repairs survivors with SourceOfTruth=n3 and PreserveOrigin=true, and asserts convergence and metadata (node_origin, commit_ts).

Poem

🐇 I nibble logs where three nodes play,
I pause one heartbeat, then patch the fray.
Diffs in my pouch, repairs by moonlight,
Origins guarded, commits set right.
Hoppity-hop — the cluster sleeps tight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: Passed. The title accurately and concisely summarizes the main change: adding an integration test for catastrophic single-node failure recovery.
  • Description check: Passed. The description clearly relates to the changeset by explaining what the new test does, how it induces replication lag, the scenarios it covers, and references the related issue SPOC-492.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



codacy-production bot commented Apr 1, 2026

Up to standards ✅

🟢 Issues: 3 new issues
  • Complexity: 3 medium
View in Codacy

🟢 Metrics
  • Complexity: 27
  • Duplication: 4
View in Codacy



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/integration/recovery_single_node_test.go (1)

60-61: Use a unique temp cluster file name to avoid collisions across runs.

Using a fixed test_cluster_n2_n3.json in the working directory can collide in parallel or repeated test execution.

Suggested fix
+ import "path/filepath"
...
- clusterName := "test_cluster_n2_n3"
+ clusterName := fmt.Sprintf("test_cluster_n2_n3_%d", time.Now().UnixNano())
...
- path := clusterName + ".json"
+ path := filepath.Join(t.TempDir(), clusterName+".json")
  require.NoError(t, os.WriteFile(path, data, 0644))
- t.Cleanup(func() { os.Remove(path) })
  return clusterName

Also applies to: 113-116

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/recovery_single_node_test.go` around lines 60 - 61, The
test currently writes a fixed file name ("test_cluster_n2_n3.json") using
clusterName and types.ClusterConfig which can collide across parallel runs;
change the test to create a unique temp file (e.g., via os.CreateTemp or
t.TempDir()+unique suffix) and write the cluster JSON to that temp path, use
that generated path wherever the fixed filename is referenced (including the
other occurrence around lines 113-116), and ensure the temp file is cleaned up
or created inside t.TempDir() so parallel tests don't conflict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/recovery_single_node_test.go`:
- Around line 220-225: The test disables the n2→n1 subscription via calling
spock.sub_disable for the subscription name found by findSubscriptionName (using
pgCluster.Node2Pool.Exec), but never re-enables it, causing cross-test state
leakage; modify the test to restore the subscription during cleanup (e.g., use a
deferred or explicit cleanup step) by calling spock.sub_disable('<subName>',
false) or the appropriate spock sub_enable equivalent on the same subscription
name found by findSubscriptionName so the subscription is re-enabled after the
test completes; apply the same fix to the other occurrence around lines 305-309.
- Around line 384-391: The loop over n3Rows silently skips rows missing
"_spock_metadata_"; change it to assert presence and type immediately: inside
the for _, row := range n3Rows loop call metaRaw, ok :=
row.Get("_spock_metadata_") and use require.True(t, ok, "_spock_metadata_ must
be present in diff row") (or require.NotNil/Fail) to fail hard if missing, then
assert the type with require.IsType or require.True(t, meta, "map[string]any")
using the meta := metaRaw.(map[string]any) type assertion guarded by a
require.True on the assertion, and finally keep the require.Contains(t, meta,
"node_origin", ...) and require.Contains(t, meta, "commit_ts", ...) checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a001feac-8646-4a39-8d11-e4185ddb7c29

📥 Commits

Reviewing files that changed from the base of the PR and between 4d40476 and 670946b.

📒 Files selected for processing (1)
  • tests/integration/recovery_single_node_test.go

Comment on lines +220 to +225
subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
_, err = pgCluster.Node2Pool.Exec(ctx,
	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
)
require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
log.Printf("Disabled n2 subscription %q to n1 – real lag begins", subName)

⚠️ Potential issue | 🟠 Major

Restore the n2→n1 subscription in cleanup to avoid cross-test state leakage.

Line 220 disables replication from n1 to n2, but no cleanup re-enables it. This can break later integration tests that assume normal topology.

Suggested fix
  subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
- _, err = pgCluster.Node2Pool.Exec(ctx,
- 	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
- )
+ _, err = pgCluster.Node2Pool.Exec(ctx,
+ 	"SELECT spock.sub_disable($1, true)",
+ 	subName,
+ )
  require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
+ t.Cleanup(func() {
+ 	if _, e := pgCluster.Node2Pool.Exec(ctx, "SELECT spock.sub_enable($1, true)", subName); e != nil {
+ 		t.Logf("cleanup: failed to re-enable n2 subscription %q: %v", subName, e)
+ 	}
+ })

Also applies to: 305-309


- Use UnixNano-suffixed cluster name to avoid file collisions in
  parallel runs
- Re-enable n2's Spock subscription in t.Cleanup to prevent cross-test
  state leakage
- Hard-fail on missing _spock_metadata_ instead of silently skipping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mason-sharp mason-sharp merged commit feec19c into main Apr 1, 2026
3 of 4 checks passed