
tests: add catastrophic single-node failure recovery integration test #99

Merged
mason-sharp merged 2 commits into main from recovery-single-node-test
Apr 1, 2026
Conversation


@rasifr rasifr commented Apr 1, 2026

Summary

Adds TestCatastrophicSingleNodeFailure, an end-to-end integration test that exercises Scenario 1 from the Spock catastrophic-node-failure recovery documentation.

Real replication lag is induced by disabling n2's Spock subscription to n1 (spock.sub_disable) before writing on n1. n3 receives all transactions normally; n2 genuinely never processes those WAL records. n1 is then stopped before n2 can catch up, producing a real LSN gap rather than artificially deleted rows.

Covered cases:

  • Missing inserts: rows inserted on n1 while n2's subscription is paused
  • Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
  • Preserve-origin: repaired rows on n2 carry n1's origin ID and original commit timestamp to microsecond precision

Relates to SPOC-492.

Adds TestCatastrophicSingleNodeFailure which exercises Scenario 1 from
the Spock catastrophic-node-failure recovery documentation end-to-end.

Real replication lag is induced by disabling n2's Spock subscription to
n1 (spock.sub_disable) before inserting rows and an UPDATE on n1.  n3
receives all transactions normally; n2 genuinely never processes those
WAL records.  n1 is then stopped before n2 can catch up, producing a
real LSN gap rather than artificially deleted rows.

Covered cases:
- Missing inserts: rows inserted on n1 while n2's subscription is down
- Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
- Preserve-origin: repaired rows on n2 carry n1's origin ID and original
  commit timestamp to microsecond precision

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c06c581b-c3d1-4fbf-830b-ef7196e08960

📥 Commits

Reviewing files that changed from the base of the PR and between 670946b and 2901440.

📒 Files selected for processing (1)
  • tests/integration/recovery_single_node_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/recovery_single_node_test.go

📝 Walkthrough

Walkthrough

Adds a new integration test that simulates a single-node catastrophic failure in a three-node Spock replication setup, exercises diff and repair workflows with AgainstOrigin and Until fencing, and verifies post-repair convergence and preservation of Spock metadata.

Changes

File(s): tests/integration/recovery_single_node_test.go (integration test + helpers)
Summary: Adds TestCatastrophicSingleNodeFailure and two helpers (writeN2N3ClusterJSON, findSubscriptionName). The test creates schemas on three nodes, induces replication lag, fences the origin node, runs table-diff with AgainstOrigin/Until, repairs survivors with SourceOfTruth=n3 and PreserveOrigin=true, and asserts convergence and metadata (node_origin, commit_ts).

Poem

🐇 I nibble logs where three nodes play,
I pause one heartbeat, then patch the fray.
Diffs in my pouch, repairs by moonlight,
Origins guarded, commits set right.
Hoppity-hop — the cluster sleeps tight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: Passed. The title accurately and concisely summarizes the main change: adding an integration test for catastrophic single-node failure recovery.
  • Description check: Passed. The description clearly relates to the changeset by explaining what the new test does, how it induces replication lag, the scenarios it covers, and references the related issue SPOC-492.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



codacy-production bot commented Apr 1, 2026

Up to standards ✅

🟢 Issues: 3 new issues
  • Complexity: 3 medium
View in Codacy

🟢 Metrics
  • Complexity: 27
  • Duplication: 4
View in Codacy



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/integration/recovery_single_node_test.go (1)

60-61: Use a unique temp cluster file name to avoid collisions across runs.

Using a fixed test_cluster_n2_n3.json in the working directory can collide in parallel or repeated test execution.

Suggested fix
+ import "path/filepath"
...
- clusterName := "test_cluster_n2_n3"
+ clusterName := fmt.Sprintf("test_cluster_n2_n3_%d", time.Now().UnixNano())
...
- path := clusterName + ".json"
+ path := filepath.Join(t.TempDir(), clusterName+".json")
  require.NoError(t, os.WriteFile(path, data, 0644))
- t.Cleanup(func() { os.Remove(path) })
  return clusterName

Also applies to: 113-116

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/recovery_single_node_test.go` around lines 60 - 61, The
test currently writes a fixed file name ("test_cluster_n2_n3.json") using
clusterName and types.ClusterConfig which can collide across parallel runs;
change the test to create a unique temp file (e.g., via os.CreateTemp or
t.TempDir()+unique suffix) and write the cluster JSON to that temp path, use
that generated path wherever the fixed filename is referenced (including the
other occurrence around lines 113-116), and ensure the temp file is cleaned up
or created inside t.TempDir() so parallel tests don't conflict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/recovery_single_node_test.go`:
- Around line 220-225: The test disables the n2→n1 subscription via calling
spock.sub_disable for the subscription name found by findSubscriptionName (using
pgCluster.Node2Pool.Exec), but never re-enables it, causing cross-test state
leakage; modify the test to restore the subscription during cleanup (e.g., use a
deferred or explicit cleanup step) by calling spock.sub_disable('<subName>',
false) or the appropriate spock sub_enable equivalent on the same subscription
name found by findSubscriptionName so the subscription is re-enabled after the
test completes; apply the same fix to the other occurrence around lines 305-309.
- Around line 384-391: The loop over n3Rows silently skips rows missing
"_spock_metadata_"; change it to assert presence and type immediately: inside
the for _, row := range n3Rows loop call metaRaw, ok :=
row.Get("_spock_metadata_") and use require.True(t, ok, "_spock_metadata_ must
be present in diff row") (or require.NotNil/Fail) to fail hard if missing, then
assert the type with require.IsType or require.True(t, meta, "map[string]any")
using the meta := metaRaw.(map[string]any) type assertion guarded by a
require.True on the assertion, and finally keep the require.Contains(t, meta,
"node_origin", ...) and require.Contains(t, meta, "commit_ts", ...) checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a001feac-8646-4a39-8d11-e4185ddb7c29

📥 Commits

Reviewing files that changed from the base of the PR and between 4d40476 and 670946b.

📒 Files selected for processing (1)
  • tests/integration/recovery_single_node_test.go

Comment on lines +220 to +225
subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
_, err = pgCluster.Node2Pool.Exec(ctx,
	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
)
require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
log.Printf("Disabled n2 subscription %q to n1 – real lag begins", subName)

⚠️ Potential issue | 🟠 Major

Restore the n2→n1 subscription in cleanup to avoid cross-test state leakage.

Line 220 disables replication from n1 to n2, but no cleanup re-enables it. This can break later integration tests that assume normal topology.

Suggested fix
  subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
- _, err = pgCluster.Node2Pool.Exec(ctx,
- 	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
- )
+ _, err = pgCluster.Node2Pool.Exec(ctx,
+ 	"SELECT spock.sub_disable($1, true)",
+ 	subName,
+ )
  require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
+ t.Cleanup(func() {
+ 	if _, e := pgCluster.Node2Pool.Exec(ctx, "SELECT spock.sub_enable($1, true)", subName); e != nil {
+ 		t.Logf("cleanup: failed to re-enable n2 subscription %q: %v", subName, e)
+ 	}
+ })

Also applies to: 305-309


- Use UnixNano-suffixed cluster name to avoid file collisions in
  parallel runs
- Re-enable n2's Spock subscription in t.Cleanup to prevent cross-test
  state leakage
- Hard-fail on missing _spock_metadata_ instead of silently skipping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mason-sharp mason-sharp merged commit feec19c into main Apr 1, 2026
3 of 4 checks passed