tests: add catastrophic single-node failure recovery integration test #99

mason-sharp merged 2 commits into main
Conversation
Adds TestCatastrophicSingleNodeFailure, which exercises Scenario 1 from the Spock catastrophic-node-failure recovery documentation end-to-end.

Real replication lag is induced by disabling n2's Spock subscription to n1 (spock.sub_disable) before inserting rows and running an UPDATE on n1. n3 receives all transactions normally; n2 genuinely never processes those WAL records. n1 is then stopped before n2 can catch up, producing a real LSN gap rather than artificially deleted rows.

Covered cases:
- Missing inserts: rows inserted on n1 while n2's subscription is down
- Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
- Preserve-origin: repaired rows on n2 carry n1's origin ID and original commit timestamp to microsecond precision

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
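The lag-induction sequence described above can be sketched as an ordered plan of statements. This is a sketch only: `spock.sub_disable` is the real Spock call, but the subscription name, table, and values below are illustrative placeholders, not the test's actual data.

```go
package main

import "fmt"

// lagInductionPlan returns, in order, the SQL the test issues to create
// genuine replication lag on n2: disable n2's subscription to n1, then
// write on n1 so those WAL records reach n3 but never n2. n1 is stopped
// out-of-band afterwards, leaving a real LSN gap on n2.
// The subscription name, table, and values are placeholders.
func lagInductionPlan(subName string) []string {
	return []string{
		// on n2: stop applying changes from n1 immediately
		fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
		// on n1: these rows replicate to n3 only
		"INSERT INTO accounts (id, balance) VALUES (101, 500), (102, 750)",
		"UPDATE accounts SET balance = 900 WHERE id = 1",
	}
}

func main() {
	for i, stmt := range lagInductionPlan("sub_n2_n1") {
		fmt.Printf("step %d: %s\n", i+1, stmt)
	}
}
```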
|
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a new integration test that simulates a single-node catastrophic failure in a three-node Spock replication setup, exercises diff and repair workflows with AgainstOrigin and Until fencing, and verifies post-repair convergence and preservation of Spock metadata.
🚥 Pre-merge checks: 3 passed ✅
Up to standards ✅

Issues

| Category | Results |
|---|---|
| Complexity | 3 medium |

Metrics

| Metric | Result |
|---|---|
| Complexity | 27 |
| Duplication | 4 |
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/integration/recovery_single_node_test.go (1)
60-61: Use a unique temp cluster file name to avoid collisions across runs.

Using a fixed test_cluster_n2_n3.json in the working directory can collide in parallel or repeated test execution.

Suggested fix:

```diff
+ import "path/filepath"
  ...
- clusterName := "test_cluster_n2_n3"
+ clusterName := fmt.Sprintf("test_cluster_n2_n3_%d", time.Now().UnixNano())
  ...
- path := clusterName + ".json"
+ path := filepath.Join(t.TempDir(), clusterName+".json")
  require.NoError(t, os.WriteFile(path, data, 0644))
- t.Cleanup(func() { os.Remove(path) })
  return clusterName
```

Also applies to: 113-116
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/recovery_single_node_test.go` around lines 60 - 61, The test currently writes a fixed file name ("test_cluster_n2_n3.json") using clusterName and types.ClusterConfig which can collide across parallel runs; change the test to create a unique temp file (e.g., via os.CreateTemp or t.TempDir()+unique suffix) and write the cluster JSON to that temp path, use that generated path wherever the fixed filename is referenced (including the other occurrence around lines 113-116), and ensure the temp file is cleaned up or created inside t.TempDir() so parallel tests don't conflict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/recovery_single_node_test.go`:
- Around line 220-225: The test disables the n2→n1 subscription via calling
spock.sub_disable for the subscription name found by findSubscriptionName (using
pgCluster.Node2Pool.Exec), but never re-enables it, causing cross-test state
leakage; modify the test to restore the subscription during cleanup (e.g., use a
deferred or explicit cleanup step) by calling spock.sub_disable('<subName>',
false) or the appropriate spock sub_enable equivalent on the same subscription
name found by findSubscriptionName so the subscription is re-enabled after the
test completes; apply the same fix to the other occurrence around lines 305-309.
- Around line 384-391: The loop over n3Rows silently skips rows missing
"_spock_metadata_"; change it to assert presence and type immediately: inside
the for _, row := range n3Rows loop call metaRaw, ok :=
row.Get("_spock_metadata_") and use require.True(t, ok, "_spock_metadata_ must
be present in diff row") (or require.NotNil/Fail) to fail hard if missing, then
assert the type with require.IsType or require.True(t, meta, "map[string]any")
using the meta := metaRaw.(map[string]any) type assertion guarded by a
require.True on the assertion, and finally keep the require.Contains(t, meta,
"node_origin", ...) and require.Contains(t, meta, "commit_ts", ...) checks.
---
Nitpick comments:
In `@tests/integration/recovery_single_node_test.go`:
- Around line 60-61: The test currently writes a fixed file name
("test_cluster_n2_n3.json") using clusterName and types.ClusterConfig which can
collide across parallel runs; change the test to create a unique temp file
(e.g., via os.CreateTemp or t.TempDir()+unique suffix) and write the cluster
JSON to that temp path, use that generated path wherever the fixed filename is
referenced (including the other occurrence around lines 113-116), and ensure the
temp file is cleaned up or created inside t.TempDir() so parallel tests don't
conflict.
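The hard-fail metadata check described in the second inline-comment prompt above can be sketched as a standalone helper. This is a sketch: plain errors stand in for testify's `require` calls, and the diff-row shape is assumed to be `map[string]any` as the prompt describes.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSpockMetadata fails hard when a diff row lacks the
// _spock_metadata_ column, carries it with an unexpected type, or is
// missing the node_origin / commit_ts keys — instead of silently
// skipping the row.
func validateSpockMetadata(row map[string]any) error {
	metaRaw, ok := row["_spock_metadata_"]
	if !ok {
		return errors.New("_spock_metadata_ must be present in diff row")
	}
	meta, ok := metaRaw.(map[string]any)
	if !ok {
		return fmt.Errorf("_spock_metadata_ has type %T, want map[string]any", metaRaw)
	}
	for _, key := range []string{"node_origin", "commit_ts"} {
		if _, present := meta[key]; !present {
			return fmt.Errorf("_spock_metadata_ missing %q", key)
		}
	}
	return nil
}

func main() {
	good := map[string]any{
		"_spock_metadata_": map[string]any{"node_origin": 1, "commit_ts": "2024-01-01T00:00:00.000001Z"},
	}
	fmt.Println(validateSpockMetadata(good)) // <nil>
	fmt.Println(validateSpockMetadata(map[string]any{}) != nil)
}
```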
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a001feac-8646-4a39-8d11-e4185ddb7c29
📒 Files selected for processing (1)
tests/integration/recovery_single_node_test.go
```go
subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
_, err = pgCluster.Node2Pool.Exec(ctx,
	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
)
require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
log.Printf("Disabled n2 subscription %q to n1 – real lag begins", subName)
```
Restore the n2→n1 subscription in cleanup to avoid cross-test state leakage.
Line 220 disables replication from n1 to n2, but no cleanup re-enables it. This can break later integration tests that assume normal topology.
Suggested fix:

```diff
  subName := findSubscriptionName(t, pgCluster.Node2Pool, "n1")
- _, err = pgCluster.Node2Pool.Exec(ctx,
- 	fmt.Sprintf("SELECT spock.sub_disable('%s', true)", subName),
- )
+ _, err = pgCluster.Node2Pool.Exec(ctx,
+ 	"SELECT spock.sub_disable($1, true)",
+ 	subName,
+ )
  require.NoError(t, err, "disable n2 subscription to n1 (%s)", subName)
+ t.Cleanup(func() {
+ 	if _, e := pgCluster.Node2Pool.Exec(ctx, "SELECT spock.sub_enable($1, true)", subName); e != nil {
+ 		t.Logf("cleanup: failed to re-enable n2 subscription %q: %v", subName, e)
+ 	}
+ })
```

Also applies to: 305-309
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/recovery_single_node_test.go` around lines 220 - 225, The
test disables the n2→n1 subscription via calling spock.sub_disable for the
subscription name found by findSubscriptionName (using
pgCluster.Node2Pool.Exec), but never re-enables it, causing cross-test state
leakage; modify the test to restore the subscription during cleanup (e.g., use a
deferred or explicit cleanup step) by calling spock.sub_disable('<subName>',
false) or the appropriate spock sub_enable equivalent on the same subscription
name found by findSubscriptionName so the subscription is re-enabled after the
test completes; apply the same fix to the other occurrence around lines 305-309.
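The restore-in-cleanup shape recommended here can be sketched without a database: a mock subscription toggled by the scenario, with a deferred restore playing the role of `t.Cleanup`. All names are illustrative stand-ins for the real Spock calls.

```go
package main

import "fmt"

// subscriptionState is a stand-in for n2's real Spock subscription.
type subscriptionState struct{ enabled bool }

func (s *subscriptionState) disable() { s.enabled = false }
func (s *subscriptionState) enable()  { s.enabled = true }

// runScenario disables the subscription for the duration of the failure
// scenario and always restores it afterwards — the same shape as
// registering a t.Cleanup that calls spock.sub_enable right after the
// spock.sub_disable call in the test.
func runScenario(sub *subscriptionState, scenario func()) {
	sub.disable()
	defer sub.enable() // restore topology even if the scenario panics
	scenario()
}

func main() {
	sub := &subscriptionState{enabled: true}
	runScenario(sub, func() {
		fmt.Println("during scenario, enabled =", sub.enabled) // false
	})
	fmt.Println("after scenario, enabled =", sub.enabled) // true
}
```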
- Use UnixNano-suffixed cluster name to avoid file collisions in parallel runs
- Re-enable n2's Spock subscription in t.Cleanup to prevent cross-test state leakage
- Hard-fail on missing _spock_metadata_ instead of silently skipping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Adds TestCatastrophicSingleNodeFailure, an end-to-end integration test that exercises Scenario 1 from the Spock catastrophic-node-failure recovery documentation.

Real replication lag is induced by disabling n2's Spock subscription to n1 (spock.sub_disable) before writing on n1. n3 receives all transactions normally; n2 genuinely never processes those WAL records. n1 is then stopped before n2 can catch up, producing a real LSN gap rather than artificially deleted rows.

Covered cases:
- Missing inserts: rows inserted on n1 while n2's subscription is down
- Stale update: n3 is ahead of n2 (n3 has v2, n2 still has v1)
- Preserve-origin: repaired rows on n2 carry n1's origin ID and original commit timestamp to microsecond precision
Relates to SPOC-492.