Conversation
Auto-detect whether spock is installed and branch accordingly across all repair, diff, rerun, and merkle code paths:
- Add detectSpock() to table repair; conditionally call spock.repair_mode() only when spock is present, fall back to session_replication_role alone
- Replace spock.xact_commit_timestamp_origin() with native PG14+ pg_xact_commit_timestamp_origin() in diff, rerun, and merkle queries
- Add GetNodeOriginNames() wrapper that uses spock.node when available, falls back to pg_replication_origin for node name resolution
- Add native PG alternatives for LSN queries (pg_subscription-based)
- Add CheckSpockInstalled() utility; spock-diff and repset-diff now return a clear error when spock is not installed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tive PG

Introduce a testEnv struct that encapsulates environment-specific state (pools, service names, cluster config, HasSpock flag), allowing the same test logic to run against both spock-replicated and vanilla PostgreSQL clusters with zero code duplication.

Refactor 17 test functions (9 repair, 8 diff) to accept *testEnv. Add docker-compose-native.yaml with two postgres:17 containers and a TestNativePG suite that exercises all applicable shared tests (40 subtests) against standalone PG nodes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded `spock.` schema prefix with a template function
`{{aceSchema}}` that reads from config at render time. Defaults to
"spock" for backward compatibility but allows alternate schemas
(e.g. "ace") via the existing mtree.schema config key.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor all merkle tree test functions to accept *testEnv, replacing pgCluster globals and spock.repair_mode() calls with env-aware abstractions. All 28 merkle tree subtests (Init, Build, Diff, Merge, Split, CDC, Teardown, NumericScaleInvariance) now run on both spock and native PostgreSQL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Important: Review skipped (draft detected). Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: Organization UI. Review profile: CHILL. Plan: Pro.
📝 Walkthrough

This PR extends the tool to support both Spock and native PostgreSQL replication environments. It adds database query functions for native LSN/origin resolution, introduces SQL templates with schema-agnostic metadata handling, implements runtime Spock detection with conditional feature usage, and provides a comprehensive test environment abstraction and native PostgreSQL integration test suite.
🚥 Pre-merge checks: 2 passed, 1 failed (1 warning)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
internal/consistency/mtree/merkle.go (1)
501-520: ⚠️ Potential issue | 🟠 Major

Keep per-node origin-ID mappings instead of sharing one node's map.

`buildFetchRowsSQL*` now extracts the raw `roident` from `pg_xact_commit_timestamp_origin(xmin)`, but `loadSpockNodeNames()` returns after loading just the first available node's origin names, then applies that single mapping to rows from every node via `TranslateNodeOrigin()`. Each PostgreSQL node maintains its own `pg_replication_origin` catalog, so the same numeric `roident` value may represent different origins on different nodes. This will cause `node_origin` to be mislabeled in Merkle diff output. Either maintain per-node mappings or ensure origin lookups happen per node as rows are processed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/consistency/mtree/merkle.go` around lines 501 - 520, loadSpockNodeNames currently stops after the first successful node and sets a single SpockNodeNames map which is then (incorrectly) used for all nodes; change the task to store per-node origin mappings (e.g. SpockNodeNames map[string]map[uint32]string keyed by node identifier), have loadSpockNodeNames populate the map for each ClusterNodes entry (do not return after first success and cache failures per-node), and update places that call TranslateNodeOrigin (and the code paths that use the raw roident produced by buildFetchRowsSQL* functions) to look up the origin name from the per-node map for that specific node. Ensure connections are closed on every iteration and errors are collected/returned appropriately if a node's mapping cannot be loaded.

internal/consistency/diff/table_diff.go (1)
186-209: ⚠️ Potential issue | 🟠 Major

roident from the first pool reused across all nodes breaks filtering and metadata on native PostgreSQL.

`loadSpockNodeNames()` loads origin-to-name mappings from a single pool, while both `buildEffectiveFilter()` (line 255) and `fetchRows()` (line 449) use that mapping to filter rows and label metadata via `pg_xact_commit_timestamp_origin(xmin)`. PostgreSQL documents that roident values are specific to each cluster's `pg_replication_origin` catalog—they differ across instances. Applying one node's roident mapping (e.g., `roident = '1'` from the first pool) to other nodes will filter wrong rows and mislabel `node_origin` in diff output, particularly in native replication setups where roident IDs are never globally synchronized.

This needs per-node origin resolution: either query each pool separately for its mappings, or translate roident to stable identifiers (like `roname`) within each node's query before comparison and serialization.

Also applies to: 253-256, 448-450; affects the `table_rerun.go` ExecuteRerunTask path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/consistency/diff/table_diff.go` around lines 186 - 209, loadSpockNodeNames currently queries only the first pool and reuses that origin->name mapping across all nodes, which is incorrect for per-cluster roident values; change TableDiffTask.loadSpockNodeNames to iterate t.Pools and call queries.GetNodeOriginNames for each pool, storing the result keyed by pool identifier (e.g., pool name or connection string) into a per-pool mapping (replace t.SpockNodeNames map[string]string with map[string]map[string]string or similar), then update callers buildEffectiveFilter and fetchRows (and table_rerun.go's ExecuteRerunTask path) to look up the appropriate per-pool mapping when filtering or labeling using pg_xact_commit_timestamp_origin(xmin) so each node resolves roident->roname using its own pool's mapping.

internal/consistency/repair/table_repair.go (1)
195-204: ⚠️ Potential issue | 🔴 Critical

Add the `LOCAL` keyword to make `session_replication_role` transaction-scoped.

`SET session_replication_role` persists after `COMMIT` on the session and will leak into subsequent transactions when the connection is returned to the pool. Use `SET LOCAL session_replication_role` instead to ensure it reverts at transaction end and does not affect later repairs on the same pooled connection.

Suggested fix:

```diff
 if t.FireTriggers {
-	_, err = tx.Exec(t.Ctx, "SET session_replication_role = 'local'")
+	_, err = tx.Exec(t.Ctx, "SET LOCAL session_replication_role = 'local'")
 } else {
-	_, err = tx.Exec(t.Ctx, "SET session_replication_role = 'replica'")
+	_, err = tx.Exec(t.Ctx, "SET LOCAL session_replication_role = 'replica'")
 }
```
Verify each finding against the current code and only fix it if needed. In `@internal/consistency/repair/table_repair.go` around lines 195 - 204, The session_replication_role is being set with a non-transaction-scoped command; update the two tx.Exec calls in table_repair.go (the block using t.FireTriggers and tx.Exec) to use "SET LOCAL session_replication_role = 'local'" and "SET LOCAL session_replication_role = 'replica'" so the setting is transaction-scoped and will revert at transaction end; keep the existing error handling (returning fmt.Errorf(...) on err) and the logger.Debug call as-is (logger.Debug("session_replication_role set on %s (fire_triggers: %v)", nodeName, t.FireTriggers)).
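The per-node origin mapping recommended for merkle.go and table_diff.go above can be sketched as follows. This is a minimal illustration, not the repository's actual API: the `NodeOriginMap` type and `ResolveOrigin` helper are assumed names, and the roident values are hard-coded stand-ins for what each node's `pg_replication_origin` catalog would return.

```go
package main

import "fmt"

// NodeOriginMap keys origin names by node first, then by origin ID,
// because a roident is only meaningful within the node that issued it.
type NodeOriginMap map[string]map[uint32]string

// ResolveOrigin translates a raw roident using the mapping of the node
// the row was read from, falling back to a numeric label when unknown.
func ResolveOrigin(m NodeOriginMap, node string, roident uint32) string {
	if names, ok := m[node]; ok {
		if name, ok := names[roident]; ok {
			return name
		}
	}
	return fmt.Sprintf("origin-%d", roident)
}

func main() {
	m := NodeOriginMap{
		"n1": {1: "n2"}, // on n1, roident 1 refers to n2
		"n2": {1: "n1"}, // on n2, the same roident refers to n1
	}
	fmt.Println(ResolveOrigin(m, "n1", 1)) // n2
	fmt.Println(ResolveOrigin(m, "n2", 1)) // n1
	fmt.Println(ResolveOrigin(m, "n1", 9)) // origin-9
}
```

The point of the nested map is exactly the reviewer's: the same numeric ID resolves differently depending on which node the row came from.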
🧹 Nitpick comments (5)
tests/integration/table_repair_test.go (3)
1239-1243: Multiple Spock-specific tests not refactored.
`TestTableRepair_PreserveOrigin` and related tests (`FixNulls_PreserveOrigin`, `Bidirectional_PreserveOrigin`, `MixedOps_PreserveOrigin`) directly use Spock-specific features without the `testEnv` abstraction. These tests should either be documented as Spock-only or have skip guards added for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/table_repair_test.go` around lines 1239 - 1243, TestTableRepair_PreserveOrigin and its related tests (FixNulls_PreserveOrigin, Bidirectional_PreserveOrigin, MixedOps_PreserveOrigin) use Spock-specific features directly; update them to either be documented as Spock-only or add a skip guard using the test environment helper (e.g., call testEnv.RequireSpock() or t.SkipUnlessSpock() at the start of each test) so they only run when Spock is available; modify the beginning of functions TestTableRepair_PreserveOrigin, FixNulls_PreserveOrigin, Bidirectional_PreserveOrigin, and MixedOps_PreserveOrigin to invoke the shared testEnv skip/check helper (or add t.Skip with a clear message) before any Spock-specific setup.
5-5: Copyright year inconsistency.

This file uses `2023 - 2025` while other files in this PR use `2023 - 2026`. Consider updating for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/table_repair_test.go` at line 5, Update the copyright header in the file to match the project's year range by changing "2023 - 2025" to "2023 - 2026" in the file header comment (the top-line copyright comment string).
573-750: Inconsistent refactoring: test still uses `pgCluster` directly.

`TestTableRepair_VariousDataTypes` directly uses `pgCluster.Node1Pool`, `spock.repair_mode()`, and `spock.repset_add_table()` without the `testEnv` abstraction. This test will fail if run in a native PG environment.

If this test is intentionally Spock-only (due to repset usage), consider adding a skip guard or documenting it. Otherwise, consider refactoring it to use `testEnv` like the other tests.

♻️ Proposed fix: Add skip guard if Spock-only

```diff
 func TestTableRepair_VariousDataTypes(t *testing.T) {
+	// This test requires Spock for repset functionality
+	env := newSpockEnv()
+	if !env.HasSpock {
+		t.Skip("Skipping: requires Spock extension")
+	}
+
 	ctx := context.Background()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/table_repair_test.go` around lines 573 - 750, This test uses Spock-specific APIs (pgCluster.Node1Pool/Node2Pool, spock.repair_mode, spock.repset_add_table) but wasn’t guarded or refactored to testEnv; either add an explicit skip at the top of TestTableRepair_VariousDataTypes (e.g., if !testEnv.IsSpock() { t.Skip("spock-only test") }) or refactor the test to use testEnv helpers (replace pgCluster.Node1Pool/Node2Pool with testEnv-provided pools and replace direct spock SQL calls with testEnv helper methods for repset_add_table and repair_mode) so the test runs correctly in non-Spock environments.tests/integration/test_env_test.go (2)
202-202: Unused parameter `composite` in `setupDivergence`.

The `composite` parameter is accepted but never used in the method body. Consider removing it if not needed, or document why it's retained for API consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_env_test.go` at line 202, The parameter `composite` on function `setupDivergence` is unused; remove it from the function signature and from all call sites (update any tests calling `setupDivergence(t, ctx, qualifiedTableName, composite)` to `setupDivergence(t, ctx, qualifiedTableName)`), or if you must preserve the API, explicitly mark it as used by adding a no-op reference (for example `_ = composite`) inside `setupDivergence` so linters/tests stop flagging it as unused; adjust imports/usages accordingly.
357-357: Hardcoded sleep may cause flaky tests or slow execution.

The 2-second sleep before running the repair task lacks documentation explaining why it's needed. If it's working around a race condition, consider adding a comment. If it's no longer necessary, removing it would speed up tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_env_test.go` at line 357, Replace the hardcoded time.Sleep(2 * time.Second) in the test with a deterministic wait: either poll the expected condition (e.g., loop with time.Sleep(100*time.Millisecond) and a timeout checking the repair task's observable effect or state) until success, or, if the delay is truly required, add a clear comment above the time.Sleep explaining why the pause is necessary and reference the specific repair invocation being synchronized with; ensure any polling uses a reasonable timeout and fails the test on expiry rather than hanging indefinitely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@db/queries/queries.go`:
- Around line 989-999: GetNodeOriginNames currently chooses GetSpockNodeNames
based on the spock extension existing, which mismatches replication origin OIDs;
change it to detect the ID source by querying pg_replication_origin and deciding
the translator based on whether its roident values map to spock.node.node_id or
to pg_replication_origin identifiers. Concretely: replace the extname check with
a lightweight query that inspects pg_replication_origin (optionally joining
against spock.node) to see if roident corresponds to spock.node.node_id (use a
join like pg_replication_origin.roident::text = spock.node.node_id::text or
existence of matching rows); if that test indicates spock node IDs are the
source, call GetSpockNodeNames(ctx, db), otherwise call
GetReplicationOriginNames(ctx, db). Ensure you retain error handling from the
original GetNodeOriginNames.
In `@db/queries/templates.go`:
- Around line 1325-1332: The Templates struct is missing the CompareBlocksSQL
field referenced by the struct literal; add a field named CompareBlocksSQL of
type *template.Template to the Templates struct definition so the struct literal
that sets CompareBlocksSQL compiles (ensure the field name exactly matches
CompareBlocksSQL and the type matches other template fields like those around
native LSN templates).
- Around line 1551-1568: The LIKE '%' || $1 || '%' substring matching in the
templates GetNativeOriginLSNForNode and GetNativeSlotLSNForNode can return wrong
subscriptions when node names are substrings of each other; update both WHERE
clauses to use exact matching (s.subname = $1) if subscription naming permits,
otherwise replace the LIKE with a stricter pattern that enforces boundaries (for
example a regex match using s.subname ~ ('\m' || $1 || '\M') or a LIKE that
includes explicit delimiters present in your subscription names) so the query
selects the intended subscription rather than any substring match.
In `@internal/consistency/diff/repset_diff.go`:
- Around line 158-169: The spock extension check should run for every node
before proceeding to repset discovery: remove the conditional that gates
queries.CheckSpockInstalled on len(repsetNodeNames) and call
queries.CheckSpockInstalled(c.Ctx, pool) for each node (using nodeName and
pool), closing pool and returning a formatted error on failure; if spock is not
installed on that node return an error that includes nodeName (instead of
deferring to CheckRepSetExists), then continue to CheckRepSetExists only after
the per-node spock check succeeds.
In `@internal/consistency/diff/spock_diff.go`:
- Around line 310-320: The loop over t.Pools only validates the first pool then
breaks, so some nodes may skip the spock check and fail later in
GetSpockNodeAndSubInfo; remove the premature break and instead iterate every
pool in t.Pools calling queries.CheckSpockInstalled(t.Ctx, pool), and if it
returns false return an error that includes the pool/node identifier (e.g.
pool.Name or pool.Node) so the message pinpoints the specific node missing the
spock extension; keep the same error wrapping for query failures from
CheckSpockInstalled.
In `@internal/consistency/repair/table_repair.go`:
- Around line 756-759: The code detects Spock only after validation which causes
fetchLSNsForNode()/autoSelectSourceOfTruth() to assume spockAvailable=false;
move detection earlier so spockAvailable is set once t.Pools exists or make
fetching lazy. Concretely, call t.detectSpock() immediately after t.Pools is
populated (before calling ValidateAndPrepare()) or add a lazy guard inside
fetchLSNsForNode() that invokes detectSpock() the first time it runs (and
respects t.DryRun), ensuring detectSpock() sets spockAvailable before any
LSN-fetching logic runs.
- Around line 165-180: detectSpock currently returns after the first successful
CheckSpockInstalled and uses map iteration order to set t.spockAvailable; change
it to check every entry in t.Pools (calling queries.CheckSpockInstalled for
each), record whether any node reports spockInstalled==true and any reports
false, and if both true and false are observed fail fast by returning an error
(do not silently pick the first result); otherwise set t.spockAvailable to the
unanimous boolean. Change detectSpock to return error and update its callers to
handle that error; keep logging for per-node failures but do not treat the first
successful probe as definitive.
In `@tests/integration/docker-compose-native.yaml`:
- Around line 28-30: Remove the fixed host port bindings so Docker can allocate
ephemeral host ports: in the docker-compose service definitions that currently
include "ports: - target: 5432 published: 7432" (and the similar block at
43-45), delete the "published: <port>" lines (or replace the full mapping with a
single target-only mapping) so only the container port is exposed and the tests
rely on MappedPort() in tests/integration/native_pg_test.go to discover the host
port at runtime.
In `@tests/integration/merkle_tree_test.go`:
- Around line 476-480: The goroutine attaches the CDC listener to
env.ClusterNodes[0] which can be in arbitrary order; instead find the node
matching env.ServiceN1 and pass that to cdc.ListenForChanges. Replace the
hardcoded env.ClusterNodes[0] lookup with a search over env.ClusterNodes for the
node whose name/service equals env.ServiceN1 (assign to nodeInfo), then call
cdc.ListenForChanges(ctx, nodeInfo) and keep defer wg.Done() as-is so the
listener watches the intended service.
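The unanimity rule described for detectSpock above can be isolated as a pure function. This is a sketch with illustrative names: the per-node probe results are passed in as a map so the logic needs no database, whereas the real code would populate it from `queries.CheckSpockInstalled` per pool.

```go
package main

import (
	"errors"
	"fmt"
)

// spockAvailable applies the unanimity rule: every node must agree on
// whether spock is installed; a mixed cluster fails fast instead of
// silently trusting whichever node happened to be probed first.
func spockAvailable(results map[string]bool) (bool, error) {
	var sawTrue, sawFalse bool
	for _, installed := range results {
		if installed {
			sawTrue = true
		} else {
			sawFalse = true
		}
	}
	if sawTrue && sawFalse {
		return false, errors.New("spock is installed on some nodes but not others")
	}
	return sawTrue, nil
}

func main() {
	ok, err := spockAvailable(map[string]bool{"n1": true, "n2": true})
	fmt.Println(ok, err == nil) // true true
	_, err = spockAvailable(map[string]bool{"n1": true, "n2": false})
	fmt.Println(err != nil) // true: mixed cluster is an error
}
```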
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1333b0b9-2f32-4543-bbbc-6dd36810011a
📒 Files selected for processing (16)
- db/queries/queries.go
- db/queries/templates.go
- internal/consistency/diff/repset_diff.go
- internal/consistency/diff/spock_diff.go
- internal/consistency/diff/table_diff.go
- internal/consistency/diff/table_rerun.go
- internal/consistency/mtree/merkle.go
- internal/consistency/repair/table_repair.go
- tests/integration/docker-compose-native.yaml
- tests/integration/helpers_test.go
- tests/integration/main_test.go
- tests/integration/merkle_tree_test.go
- tests/integration/native_pg_test.go
- tests/integration/table_diff_test.go
- tests/integration/table_repair_test.go
- tests/integration/test_env_test.go
```go
func GetNodeOriginNames(ctx context.Context, db DBQuerier) (map[string]string, error) {
	var spockAvailable bool
	err := db.QueryRow(ctx, "SELECT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'spock')").Scan(&spockAvailable)
	if err != nil {
		return nil, fmt.Errorf("detecting spock extension: %w", err)
	}
	if spockAvailable {
		return GetSpockNodeNames(ctx, db)
	}
	return GetReplicationOriginNames(ctx, db)
}
```
🧩 Analysis chain
🌐 Web query:
In PostgreSQL, what identifier does `pg_xact_commit_timestamp_origin()` return: the replication origin OID from `pg_replication_origin.roident`, or an extension-specific node ID such as Spock's `spock.node.node_id`?
💡 Result:
pg_xact_commit_timestamp_origin returns the replication origin OID from pg_replication_origin.roident.
Citations:
- 1: https://postgr.es/c/b1e48bbe6
- 2: https://doxygen.postgresql.org/commit__ts_8c.html
- 3: https://doxygen.postgresql.org/commit__ts_8c_source.html
- 4: https://pgpedia.info/p/pg_last_committed_xact.html
- 5: https://postgresql.org/docs/current/catalog-pg-replication-origin.html
Fix GetNodeOriginNames to select translators by ID source, not extension presence.
pg_xact_commit_timestamp_origin() returns replication origin OIDs from pg_replication_origin.roident, not Spock-specific node IDs. On Spock clusters, selecting GetSpockNodeNames() purely because the extension exists will use a map keyed by spock.node.node_id—a different namespace than the replication origin OIDs being passed in. This causes lookup failures. The translator should be chosen based on where the IDs actually come from, not whether Spock is installed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/queries/queries.go` around lines 989 - 999, GetNodeOriginNames currently
chooses GetSpockNodeNames based on the spock extension existing, which
mismatches replication origin OIDs; change it to detect the ID source by
querying pg_replication_origin and deciding the translator based on whether its
roident values map to spock.node.node_id or to pg_replication_origin
identifiers. Concretely: replace the extname check with a lightweight query that
inspects pg_replication_origin (optionally joining against spock.node) to see if
roident corresponds to spock.node.node_id (use a join like
pg_replication_origin.roident::text = spock.node.node_id::text or existence of
matching rows); if that test indicates spock node IDs are the source, call
GetSpockNodeNames(ctx, db), otherwise call GetReplicationOriginNames(ctx, db).
Ensure you retain error handling from the original GetNodeOriginNames.
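A sketch of the probe-then-choose approach the prompt describes. Both the SQL text and the function names below are assumptions for illustration; the repository's actual query and translator functions may differ, and in real code the probe result would come from executing `probeSQL` against each node's pool.

```go
package main

import "fmt"

// probeSQL is an assumed shape for the ID-source check: does any
// pg_replication_origin.roident line up with spock.node.node_id?
const probeSQL = `
SELECT EXISTS (
    SELECT 1
    FROM pg_replication_origin ro
    JOIN spock.node n ON ro.roident::text = n.node_id::text
)`

// chooseTranslator picks the name-resolution path from the probe
// result rather than from mere extension presence.
func chooseTranslator(roidentMatchesSpockNodeID bool) string {
	if roidentMatchesSpockNodeID {
		return "GetSpockNodeNames"
	}
	return "GetReplicationOriginNames"
}

func main() {
	_ = probeSQL // would be run against each node before choosing
	fmt.Println(chooseTranslator(true))  // GetSpockNodeNames
	fmt.Println(chooseTranslator(false)) // GetReplicationOriginNames
}
```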
```go
CompareBlocksSQL: template.Must(template.New("compareBlocksSQL").Parse(`
    SELECT
        *
    FROM
        {{.TableName}}
    WHERE
        {{.WhereClause}}
`)),
```
Build failure: CompareBlocksSQL field not declared in Templates struct.
The static analysis and pipeline failures confirm that CompareBlocksSQL is used in the struct literal but not declared in the Templates struct (lines 26-145). This causes a compilation error.
🐛 Proposed fix: Add the field declaration
Add the following line to the Templates struct (around line 131, before the native LSN templates):
```diff
 GetSpockSlotLSNForNode    *template.Template
+CompareBlocksSQL          *template.Template
 GetNativeOriginLSNForNode *template.Template
```

🧰 Tools
🪛 GitHub Actions: Go Integration Tests
[error] 1325-1325: Go test failed: unknown field CompareBlocksSQL in struct literal of type Templates
🪛 GitHub Check: test
[failure] 1325-1325:
unknown field CompareBlocksSQL in struct literal of type Templates
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/queries/templates.go` around lines 1325 - 1332, The Templates struct is
missing the CompareBlocksSQL field referenced by the struct literal; add a field
named CompareBlocksSQL of type *template.Template to the Templates struct
definition so the struct literal that sets CompareBlocksSQL compiles (ensure the
field name exactly matches CompareBlocksSQL and the type matches other template
fields like those around native LSN templates).
```go
GetNativeOriginLSNForNode: template.Must(template.New("getNativeOriginLSNForNode").Parse(`
    SELECT ros.remote_lsn::text
    FROM pg_catalog.pg_replication_origin_status ros
    JOIN pg_catalog.pg_replication_origin ro ON ro.roident = ros.local_id
    JOIN pg_catalog.pg_subscription s ON ro.roname LIKE 'pg_%' || s.oid::text
    WHERE s.subname LIKE '%' || $1 || '%'
      AND ros.remote_lsn IS NOT NULL
    LIMIT 1
`)),
GetNativeSlotLSNForNode: template.Must(template.New("getNativeSlotLSNForNode").Parse(`
    SELECT rs.confirmed_flush_lsn::text
    FROM pg_catalog.pg_replication_slots rs
    JOIN pg_catalog.pg_subscription s ON rs.slot_name = s.subslotname
    WHERE s.subname LIKE '%' || $1 || '%'
      AND rs.confirmed_flush_lsn IS NOT NULL
    ORDER BY rs.confirmed_flush_lsn DESC
    LIMIT 1
`)),
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Description: Check how subscription names are structured to understand if loose matching is safe
# Search for subscription name creation/usage patterns
rg -n "subname|sub_name|subscription" --type go -C3 | head -80
```

Repository: pgEdge/ace
Length of output: 4751
🏁 Script executed:
```bash
#!/bin/bash
# Find where GetNativeOriginLSNForNode and GetNativeSlotLSNForNode are called
echo "=== Searching for GetNativeOriginLSNForNode calls ==="
rg -n "GetNativeOriginLSNForNode" --type go -B2 -A2
echo ""
echo "=== Searching for GetNativeSlotLSNForNode calls ==="
rg -n "GetNativeSlotLSNForNode" --type go -B2 -A2
```

Repository: pgEdge/ace
Length of output: 3270
🏁 Script executed:
```bash
#!/bin/bash
# Look for node name patterns and how they're structured
echo "=== Searching for node name usage patterns ==="
rg -n "nodeName|node_name" --type go | head -30
echo ""
echo "=== Check for subscription name creation/formatting ==="
rg -n "pg_%|subname.*node" --type go -B1 -A1 | head -40
```

Repository: pgEdge/ace
Length of output: 3408
🏁 Script executed:
```bash
#!/bin/bash
# Check the context around the template definitions to understand the parameter usage
echo "=== Looking at template definitions context ==="
sed -n '1540,1580p' db/queries/templates.go
```

Repository: pgEdge/ace
Length of output: 1903
🏁 Script executed:
```bash
#!/bin/bash
# View the actual function implementations
sed -n '931,960p' db/queries/queries.go
```

Repository: pgEdge/ace
Length of output: 957
🏁 Script executed:
```bash
#!/bin/bash
# Look for how subscriptions are named - check for patterns or documentation
rg -n "subname|subscription.*name" --type go db/queries/queries.go | head -20
```

Repository: pgEdge/ace
Length of output: 36
🏁 Script executed:
```bash
#!/bin/bash
# Search for any validation or naming conventions for nodes/subscriptions
rg -n "node.*name|subscription.*naming|replication.*subscription" --type md -B1 -A1
```

Repository: pgEdge/ace
Length of output: 2319
LIKE pattern matching can return incorrect subscription when node names are substrings of each other.
The queries use LIKE '%' || $1 || '%' which performs substring matching. If node names are substrings of one another (e.g., n1 and n10), the query for n1 will also match subscriptions containing n10. While LIMIT 1 ensures only one row is returned, it doesn't prevent selecting the wrong subscription. This risk is already acknowledged in the codebase with defensive comments at lines 2922 and 2943 of internal/consistency/repair/table_repair.go noting "Native PG queries may fail if subscription naming doesn't match."
Consider using exact matching (= instead of LIKE) if subscription naming conventions allow it, or implement more precise pattern matching that accounts for node name boundaries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/queries/templates.go` around lines 1551 - 1568, The LIKE '%' || $1 || '%'
substring matching in the templates GetNativeOriginLSNForNode and
GetNativeSlotLSNForNode can return wrong subscriptions when node names are
substrings of each other; update both WHERE clauses to use exact matching
(s.subname = $1) if subscription naming permits, otherwise replace the LIKE with
a stricter pattern that enforces boundaries (for example a regex match using
s.subname ~ ('\m' || $1 || '\M') or a LIKE that includes explicit delimiters
present in your subscription names) so the query selects the intended
subscription rather than any substring match.
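The failure mode is easy to reproduce outside SQL. The sketch below (the subscription names are invented for illustration; real names depend on the cluster's naming convention) mimics `LIKE '%' || $1 || '%'` with `strings.Contains` and contrasts it with a boundary-aware match:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical subscription names; "n1" is a substring of "n10".
var subs = []string{"sub_n10_to_n2", "sub_n1_to_n2"}

// likeMatch mimics SQL's LIKE '%' || node || '%' (plain substring match).
func likeMatch(node string) string {
	for _, s := range subs {
		if strings.Contains(s, node) {
			return s // first hit wins, like ORDER BY ... LIMIT 1
		}
	}
	return ""
}

// boundaryMatch requires the node name to be delimited by underscores
// or the string edges, so "n1" no longer matches "n10".
func boundaryMatch(node string) string {
	re := regexp.MustCompile(`(^|_)` + regexp.QuoteMeta(node) + `(_|$)`)
	for _, s := range subs {
		if re.MatchString(s) {
			return s
		}
	}
	return ""
}

func main() {
	fmt.Println(likeMatch("n1"))     // sub_n10_to_n2 — the wrong subscription
	fmt.Println(boundaryMatch("n1")) // sub_n1_to_n2
}
```

The same boundary idea carries over to SQL as `s.subname ~ ('\m' || $1 || '\M')` or an exact `s.subname = $1`, whichever the naming convention permits.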
	// Check if spock extension is installed (only on first node)
	if len(repsetNodeNames) == 0 {
		spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
		if err != nil {
			pool.Close()
			return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
		}
		if !spockInstalled {
			pool.Close()
			return fmt.Errorf("repset-diff requires the spock extension, which is not installed on this cluster")
		}
	}
Validate Spock before repset discovery on each node.
Tying the extension check to len(repsetNodeNames) means later nodes skip the prerequisite once the first repset-bearing node is found. That can fall through to CheckRepSetExists on nodes that do not even have Spock installed, which loses the clearer prerequisite error.
🛠 Suggested change
-	// Check if spock extension is installed (only on first node)
-	if len(repsetNodeNames) == 0 {
-		spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
-		if err != nil {
-			pool.Close()
-			return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
-		}
-		if !spockInstalled {
-			pool.Close()
-			return fmt.Errorf("repset-diff requires the spock extension, which is not installed on this cluster")
-		}
-	}
+	spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
+	if err != nil {
+		pool.Close()
+		return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
+	}
+	if !spockInstalled {
+		pool.Close()
+		return fmt.Errorf("repset-diff requires the spock extension, which is not installed on node %s", nodeName)
+	}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/consistency/diff/repset_diff.go` around lines 158 - 169, The spock
extension check should run for every node before proceeding to repset discovery:
remove the conditional that gates queries.CheckSpockInstalled on
len(repsetNodeNames) and call queries.CheckSpockInstalled(c.Ctx, pool) for each
node (using nodeName and pool), closing pool and returning a formatted error on
failure; if spock is not installed on that node return an error that includes
nodeName (instead of deferring to CheckRepSetExists), then continue to
CheckRepSetExists only after the per-node spock check succeeds.
	// Check if spock extension is installed
	for _, pool := range t.Pools {
		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
		if err != nil {
			return fmt.Errorf("failed to check for spock extension: %w", err)
		}
		if !spockInstalled {
			return fmt.Errorf("spock-diff requires the spock extension, which is not installed on this cluster")
		}
		break
	}
Run the Spock prerequisite on every selected node.
This only validates one pool, so a mixed or misconfigured cluster can get past the prerequisite and then fail later when GetSpockNodeAndSubInfo hits an unchecked node without spock. Check every pool and include the node name in the error.
🛠 Suggested change
-	// Check if spock extension is installed
-	for _, pool := range t.Pools {
-		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
+	// Check that spock is installed on every selected node.
+	for name, pool := range t.Pools {
+		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
 		if err != nil {
-			return fmt.Errorf("failed to check for spock extension: %w", err)
+			return fmt.Errorf("failed to check for spock extension on node %s: %w", name, err)
 		}
 		if !spockInstalled {
-			return fmt.Errorf("spock-diff requires the spock extension, which is not installed on this cluster")
+			return fmt.Errorf("spock-diff requires the spock extension, which is not installed on node %s", name)
 		}
-		break
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/consistency/diff/spock_diff.go` around lines 310 - 320, The loop
over t.Pools only validates the first pool then breaks, so some nodes may skip
the spock check and fail later in GetSpockNodeAndSubInfo; remove the premature
break and instead iterate every pool in t.Pools calling
queries.CheckSpockInstalled(t.Ctx, pool), and if it returns false return an
error that includes the pool/node identifier (e.g. pool.Name or pool.Node) so
the message pinpoints the specific node missing the spock extension; keep the
same error wrapping for query failures from CheckSpockInstalled.
// detectSpock checks whether the spock extension is installed on any available
// node and stores the result in t.spockAvailable.
func (t *TableRepairTask) detectSpock() {
	for nodeName, pool := range t.Pools {
		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
		if err != nil {
			logger.Warn("failed to detect spock extension on %s: %v", nodeName, err)
			continue
		}
		t.spockAvailable = spockInstalled
		logger.Info("spock extension detected: %v (checked on %s)", spockInstalled, nodeName)
		return
	}
	logger.Warn("could not detect spock extension on any node; assuming not available")
	t.spockAvailable = false
}
Don’t let map iteration choose cluster capability.
detectSpock() returns after the first successful check, and t.Pools is a map. On a partially migrated or inconsistent cluster, the task-wide flag depends on iteration order, so later code can either call spock.repair_mode() on vanilla nodes or skip it on Spock nodes. If mixed capability isn’t supported, fail fast instead of taking the first answer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/consistency/repair/table_repair.go` around lines 165 - 180,
detectSpock currently returns after the first successful CheckSpockInstalled and
uses map iteration order to set t.spockAvailable; change it to check every entry
in t.Pools (calling queries.CheckSpockInstalled for each), record whether any
node reports spockInstalled==true and any reports false, and if both true and
false are observed fail fast by returning an error (do not silently pick the
first result); otherwise set t.spockAvailable to the unanimous boolean. Change
detectSpock to return error and update its callers to handle that error; keep
logging for per-node failures but do not treat the first successful probe as
definitive.
	// Detect whether spock extension is available before any repair path
	if !t.DryRun {
		t.detectSpock()
	}
Detect Spock before validation-time recovery logic.
ValidateAndPrepare() can call autoSelectSourceOfTruth() in recovery mode, and that path reaches fetchLSNsForNode(). Because detection only happens here—after validation, and never in dry-run—those probes always see the zero-value spockAvailable=false and take the native LSN path on Spock clusters. Move detection earlier, once t.Pools exist, or make LSN fetching detect lazily.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/consistency/repair/table_repair.go` around lines 756 - 759, The code
detects Spock only after validation which causes
fetchLSNsForNode()/autoSelectSourceOfTruth() to assume spockAvailable=false;
move detection earlier so spockAvailable is set once t.Pools exists or make
fetching lazy. Concretely, call t.detectSpock() immediately after t.Pools is
populated (before calling ValidateAndPrepare()) or add a lazy guard inside
fetchLSNsForNode() that invokes detectSpock() the first time it runs (and
respects t.DryRun), ensuring detectSpock() sets spockAvailable before any
LSN-fetching logic runs.
    ports:
      - target: 5432
        published: 7432
Let Docker allocate the host ports.
These fixed published ports make the native stack fail whenever 7432/7433 are already taken or two integration runs overlap. tests/integration/native_pg_test.go:58-123 already resolves each container’s mapped port via MappedPort(), so the compose file can rely on ephemeral bindings instead.
🛠 Suggested change
   native-n1:
@@
     ports:
-      - target: 5432
-        published: 7432
+      - target: 5432
   native-n2:
@@
     ports:
-      - target: 5432
-        published: 7433
+      - target: 5432

Also applies to: 43-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/docker-compose-native.yaml` around lines 28 - 30, Remove
the fixed host port bindings so Docker can allocate ephemeral host ports: in the
docker-compose service definitions that currently include "ports: - target: 5432
published: 7432" (and the similar block at 43-45), delete the "published:
<port>" lines (or replace the full mapping with a single target-only mapping) so
only the container port is exposed and the tests rely on MappedPort() in
tests/integration/native_pg_test.go to discover the host port at runtime.
 	go func() {
 		defer wg.Done()
-		nodeInfo := pgCluster.ClusterNodes[0]
+		nodeInfo := env.ClusterNodes[0]
 		cdc.ListenForChanges(ctx, nodeInfo)
 	}()
Look up the CDC source node by name.
The task is built for env.ServiceN1, but the listener is attached to env.ClusterNodes[0]. That only works while slice order happens to match the service mapping. If a harness orders ClusterNodes differently, this goroutine will watch the wrong node and the env.N1Pool assertions become flaky.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/merkle_tree_test.go` around lines 476 - 480, The goroutine
attaches the CDC listener to env.ClusterNodes[0] which can be in arbitrary
order; instead find the node matching env.ServiceN1 and pass that to
cdc.ListenForChanges. Replace the hardcoded env.ClusterNodes[0] lookup with a
search over env.ClusterNodes for the node whose name/service equals
env.ServiceN1 (assign to nodeInfo), then call cdc.ListenForChanges(ctx,
nodeInfo) and keep defer wg.Done() as-is so the listener watches the intended
service.
We previously required spock in order to run comparisons. With this PR, schema and table diff/repair can be used with vanilla Postgres.