
Ability to use vanilla Postgres #97

Draft
mason-sharp wants to merge 5 commits into main from feature/ACE-178/spockless

Conversation

@mason-sharp
Member

We previously required Spock in order to run comparisons. With this PR, schema and table diff/repair can be used with vanilla Postgres.

mason-sharp and others added 4 commits March 29, 2026 17:57
Auto-detect whether spock is installed and branch accordingly across all
repair, diff, rerun, and merkle code paths:

- Add detectSpock() to table repair; conditionally call spock.repair_mode()
  only when spock is present, fall back to session_replication_role alone
- Replace spock.xact_commit_timestamp_origin() with native PG14+
  pg_xact_commit_timestamp_origin() in diff, rerun, and merkle queries
- Add GetNodeOriginNames() wrapper that uses spock.node when available,
  falls back to pg_replication_origin for node name resolution
- Add native PG alternatives for LSN queries (pg_subscription-based)
- Add CheckSpockInstalled() utility; spock-diff and repset-diff now
  return a clear error when spock is not installed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
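The detect-and-branch behavior described in this commit can be sketched as a small Go helper. The statement texts and the spock.repair_mode() function come from the commit message above; the helper's name and shape are illustrative, not the project's actual code:

```go
package main

import "fmt"

// repairModeStatements sketches the repair-path branching: the
// session_replication_role fallback is always used, and spock.repair_mode()
// is only invoked when the extension was detected. Illustrative only.
func repairModeStatements(spockAvailable bool) []string {
	stmts := []string{"SET session_replication_role = 'replica'"}
	if spockAvailable {
		// Only valid when the spock extension is installed.
		stmts = append(stmts, "SELECT spock.repair_mode(true)")
	}
	return stmts
}

func main() {
	fmt.Println(repairModeStatements(true))
	fmt.Println(repairModeStatements(false))
}
```

On a vanilla Postgres node the helper yields only the session_replication_role statement, which is the fallback path the commit describes.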
…tive PG

Introduce a testEnv struct that encapsulates environment-specific state
(pools, service names, cluster config, HasSpock flag), allowing the same
test logic to run against both spock-replicated and vanilla PostgreSQL
clusters with zero code duplication.

Refactor 17 test functions (9 repair, 8 diff) to accept *testEnv. Add
docker-compose-native.yaml with two postgres:17 containers and a
TestNativePG suite that exercises all applicable shared tests (40
subtests) against standalone PG nodes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
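A minimal sketch of the testEnv idea from this commit: only HasSpock, ServiceN1/ServiceN2, and ClusterNodes are named in the PR; the field types, constructor body, and service names here are assumptions:

```go
package main

import "fmt"

// testEnv is an illustrative shape for the environment abstraction: the
// same test logic receives a *testEnv and never cares whether the cluster
// behind it is spock-replicated or vanilla PostgreSQL.
type testEnv struct {
	HasSpock     bool     // whether the spock extension is available
	ServiceN1    string   // first node's service name (assumed value)
	ServiceN2    string   // second node's service name (assumed value)
	ClusterNodes []string // node identifiers for iteration
}

// newNativeEnv builds the vanilla-PostgreSQL variant; a newSpockEnv
// counterpart would set HasSpock to true.
func newNativeEnv() *testEnv {
	return &testEnv{
		HasSpock:     false,
		ServiceN1:    "n1",
		ServiceN2:    "n2",
		ClusterNodes: []string{"n1", "n2"},
	}
}

func main() {
	env := newNativeEnv()
	fmt.Println(env.HasSpock, env.ServiceN1, env.ServiceN2)
}
```

A shared test then branches on env.HasSpock only where behavior genuinely differs, which is what lets the same functions run against both cluster types.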
Replace hardcoded `spock.` schema prefix with a template function
`{{aceSchema}}` that reads from config at render time. Defaults to
"spock" for backward compatibility but allows alternate schemas
(e.g. "ace") via the existing mtree.schema config key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
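The {{aceSchema}} mechanism can be sketched with text/template's FuncMap. The default-to-"spock" behavior matches the commit message; the config plumbing and the table name in the example are stand-ins for the real mtree.schema lookup:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// aceTemplateFuncs returns a FuncMap whose aceSchema function resolves the
// schema at render time. An empty value falls back to "spock" for backward
// compatibility, mirroring the behavior described in the commit.
func aceTemplateFuncs(schema string) template.FuncMap {
	if schema == "" {
		schema = "spock"
	}
	return template.FuncMap{
		"aceSchema": func() string { return schema },
	}
}

func main() {
	// block_metadata is a hypothetical table name used for illustration.
	tmpl := template.Must(template.New("meta").
		Funcs(aceTemplateFuncs("")).
		Parse(`SELECT * FROM {{aceSchema}}.block_metadata`))
	var b strings.Builder
	if err := tmpl.Execute(&b, nil); err != nil {
		panic(err)
	}
	fmt.Println(b.String())
}
```

Passing "ace" instead of "" would render ace.block_metadata, which is the alternate-schema case the commit enables.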
Refactor all merkle tree test functions to accept *testEnv, replacing
pgCluster globals and spock.repair_mode() calls with env-aware
abstractions. All 28 merkle tree subtests (Init, Build, Diff, Merge,
Split, CDC, Teardown, NumericScaleInvariance) now run on both spock
and native PostgreSQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9643703d-ea36-4d39-97e5-86098e4e7f80

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR extends the tool to support both Spock and native PostgreSQL replication environments. It adds database query functions for native LSN/origin resolution, introduces SQL templates with schema-agnostic metadata handling, implements runtime Spock detection with conditional feature usage, and provides a comprehensive test environment abstraction and native PostgreSQL integration test suite.

Changes

Database Query Layer (db/queries/queries.go, db/queries/templates.go):
Added functions for native origin LSN fetching (GetNativeOriginLSNForNode, GetNativeSlotLSNForNode) and origin name resolution (GetNodeOriginNames, GetReplicationOriginNames, CheckSpockInstalled). Introduced aceTemplateFuncs to parameterize Spock schema references with the {{aceSchema}} placeholder in metadata/CDC templates; extended the Templates struct with native variant templates and a CompareBlocksSQL field.

Consistency Detection & Queries (internal/consistency/diff/repset_diff.go, internal/consistency/diff/spock_diff.go, internal/consistency/diff/table_diff.go, internal/consistency/diff/table_rerun.go, internal/consistency/mtree/merkle.go):
Added early Spock installation checks in diff tasks; switched origin lookups from Spock-specific to extension-aware via GetNodeOriginNames; replaced Spock schema function calls (spock.xact_commit_timestamp_origin) with the native PostgreSQL equivalent (pg_xact_commit_timestamp_origin).

Repair Logic (internal/consistency/repair/table_repair.go):
Added a detectSpock() probe and spockAvailable field; made repair mode conditional on Spock availability; introduced a disableSpockRepairMode() helper; updated LSN fetching to choose between Spock and native query functions based on availability; changed setupTransactionMode's return type to (bool, error).

Test Environment Abstraction (tests/integration/test_env_test.go, tests/integration/helpers_test.go, tests/integration/main_test.go, tests/integration/table_diff_test.go, tests/integration/table_repair_test.go, tests/integration/merkle_tree_test.go):
Introduced the testEnv struct and constructors (newSpockEnv, newNativeEnv) to unify environment setup across Spock and native scenarios. Refactored test helpers and suite functions to delegate logic to testEnv methods, reducing duplication and enabling shared test execution across both replication modes.

Native PostgreSQL Integration (tests/integration/docker-compose-native.yaml, tests/integration/native_pg_test.go):
Added a standalone Docker Compose configuration for native PostgreSQL 17 nodes with logical replication enabled. Implemented a comprehensive TestNativePG suite validating Spock absence, verifying that Spock-dependent tasks fail appropriately, and running shared table-diff/repair/merkle-tree tests against vanilla PostgreSQL clusters.

Poem

🐰 A rabbit hops through schema lands,
Now Spock and native both take stands!
With templates that dance and freely flow,
And tests that let both pathways glow,
We've built a bridge of perfect grace,
For data truth in every place! 🎯✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title 'Ability to use vanilla Postgres' directly and clearly summarizes the main change: enabling diff/repair functionality with vanilla PostgreSQL instead of requiring Spock.

Description check: ✅ Passed. The description explains that the PR removes the Spock requirement for comparisons and enables schema/table diff/repair with vanilla PostgreSQL, which aligns with the changeset.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/consistency/mtree/merkle.go (1)

501-520: ⚠️ Potential issue | 🟠 Major

Keep per-node origin-ID mappings instead of sharing one node's map.

buildFetchRowsSQL* now extracts the raw roident from pg_xact_commit_timestamp_origin(xmin), but loadSpockNodeNames() returns after loading just the first available node's origin names, then applies that single mapping to rows from every node via TranslateNodeOrigin(). Each PostgreSQL node maintains its own pg_replication_origin catalog, so the same numeric roident value may represent different origins on different nodes. This will cause node_origin to be mislabeled in Merkle diff output. Either maintain per-node mappings or ensure origin lookups happen per node as rows are processed.
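The per-node mapping the reviewer asks for can be sketched as follows; the type and function names are illustrative, not the project's identifiers:

```go
package main

import "fmt"

// originMaps keys origin-name lookups by node, since each PostgreSQL
// instance has its own pg_replication_origin catalog and the same numeric
// roident can name different origins on different nodes.
type originMaps map[string]map[uint32]string // node -> roident -> origin name

// translateOrigin resolves a roident against the catalog of the node the
// row actually came from, returning false when either lookup misses.
func translateOrigin(m originMaps, node string, roident uint32) (string, bool) {
	names, ok := m[node]
	if !ok {
		return "", false
	}
	name, ok := names[roident]
	return name, ok
}

func main() {
	// Hypothetical data: roident 1 means a different origin on each node.
	m := originMaps{
		"n1": {1: "origin_a"},
		"n2": {1: "origin_b"},
	}
	a, _ := translateOrigin(m, "n1", 1)
	b, _ := translateOrigin(m, "n2", 1)
	fmt.Println(a, b)
}
```

The example makes the failure mode concrete: sharing one node's map would label rows from n2 with origin_a.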

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/mtree/merkle.go` around lines 501 - 520,
loadSpockNodeNames currently stops after the first successful node and sets a
single SpockNodeNames map which is then (incorrectly) used for all nodes; change
the task to store per-node origin mappings (e.g. SpockNodeNames
map[string]map[uint32]string keyed by node identifier), have loadSpockNodeNames
populate the map for each ClusterNodes entry (do not return after first success
and cache failures per-node), and update places that call TranslateNodeOrigin
(and the code paths that use the raw roident produced by buildFetchRowsSQL*
functions) to look up the origin name from the per-node map for that specific
node. Ensure connections are closed on every iteration and errors are
collected/returned appropriately if a node’s mapping cannot be loaded.
internal/consistency/diff/table_diff.go (1)

186-209: ⚠️ Potential issue | 🟠 Major

roident from first pool reused across all nodes breaks filtering and metadata on native PostgreSQL.

loadSpockNodeNames() loads origin-to-name mappings from a single pool, while both buildEffectiveFilter() (line 255) and fetchRows() (line 449) use that mapping to filter rows and label metadata via pg_xact_commit_timestamp_origin(xmin). PostgreSQL documents that roident values are specific to each cluster's pg_replication_origin catalog—they differ across instances. Applying one node's roident mapping (e.g., roident = '1' from the first pool) to other nodes will filter wrong rows and mislabel node_origin in diff output, particularly in native replication setups where roident IDs are never globally synchronized.

This needs per-node origin resolution: either query each pool separately for its mappings, or translate roident to stable identifiers (like roname) within each node's query before comparison and serialization.

Also applies to: 253–256, 448–450; affects table_rerun.go ExecuteRerunTask path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/diff/table_diff.go` around lines 186 - 209,
loadSpockNodeNames currently queries only the first pool and reuses that
origin->name mapping across all nodes, which is incorrect for per-cluster
roident values; change TableDiffTask.loadSpockNodeNames to iterate t.Pools and
call queries.GetNodeOriginNames for each pool, storing the result keyed by pool
identifier (e.g., pool name or connection string) into a per-pool mapping
(replace t.SpockNodeNames map[string]string with map[string]map[string]string or
similar), then update callers buildEffectiveFilter and fetchRows (and
table_rerun.go's ExecuteRerunTask path) to look up the appropriate per-pool
mapping when filtering or labeling using pg_xact_commit_timestamp_origin(xmin)
so each node resolves roident->roname using its own pool's mapping.
internal/consistency/repair/table_repair.go (1)

195-204: ⚠️ Potential issue | 🔴 Critical

Add LOCAL keyword to make session_replication_role transaction-scoped.

SET session_replication_role persists after COMMIT on the session and will leak into subsequent transactions when the connection is returned to the pool. Use SET LOCAL session_replication_role instead to ensure it reverts at transaction end and does not affect later repairs on the same pooled connection.

Suggested fix
  if t.FireTriggers {
- 	_, err = tx.Exec(t.Ctx, "SET session_replication_role = 'local'")
+ 	_, err = tx.Exec(t.Ctx, "SET LOCAL session_replication_role = 'local'")
  } else {
- 	_, err = tx.Exec(t.Ctx, "SET session_replication_role = 'replica'")
+ 	_, err = tx.Exec(t.Ctx, "SET LOCAL session_replication_role = 'replica'")
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/repair/table_repair.go` around lines 195 - 204, The
session_replication_role is being set with a non-transaction-scoped command;
update the two tx.Exec calls in table_repair.go (the block using t.FireTriggers
and tx.Exec) to use "SET LOCAL session_replication_role = 'local'" and "SET
LOCAL session_replication_role = 'replica'" so the setting is transaction-scoped
and will revert at transaction end; keep the existing error handling (returning
fmt.Errorf(...) on err) and the logger.Debug call as-is
(logger.Debug("session_replication_role set on %s (fire_triggers: %v)",
nodeName, t.FireTriggers)).
🧹 Nitpick comments (5)
tests/integration/table_repair_test.go (3)

1239-1243: Multiple Spock-specific tests not refactored.

TestTableRepair_PreserveOrigin and related tests (FixNulls_PreserveOrigin, Bidirectional_PreserveOrigin, MixedOps_PreserveOrigin) directly use Spock-specific features without the testEnv abstraction. These tests should either be documented as Spock-only or have skip guards added for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/table_repair_test.go` around lines 1239 - 1243,
TestTableRepair_PreserveOrigin and its related tests (FixNulls_PreserveOrigin,
Bidirectional_PreserveOrigin, MixedOps_PreserveOrigin) use Spock-specific
features directly; update them to either be documented as Spock-only or add a
skip guard using the test environment helper (e.g., call testEnv.RequireSpock()
or t.SkipUnlessSpock() at the start of each test) so they only run when Spock is
available; modify the beginning of functions TestTableRepair_PreserveOrigin,
FixNulls_PreserveOrigin, Bidirectional_PreserveOrigin, and
MixedOps_PreserveOrigin to invoke the shared testEnv skip/check helper (or add
t.Skip with a clear message) before any Spock-specific setup.

5-5: Copyright year inconsistency.

This file uses 2023 - 2025 while other files in this PR use 2023 - 2026. Consider updating for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/table_repair_test.go` at line 5, Update the copyright
header in the file to match the project's year range by changing "2023 - 2025"
to "2023 - 2026" in the file header comment (the top-line copyright comment
string).

573-750: Inconsistent refactoring: test still uses pgCluster directly.

TestTableRepair_VariousDataTypes directly uses pgCluster.Node1Pool, spock.repair_mode(), and spock.repset_add_table() without the testEnv abstraction. This test will fail if run in a native PG environment.

If this test is intentionally Spock-only (due to repset usage), consider adding a skip guard or documenting it. Otherwise, consider refactoring it to use testEnv like the other tests.

♻️ Proposed fix: Add skip guard if Spock-only
 func TestTableRepair_VariousDataTypes(t *testing.T) {
+	// This test requires Spock for repset functionality
+	env := newSpockEnv()
+	if !env.HasSpock {
+		t.Skip("Skipping: requires Spock extension")
+	}
+
 	ctx := context.Background()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/table_repair_test.go` around lines 573 - 750, This test
uses Spock-specific APIs (pgCluster.Node1Pool/Node2Pool, spock.repair_mode,
spock.repset_add_table) but wasn’t guarded or refactored to testEnv; either add
an explicit skip at the top of TestTableRepair_VariousDataTypes (e.g., if
!testEnv.IsSpock() { t.Skip("spock-only test") }) or refactor the test to use
testEnv helpers (replace pgCluster.Node1Pool/Node2Pool with testEnv-provided
pools and replace direct spock SQL calls with testEnv helper methods for
repset_add_table and repair_mode) so the test runs correctly in non-Spock
environments.
tests/integration/test_env_test.go (2)

202-202: Unused parameter composite in setupDivergence.

The composite parameter is accepted but never used in the method body. Consider removing it if not needed, or document why it's retained for API consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_env_test.go` at line 202, The parameter `composite` on
function `setupDivergence` is unused; remove it from the function signature and
from all call sites (update any tests calling `setupDivergence(t, ctx,
qualifiedTableName, composite)` to `setupDivergence(t, ctx,
qualifiedTableName)`), or if you must preserve the API, explicitly mark it as
used by adding a no-op reference (for example `_ = composite`) inside
`setupDivergence` so linters/tests stop flagging it as unused; adjust
imports/usages accordingly.

357-357: Hardcoded sleep may cause flaky tests or slow execution.

The 2-second sleep before running the repair task lacks documentation explaining why it's needed. If it's working around a race condition, consider adding a comment. If it's no longer necessary, removing it would speed up tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_env_test.go` at line 357, Replace the hardcoded
time.Sleep(2 * time.Second) in the test with a deterministic wait: either poll
the expected condition (e.g., loop with time.Sleep(100*time.Millisecond) and a
timeout checking the repair task's observable effect or state) until success,
or, if the delay is truly required, add a clear comment above the time.Sleep
explaining why the pause is necessary and reference the specific repair
invocation being synchronized with; ensure any polling uses a reasonable timeout
and fails the test on expiry rather than hanging indefinitely.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@db/queries/queries.go`:
- Around line 989-999: GetNodeOriginNames currently chooses GetSpockNodeNames
based on the spock extension existing, which mismatches replication origin OIDs;
change it to detect the ID source by querying pg_replication_origin and deciding
the translator based on whether its roident values map to spock.node.node_id or
to pg_replication_origin identifiers. Concretely: replace the extname check with
a lightweight query that inspects pg_replication_origin (optionally joining
against spock.node) to see if roident corresponds to spock.node.node_id (use a
join like pg_replication_origin.roident::text = spock.node.node_id::text or
existence of matching rows); if that test indicates spock node IDs are the
source, call GetSpockNodeNames(ctx, db), otherwise call
GetReplicationOriginNames(ctx, db). Ensure you retain error handling from the
original GetNodeOriginNames.

In `@db/queries/templates.go`:
- Around line 1325-1332: The Templates struct is missing the CompareBlocksSQL
field referenced by the struct literal; add a field named CompareBlocksSQL of
type *template.Template to the Templates struct definition so the struct literal
that sets CompareBlocksSQL compiles (ensure the field name exactly matches
CompareBlocksSQL and the type matches other template fields like those around
native LSN templates).
- Around line 1551-1568: The LIKE '%' || $1 || '%' substring matching in the
templates GetNativeOriginLSNForNode and GetNativeSlotLSNForNode can return wrong
subscriptions when node names are substrings of each other; update both WHERE
clauses to use exact matching (s.subname = $1) if subscription naming permits,
otherwise replace the LIKE with a stricter pattern that enforces boundaries (for
example a regex match using s.subname ~ ('\m' || $1 || '\M') or a LIKE that
includes explicit delimiters present in your subscription names) so the query
selects the intended subscription rather than any substring match.

In `@internal/consistency/diff/repset_diff.go`:
- Around line 158-169: The spock extension check should run for every node
before proceeding to repset discovery: remove the conditional that gates
queries.CheckSpockInstalled on len(repsetNodeNames) and call
queries.CheckSpockInstalled(c.Ctx, pool) for each node (using nodeName and
pool), closing pool and returning a formatted error on failure; if spock is not
installed on that node return an error that includes nodeName (instead of
deferring to CheckRepSetExists), then continue to CheckRepSetExists only after
the per-node spock check succeeds.

In `@internal/consistency/diff/spock_diff.go`:
- Around line 310-320: The loop over t.Pools only validates the first pool then
breaks, so some nodes may skip the spock check and fail later in
GetSpockNodeAndSubInfo; remove the premature break and instead iterate every
pool in t.Pools calling queries.CheckSpockInstalled(t.Ctx, pool), and if it
returns false return an error that includes the pool/node identifier (e.g.
pool.Name or pool.Node) so the message pinpoints the specific node missing the
spock extension; keep the same error wrapping for query failures from
CheckSpockInstalled.

In `@internal/consistency/repair/table_repair.go`:
- Around line 756-759: The code detects Spock only after validation which causes
fetchLSNsForNode()/autoSelectSourceOfTruth() to assume spockAvailable=false;
move detection earlier so spockAvailable is set once t.Pools exists or make
fetching lazy. Concretely, call t.detectSpock() immediately after t.Pools is
populated (before calling ValidateAndPrepare()) or add a lazy guard inside
fetchLSNsForNode() that invokes detectSpock() the first time it runs (and
respects t.DryRun), ensuring detectSpock() sets spockAvailable before any
LSN-fetching logic runs.
- Around line 165-180: detectSpock currently returns after the first successful
CheckSpockInstalled and uses map iteration order to set t.spockAvailable; change
it to check every entry in t.Pools (calling queries.CheckSpockInstalled for
each), record whether any node reports spockInstalled==true and any reports
false, and if both true and false are observed fail fast by returning an error
(do not silently pick the first result); otherwise set t.spockAvailable to the
unanimous boolean. Change detectSpock to return error and update its callers to
handle that error; keep logging for per-node failures but do not treat the first
successful probe as definitive.

In `@tests/integration/docker-compose-native.yaml`:
- Around line 28-30: Remove the fixed host port bindings so Docker can allocate
ephemeral host ports: in the docker-compose service definitions that currently
include "ports: - target: 5432 published: 7432" (and the similar block at
43-45), delete the "published: <port>" lines (or replace the full mapping with a
single target-only mapping) so only the container port is exposed and the tests
rely on MappedPort() in tests/integration/native_pg_test.go to discover the host
port at runtime.

In `@tests/integration/merkle_tree_test.go`:
- Around line 476-480: The goroutine attaches the CDC listener to
env.ClusterNodes[0] which can be in arbitrary order; instead find the node
matching env.ServiceN1 and pass that to cdc.ListenForChanges. Replace the
hardcoded env.ClusterNodes[0] lookup with a search over env.ClusterNodes for the
node whose name/service equals env.ServiceN1 (assign to nodeInfo), then call
cdc.ListenForChanges(ctx, nodeInfo) and keep defer wg.Done() as-is so the
listener watches the intended service.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1333b0b9-2f32-4543-bbbc-6dd36810011a

📥 Commits

Reviewing files that changed from the base of the PR and between 005e1e0 and c9ea0d5.

📒 Files selected for processing (16)
  • db/queries/queries.go
  • db/queries/templates.go
  • internal/consistency/diff/repset_diff.go
  • internal/consistency/diff/spock_diff.go
  • internal/consistency/diff/table_diff.go
  • internal/consistency/diff/table_rerun.go
  • internal/consistency/mtree/merkle.go
  • internal/consistency/repair/table_repair.go
  • tests/integration/docker-compose-native.yaml
  • tests/integration/helpers_test.go
  • tests/integration/main_test.go
  • tests/integration/merkle_tree_test.go
  • tests/integration/native_pg_test.go
  • tests/integration/table_diff_test.go
  • tests/integration/table_repair_test.go
  • tests/integration/test_env_test.go

Comment on lines +989 to +999
func GetNodeOriginNames(ctx context.Context, db DBQuerier) (map[string]string, error) {
	var spockAvailable bool
	err := db.QueryRow(ctx, "SELECT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'spock')").Scan(&spockAvailable)
	if err != nil {
		return nil, fmt.Errorf("detecting spock extension: %w", err)
	}
	if spockAvailable {
		return GetSpockNodeNames(ctx, db)
	}
	return GetReplicationOriginNames(ctx, db)
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PostgreSQL, what identifier does pg_xact_commit_timestamp_origin() return: the replication origin OID from pg_replication_origin.roident, or an extension-specific node ID such as Spock's spock.node.node_id?

💡 Result:

pg_xact_commit_timestamp_origin returns the replication origin OID from pg_replication_origin.roident.
Fix GetNodeOriginNames to select translators by ID source, not extension presence.

pg_xact_commit_timestamp_origin() returns replication origin OIDs from pg_replication_origin.roident, not Spock-specific node IDs. On Spock clusters, selecting GetSpockNodeNames() purely because the extension exists will use a map keyed by spock.node.node_id—a different namespace than the replication origin OIDs being passed in. This causes lookup failures. The translator should be chosen based on where the IDs actually come from, not whether Spock is installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/queries/queries.go` around lines 989 - 999, GetNodeOriginNames currently
chooses GetSpockNodeNames based on the spock extension existing, which
mismatches replication origin OIDs; change it to detect the ID source by
querying pg_replication_origin and deciding the translator based on whether its
roident values map to spock.node.node_id or to pg_replication_origin
identifiers. Concretely: replace the extname check with a lightweight query that
inspects pg_replication_origin (optionally joining against spock.node) to see if
roident corresponds to spock.node.node_id (use a join like
pg_replication_origin.roident::text = spock.node.node_id::text or existence of
matching rows); if that test indicates spock node IDs are the source, call
GetSpockNodeNames(ctx, db), otherwise call GetReplicationOriginNames(ctx, db).
Ensure you retain error handling from the original GetNodeOriginNames.
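The decision logic the prompt describes can be sketched in isolation. This is a hypothetical illustration, not the repository's code: `chooseTranslator` and its parameters are invented names, and the real fix would query `pg_replication_origin` and `spock.node` rather than take in-memory sets.

```go
package main

import "fmt"

// chooseTranslator sketches the suggested fix: pick the name translator
// based on which namespace the incoming IDs actually belong to, not on
// whether the spock extension is installed. originIDs stand in for values
// returned by pg_xact_commit_timestamp_origin() (i.e. roident values);
// spockNodeIDs stands in for the set of spock.node.node_id values.
func chooseTranslator(originIDs []uint32, spockNodeIDs map[uint32]bool) string {
	if len(spockNodeIDs) == 0 {
		return "replication_origin"
	}
	// Only use the spock.node map when every observed ID resolves in it;
	// a single miss means the IDs are roidents, not spock node IDs.
	for _, id := range originIDs {
		if !spockNodeIDs[id] {
			return "replication_origin"
		}
	}
	return "spock_node"
}

func main() {
	spockIDs := map[uint32]bool{1: true, 2: true}
	fmt.Println(chooseTranslator([]uint32{1, 2}, spockIDs)) // spock_node
	fmt.Println(chooseTranslator([]uint32{7, 8}, spockIDs)) // replication_origin
	fmt.Println(chooseTranslator([]uint32{1}, nil))         // replication_origin
}
```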

Comment on lines +1325 to +1332
CompareBlocksSQL: template.Must(template.New("compareBlocksSQL").Parse(`
SELECT
*
FROM
{{.TableName}}
WHERE
{{.WhereClause}}
`)),

⚠️ Potential issue | 🔴 Critical

Build failure: CompareBlocksSQL field not declared in Templates struct.

The static analysis and pipeline failures confirm that CompareBlocksSQL is used in the struct literal but not declared in the Templates struct (lines 26-145). This causes a compilation error.

🐛 Proposed fix: Add the field declaration

Add the following line to the Templates struct (around line 131, before the native LSN templates):

 	GetSpockSlotLSNForNode           *template.Template
+	CompareBlocksSQL                 *template.Template
 	GetNativeOriginLSNForNode        *template.Template
🧰 Tools
🪛 GitHub Actions: Go Integration Tests

[error] 1325-1325: Go test failed: unknown field CompareBlocksSQL in struct literal of type Templates

🪛 GitHub Check: test

[failure] 1325-1325:
unknown field CompareBlocksSQL in struct literal of type Templates

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/queries/templates.go` around lines 1325 - 1332, The Templates struct is
missing the CompareBlocksSQL field referenced by the struct literal; add a field
named CompareBlocksSQL of type *template.Template to the Templates struct
definition so the struct literal that sets CompareBlocksSQL compiles (ensure the
field name exactly matches CompareBlocksSQL and the type matches other template
fields like those around native LSN templates).

Comment on lines +1551 to +1568
GetNativeOriginLSNForNode: template.Must(template.New("getNativeOriginLSNForNode").Parse(`
SELECT ros.remote_lsn::text
FROM pg_catalog.pg_replication_origin_status ros
JOIN pg_catalog.pg_replication_origin ro ON ro.roident = ros.local_id
JOIN pg_catalog.pg_subscription s ON ro.roname LIKE 'pg_%' || s.oid::text
WHERE s.subname LIKE '%' || $1 || '%'
AND ros.remote_lsn IS NOT NULL
LIMIT 1
`)),
GetNativeSlotLSNForNode: template.Must(template.New("getNativeSlotLSNForNode").Parse(`
SELECT rs.confirmed_flush_lsn::text
FROM pg_catalog.pg_replication_slots rs
JOIN pg_catalog.pg_subscription s ON rs.slot_name = s.subslotname
WHERE s.subname LIKE '%' || $1 || '%'
AND rs.confirmed_flush_lsn IS NOT NULL
ORDER BY rs.confirmed_flush_lsn DESC
LIMIT 1
`)),

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check how subscription names are structured to understand if loose matching is safe

# Search for subscription name creation/usage patterns
rg -n "subname|sub_name|subscription" --type go -C3 | head -80

Repository: pgEdge/ace

Length of output: 4751


🏁 Script executed:

#!/bin/bash
# Find where GetNativeOriginLSNForNode and GetNativeSlotLSNForNode are called
echo "=== Searching for GetNativeOriginLSNForNode calls ==="
rg -n "GetNativeOriginLSNForNode" --type go -B2 -A2

echo ""
echo "=== Searching for GetNativeSlotLSNForNode calls ==="
rg -n "GetNativeSlotLSNForNode" --type go -B2 -A2

Repository: pgEdge/ace

Length of output: 3270


🏁 Script executed:

#!/bin/bash
# Look for node name patterns and how they're structured
echo "=== Searching for node name usage patterns ==="
rg -n "nodeName|node_name" --type go | head -30

echo ""
echo "=== Check for subscription name creation/formatting ==="
rg -n "pg_%|subname.*node" --type go -B1 -A1 | head -40

Repository: pgEdge/ace

Length of output: 3408


🏁 Script executed:

#!/bin/bash
# Check the context around the template definitions to understand the parameter usage
echo "=== Looking at template definitions context ==="
sed -n '1540,1580p' db/queries/templates.go

Repository: pgEdge/ace

Length of output: 1903


🏁 Script executed:

#!/bin/bash
# View the actual function implementations
sed -n '931,960p' db/queries/queries.go

Repository: pgEdge/ace

Length of output: 957


🏁 Script executed:

#!/bin/bash
# Look for how subscriptions are named - check for patterns or documentation
rg -n "subname|subscription.*name" --type go db/queries/queries.go | head -20

Repository: pgEdge/ace

Length of output: 36


🏁 Script executed:

#!/bin/bash
# Search for any validation or naming conventions for nodes/subscriptions
rg -n "node.*name|subscription.*naming|replication.*subscription" --type md -B1 -A1

Repository: pgEdge/ace

Length of output: 2319


LIKE pattern matching can return incorrect subscription when node names are substrings of each other.

The queries use LIKE '%' || $1 || '%' which performs substring matching. If node names are substrings of one another (e.g., n1 and n10), the query for n1 will also match subscriptions containing n10. While LIMIT 1 ensures only one row is returned, it doesn't prevent selecting the wrong subscription. This risk is already acknowledged in the codebase with defensive comments at lines 2922 and 2943 of internal/consistency/repair/table_repair.go noting "Native PG queries may fail if subscription naming doesn't match."

Consider using exact matching (= instead of LIKE) if subscription naming conventions allow it, or implement more precise pattern matching that accounts for node name boundaries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/queries/templates.go` around lines 1551 - 1568, The LIKE '%' || $1 || '%'
substring matching in the templates GetNativeOriginLSNForNode and
GetNativeSlotLSNForNode can return wrong subscriptions when node names are
substrings of each other; update both WHERE clauses to use exact matching
(s.subname = $1) if subscription naming permits, otherwise replace the LIKE with
a stricter pattern that enforces boundaries (for example a regex match using
s.subname ~ ('\m' || $1 || '\M') or a LIKE that includes explicit delimiters
present in your subscription names) so the query selects the intended
subscription rather than any substring match.
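The substring-collision hazard is easy to demonstrate in isolation. A minimal sketch, assuming a hypothetical `sub_<node>` naming convention (the real subscription names are not established by this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesLoose mirrors the SQL predicate subname LIKE '%' || node || '%',
// while matchesExact is the stricter alternative the review suggests.
// The "sub_" prefix is an assumed naming convention for illustration only.
func matchesLoose(subname, node string) bool { return strings.Contains(subname, node) }
func matchesExact(subname, node string) bool { return subname == "sub_"+node }

func main() {
	for _, s := range []string{"sub_n1", "sub_n10"} {
		// Loose matching for node "n1" also hits the n10 subscription.
		fmt.Printf("%s: loose(n1)=%v exact(n1)=%v\n",
			s, matchesLoose(s, "n1"), matchesExact(s, "n1"))
	}
}
```

Here `sub_n10` satisfies the loose predicate for `n1` but not the exact one, which is precisely the wrong-subscription case the finding describes.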

Comment on lines +158 to +169
// Check if spock extension is installed (only on first node)
if len(repsetNodeNames) == 0 {
spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
if err != nil {
pool.Close()
return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
}
if !spockInstalled {
pool.Close()
return fmt.Errorf("repset-diff requires the spock extension, which is not installed on this cluster")
}
}

⚠️ Potential issue | 🟠 Major

Validate Spock before repset discovery on each node.

Tying the extension check to len(repsetNodeNames) means later nodes skip the prerequisite once the first repset-bearing node is found. That can fall through to CheckRepSetExists on nodes that do not even have Spock installed, which loses the clearer prerequisite error.

🛠 Suggested change
-		// Check if spock extension is installed (only on first node)
-		if len(repsetNodeNames) == 0 {
-			spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
-			if err != nil {
-				pool.Close()
-				return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
-			}
-			if !spockInstalled {
-				pool.Close()
-				return fmt.Errorf("repset-diff requires the spock extension, which is not installed on this cluster")
-			}
-		}
+		spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
+		if err != nil {
+			pool.Close()
+			return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
+		}
+		if !spockInstalled {
+			pool.Close()
+			return fmt.Errorf("repset-diff requires the spock extension, which is not installed on node %s", nodeName)
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Check if spock extension is installed (only on first node)
if len(repsetNodeNames) == 0 {
spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
if err != nil {
pool.Close()
return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
}
if !spockInstalled {
pool.Close()
return fmt.Errorf("repset-diff requires the spock extension, which is not installed on this cluster")
}
}
// Check if spock extension is installed
spockInstalled, err := queries.CheckSpockInstalled(c.Ctx, pool)
if err != nil {
pool.Close()
return fmt.Errorf("failed to check for spock extension on node %s: %w", nodeName, err)
}
if !spockInstalled {
pool.Close()
return fmt.Errorf("repset-diff requires the spock extension, which is not installed on node %s", nodeName)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/diff/repset_diff.go` around lines 158 - 169, The spock
extension check should run for every node before proceeding to repset discovery:
remove the conditional that gates queries.CheckSpockInstalled on
len(repsetNodeNames) and call queries.CheckSpockInstalled(c.Ctx, pool) for each
node (using nodeName and pool), closing pool and returning a formatted error on
failure; if spock is not installed on that node return an error that includes
nodeName (instead of deferring to CheckRepSetExists), then continue to
CheckRepSetExists only after the per-node spock check succeeds.

Comment on lines +310 to +320
// Check if spock extension is installed
for _, pool := range t.Pools {
spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
if err != nil {
return fmt.Errorf("failed to check for spock extension: %w", err)
}
if !spockInstalled {
return fmt.Errorf("spock-diff requires the spock extension, which is not installed on this cluster")
}
break
}

⚠️ Potential issue | 🟠 Major

Run the Spock prerequisite on every selected node.

This only validates one pool, so a mixed or misconfigured cluster can get past the prerequisite and then fail later when GetSpockNodeAndSubInfo hits an unchecked node without spock. Check every pool and include the node name in the error.

🛠 Suggested change
-	// Check if spock extension is installed
-	for _, pool := range t.Pools {
-		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
+	// Check that spock is installed on every selected node.
+	for name, pool := range t.Pools {
+		spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
 		if err != nil {
-			return fmt.Errorf("failed to check for spock extension: %w", err)
+			return fmt.Errorf("failed to check for spock extension on node %s: %w", name, err)
 		}
 		if !spockInstalled {
-			return fmt.Errorf("spock-diff requires the spock extension, which is not installed on this cluster")
+			return fmt.Errorf("spock-diff requires the spock extension, which is not installed on node %s", name)
 		}
-		break
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Check if spock extension is installed
for _, pool := range t.Pools {
spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
if err != nil {
return fmt.Errorf("failed to check for spock extension: %w", err)
}
if !spockInstalled {
return fmt.Errorf("spock-diff requires the spock extension, which is not installed on this cluster")
}
break
}
// Check that spock is installed on every selected node.
for name, pool := range t.Pools {
spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
if err != nil {
return fmt.Errorf("failed to check for spock extension on node %s: %w", name, err)
}
if !spockInstalled {
return fmt.Errorf("spock-diff requires the spock extension, which is not installed on node %s", name)
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/diff/spock_diff.go` around lines 310 - 320, The loop
over t.Pools only validates the first pool then breaks, so some nodes may skip
the spock check and fail later in GetSpockNodeAndSubInfo; remove the premature
break and instead iterate every pool in t.Pools calling
queries.CheckSpockInstalled(t.Ctx, pool), and if it returns false return an
error that includes the pool/node identifier (e.g. pool.Name or pool.Node) so
the message pinpoints the specific node missing the spock extension; keep the
same error wrapping for query failures from CheckSpockInstalled.

Comment on lines +165 to +180
// detectSpock checks whether the spock extension is installed on any available
// node and stores the result in t.spockAvailable.
func (t *TableRepairTask) detectSpock() {
for nodeName, pool := range t.Pools {
spockInstalled, err := queries.CheckSpockInstalled(t.Ctx, pool)
if err != nil {
logger.Warn("failed to detect spock extension on %s: %v", nodeName, err)
continue
}
t.spockAvailable = spockInstalled
logger.Info("spock extension detected: %v (checked on %s)", spockInstalled, nodeName)
return
}
logger.Warn("could not detect spock extension on any node; assuming not available")
t.spockAvailable = false
}

⚠️ Potential issue | 🟠 Major

Don’t let map iteration choose cluster capability.

detectSpock() returns after the first successful check, and t.Pools is a map. On a partially migrated or inconsistent cluster, the task-wide flag depends on iteration order, so later code can either call spock.repair_mode() on vanilla nodes or skip it on Spock nodes. If mixed capability isn’t supported, fail fast instead of taking the first answer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/repair/table_repair.go` around lines 165 - 180,
detectSpock currently returns after the first successful CheckSpockInstalled and
uses map iteration order to set t.spockAvailable; change it to check every entry
in t.Pools (calling queries.CheckSpockInstalled for each), record whether any
node reports spockInstalled==true and any reports false, and if both true and
false are observed fail fast by returning an error (do not silently pick the
first result); otherwise set t.spockAvailable to the unanimous boolean. Change
detectSpock to return error and update its callers to handle that error; keep
logging for per-node failures but do not treat the first successful probe as
definitive.
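The unanimity check the prompt asks for can be separated from the database probing. A sketch under the assumption that mixed capability is unsupported; `detectSpockUnanimous` and its input map are illustrative stand-ins for per-node CheckSpockInstalled results:

```go
package main

import (
	"errors"
	"fmt"
)

// detectSpockUnanimous probes every node's result rather than trusting
// map iteration order, and fails fast on a mixed-capability cluster.
func detectSpockUnanimous(results map[string]bool) (bool, error) {
	var sawTrue, sawFalse bool
	for _, installed := range results {
		if installed {
			sawTrue = true
		} else {
			sawFalse = true
		}
	}
	if sawTrue && sawFalse {
		return false, errors.New("mixed spock availability across nodes; refusing to guess")
	}
	// No nodes probed is treated as "not available", matching the
	// existing fallback in detectSpock.
	return sawTrue, nil
}

func main() {
	ok, err := detectSpockUnanimous(map[string]bool{"n1": true, "n2": true})
	fmt.Println(ok, err) // true <nil>
	_, err = detectSpockUnanimous(map[string]bool{"n1": true, "n2": false})
	fmt.Println(err != nil) // true
}
```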

Comment on lines +756 to +759
// Detect whether spock extension is available before any repair path
if !t.DryRun {
t.detectSpock()
}

⚠️ Potential issue | 🔴 Critical

Detect Spock before validation-time recovery logic.

ValidateAndPrepare() can call autoSelectSourceOfTruth() in recovery mode, and that path reaches fetchLSNsForNode(). Because detection only happens here, after validation and never in dry-run, those probes always see the zero-value spockAvailable=false and take the native LSN path on Spock clusters. Move detection earlier, once t.Pools exist, or make LSN fetching detect lazily.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/consistency/repair/table_repair.go` around lines 756 - 759, The code
detects Spock only after validation which causes
fetchLSNsForNode()/autoSelectSourceOfTruth() to assume spockAvailable=false;
move detection earlier so spockAvailable is set once t.Pools exists or make
fetching lazy. Concretely, call t.detectSpock() immediately after t.Pools is
populated (before calling ValidateAndPrepare()) or add a lazy guard inside
fetchLSNsForNode() that invokes detectSpock() the first time it runs (and
respects t.DryRun), ensuring detectSpock() sets spockAvailable before any
LSN-fetching logic runs.
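The "detect lazily" alternative from the prompt can be captured with a sync.Once guard. This is a sketch with invented names; the real change would wire `probe` to CheckSpockInstalled and hang the detector off TableRepairTask:

```go
package main

import (
	"fmt"
	"sync"
)

// lazyDetector runs the probe exactly once, on first use, so the flag is
// populated before any code path that consults it, including
// recovery-mode LSN fetches that run during validation.
type lazyDetector struct {
	once      sync.Once
	available bool
	probe     func() bool // stand-in for the real CheckSpockInstalled round-trip
}

func (d *lazyDetector) spockAvailable() bool {
	d.once.Do(func() { d.available = d.probe() })
	return d.available
}

func main() {
	calls := 0
	d := &lazyDetector{probe: func() bool { calls++; return true }}
	// Repeated reads reuse the cached answer; the probe fires once.
	fmt.Println(d.spockAvailable(), d.spockAvailable(), calls) // true true 1
}
```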

Comment on lines +28 to +30
ports:
- target: 5432
published: 7432

⚠️ Potential issue | 🟠 Major

Let Docker allocate the host ports.

These fixed published ports make the native stack fail whenever 7432/7433 are already taken or two integration runs overlap. tests/integration/native_pg_test.go:58-123 already resolves each container’s mapped port via MappedPort(), so the compose file can rely on ephemeral bindings instead.

🛠 Suggested change
   native-n1:
@@
     ports:
-      - target: 5432
-        published: 7432
+      - target: 5432
   native-n2:
@@
     ports:
-      - target: 5432
-        published: 7433
+      - target: 5432

Also applies to: 43-45

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/docker-compose-native.yaml` around lines 28 - 30, Remove
the fixed host port bindings so Docker can allocate ephemeral host ports: in the
docker-compose service definitions that currently include "ports: - target: 5432
published: 7432" (and the similar block at 43-45), delete the "published:
<port>" lines (or replace the full mapping with a single target-only mapping) so
only the container port is exposed and the tests rely on MappedPort() in
tests/integration/native_pg_test.go to discover the host port at runtime.

Comment on lines 476 to 480
go func() {
defer wg.Done()
-	nodeInfo := pgCluster.ClusterNodes[0]
+	nodeInfo := env.ClusterNodes[0]
cdc.ListenForChanges(ctx, nodeInfo)
}()

⚠️ Potential issue | 🟠 Major

Look up the CDC source node by name.

The task is built for env.ServiceN1, but the listener is attached to env.ClusterNodes[0]. That only works while slice order happens to match the service mapping. If a harness orders ClusterNodes differently, this goroutine will watch the wrong node and the env.N1Pool assertions become flaky.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/merkle_tree_test.go` around lines 476 - 480, The goroutine
attaches the CDC listener to env.ClusterNodes[0] which can be in arbitrary
order; instead find the node matching env.ServiceN1 and pass that to
cdc.ListenForChanges. Replace the hardcoded env.ClusterNodes[0] lookup with a
search over env.ClusterNodes for the node whose name/service equals
env.ServiceN1 (assign to nodeInfo), then call cdc.ListenForChanges(ctx,
nodeInfo) and keep defer wg.Done() as-is so the listener watches the intended
service.
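The name-based lookup the prompt describes is a small helper. `nodeInfo` and `findNodeByName` here are illustrative stand-ins for the harness types, not the repository's definitions:

```go
package main

import "fmt"

// nodeInfo stands in for the test harness's cluster-node type.
type nodeInfo struct{ Name string }

// findNodeByName resolves the CDC source by name instead of relying on
// slice position, so reordering ClusterNodes cannot silently attach the
// listener to the wrong node.
func findNodeByName(nodes []nodeInfo, name string) (nodeInfo, bool) {
	for _, n := range nodes {
		if n.Name == name {
			return n, true
		}
	}
	return nodeInfo{}, false
}

func main() {
	nodes := []nodeInfo{{"n2"}, {"n1"}} // order no longer matters
	if n, ok := findNodeByName(nodes, "n1"); ok {
		fmt.Println("listening on", n.Name)
	}
}
```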

@mason-sharp mason-sharp marked this pull request as draft March 30, 2026 19:12