
Refactor transaction timestamp management for cross shard data#313

Merged
bootjp merged 21 commits into main from feature/multi-shard-transaction
Feb 16, 2026

Conversation


bootjp (Owner) commented Feb 15, 2026

No description provided.

Copilot AI review requested due to automatic review settings February 15, 2026 11:42
gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @bootjp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the core transaction timestamp management within the system, enabling robust distributed transaction capabilities. It introduces explicit phases for transactions (prepare, commit, abort) and manages their state using dedicated metadata and internal keys. The changes ensure consistent timestamping across transactional operations and provide mechanisms for automatic lock resolution during data access, significantly enhancing the system's transactional integrity and scalability.

Highlights

  • Distributed Transaction Support: Introduced comprehensive support for distributed transactions across multiple shards, including prewrite, commit, and abort phases, orchestrated by the ShardedCoordinator.
  • Refactored Timestamp Management: Significantly refactored how transaction timestamps are managed, separating raw request timestamping from transactional timestamping, and introducing explicit commit and abort timestamps.
  • Transaction Metadata and Internal Keys: Implemented new data structures (TxnMeta, txnLock, txnIntent) and encoding/decoding logic for transaction metadata, along with a system of internal keys (!txn|lock|, !txn|int|, etc.) to manage transaction state within the key-value store (a sketch of this key scheme follows this list).
  • MVCC Store Enhancements: Updated the MVCC store to correctly handle versioned values, ensuring they are inserted in sorted order by timestamp and improving the alignCommitTS logic.
  • Transaction Lock Resolution: Added logic to ShardStore to automatically resolve transaction locks during reads and scans, determining the transaction's status (committed, rolled back, or pending) and applying the appropriate resolution.
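
A compact sketch helps picture the internal-key scheme above. This is illustrative only: the prefixes are the ones this PR names (kv/txn_keys.go), but the helper shapes are assumptions and may differ from the merged code.

package kv

import "bytes"

const (
    txnLockPrefix   = "!txn|lock|"
    txnIntentPrefix = "!txn|int|"
    txnMetaPrefix   = "!txn|meta|"
)

// txnLockKey embeds the logical user key after the prefix, so that
// routeKey (kv/shard_key.go) can recover it for shard routing.
func txnLockKey(userKey []byte) []byte {
    return append([]byte(txnLockPrefix), userKey...)
}

// isTxnInternalKey reports whether a key belongs to the transaction
// machinery and must be hidden from user-facing reads and scans.
func isTxnInternalKey(key []byte) bool {
    return bytes.HasPrefix(key, []byte("!txn|"))
}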
Changelog
  • adapter/internal.go
    • Imported the bytes package.
    • Refactored stampTimestamps into stampRawTimestamps and stampTxnTimestamps for clearer timestamp assignment logic.
    • Added helper functions forwardedTxnStartTS, forwardedTxnMetaMutation, and fillForwardedTxnCommitTS to manage transaction metadata and commit timestamps for forwarded requests.
  • kv/coordinator.go
    • Imported the sort package.
    • Modified dispatchTxn to incorporate primary key, commit timestamp, and lock TTL into transaction requests.
    • Updated redirect to use the newly introduced primaryKeyForElems for transaction requests.
    • Refactored txnRequests to construct prepare and commit mutations with transaction metadata, primary key, and commit timestamp.
    • Added primaryKeyForElems function to determine the primary key from a list of elements.
    • Added bytesCompare utility function for byte slice comparison.
  • kv/fsm.go
    • Imported the bytes package.
    • Modified Apply to correctly extract the commit timestamp from transaction metadata for commit and abort phases.
    • Added validation in handleRawRequest to prevent mutations of transaction-internal keys and check for conflicting locks.
    • Updated handleCommitRequest and handleAbortRequest to utilize transaction metadata and pass commitTS for aborts.
    • Introduced uniqueMutations to filter duplicate mutations within a request.
    • Added buildPrepareStoreMutations, buildCommitStoreMutations, and buildAbortCleanupStoreMutations to construct store-level mutations for different transaction phases.
    • Implemented appendRollbackRecord to record transaction rollbacks.
    • Added numerous helper functions for transaction-specific operations, including prepareTxnMutation, txnIntentFromPBMutation, txnCleanupMutations, txnLockForCommit, txnIntentForCommit, storeMutationForIntent, commitTxnKeyMutations, shouldClearAbortKey, splitTxnMeta, and assertNoConflictingTxnLock.
  • kv/fsm_occ_test.go
    • Updated a test case to align with the new transaction prepare phase and metadata structure, changing a commit attempt to a prewrite attempt.
  • kv/hlc_wall.go
    • Added hlcWallNow function to retrieve the current wall clock time.
    • Added hlcWallFromNowMs function to calculate a future timestamp based on a millisecond delta.
  • kv/shard_key.go
    • Modified routeKey to extract the embedded logical user key from transaction-internal keys for correct shard routing.
  • kv/shard_store.go
    • Refactored GetAt to use localGetAt and leaderGetAt for improved read path logic.
    • Added isVerifiedRaftLeader helper to check for verified Raft leadership.
    • Introduced leaderGetAt to handle transaction lock resolution for non-transactional keys during leader reads.
    • Added localGetAt for direct access to the underlying store.
    • Modified scanRouteAt to apply resolveScanLocks to scan results.
    • Implemented maybeResolveTxnLock to resolve transaction locks during reads.
    • Added resolveScanLocks to filter transaction-internal keys and resolve locks during scans.
    • Introduced txnStatus enum and primaryTxnStatus function to determine the state of a primary transaction.
    • Added txnCommitTS, hasTxnRollback, and loadTxnLock for querying transaction records.
    • Implemented bestEffortAbortPrimary and applyTxnResolution for transaction cleanup and resolution.
    • Added cleanupTS to determine an appropriate timestamp for transaction cleanup.
  • kv/sharded_coordinator.go
    • Imported the bytes package.
    • Modified Dispatch to handle transactional requests via a new dispatchTxn method.
    • Added dispatchTxn to orchestrate distributed transactions through prewrite, primary commit, and secondary commit phases (see the flow sketch at the end of this changelog).
    • Introduced preparedGroup struct to track prepared transaction groups.
    • Added prewriteTxn for the prepare phase of distributed transactions.
    • Added commitPrimaryTxn for committing the primary key of a distributed transaction.
    • Added commitSecondaryTxns for committing secondary keys across shards.
    • Implemented abortPreparedTxn for rolling back prepared transactions.
    • Added txnGroupForID to retrieve a transaction group by ID.
    • Added nextTxnTSAfter to generate the next transaction timestamp after a given start timestamp.
    • Added abortTSFrom to derive an abort timestamp from a commit timestamp.
    • Added txnMetaMutation to create transaction metadata mutations.
    • Added engineGroupIDForKey to determine the group ID for a given key.
    • Updated txnLogs to handle single-shard transactions using the new metadata structure.
    • Added primaryKeyFromMutations to identify the primary key from a set of mutations.
    • Added keyMutations to extract unique keys from mutations.
  • kv/sharded_integration_test.go
    • Updated TestShardedCoordinatorDispatch_RejectsCrossShardTxn to reflect the new distributed transaction behavior, asserting the outcome of a cross-shard transaction rather than its rejection.
  • kv/transaction.go
    • Modified Commit to conditionally attempt transactional cleanup (Abort) only for actual transactional batches.
    • Modified Abort to properly construct abort requests using extracted transaction metadata and a derived abort timestamp.
    • Added extractTxnMeta function to parse transaction metadata from a list of mutations.
  • kv/txn_codec.go
    • Added TxnMeta struct for transaction metadata and implemented its EncodeTxnMeta and DecodeTxnMeta functions.
    • Added txnLock struct for transaction locks and implemented its encodeTxnLock and decodeTxnLock functions.
    • Added txnIntent struct for transaction intents and implemented its encodeTxnIntent and decodeTxnIntent functions.
    • Implemented encodeTxnCommitRecord and decodeTxnCommitRecord for transaction commit records.
    • Implemented encodeTxnRollbackRecord for transaction rollback records.
    • Added u64ToInt helper function for safe uint64 to int conversion.
  • kv/txn_consts.go
    • Added constants for defaultTxnLockTTLms and various mutation factors for transaction prepare, commit, and abort phases.
  • kv/txn_errors.go
    • Added new error types specific to transaction operations, such as ErrTxnMetaMissing, ErrTxnInvalidMeta, ErrTxnLocked, ErrTxnCommitTSRequired, ErrTxnAlreadyCommitted, and ErrTxnPrimaryKeyRequired.
  • kv/txn_keys.go
    • Defined prefixes for transaction-internal keys (txnLockPrefix, txnIntentPrefix, txnCommitPrefix, txnRollbackPrefix, txnMetaPrefix).
    • Implemented functions to construct transaction-internal keys (txnLockKey, txnIntentKey, txnCommitKey, txnRollbackKey).
    • Added isTxnInternalKey and isTxnMetaKey functions to identify transaction-related keys.
    • Added txnRouteKey to extract the logical user key from transaction-internal keys for routing purposes.
  • store/lsm_store.go
    • Removed logic in alignCommitTS that forced the commit timestamp to be strictly greater than the last commit timestamp plus one.
  • store/mvcc_store.go
    • Imported the sort package.
    • Modified putVersionLocked and deleteVersionLocked to use insertVersionSorted for maintaining sorted versions.
    • Added insertVersionSorted function to insert versioned values into a sorted slice.
    • Modified alignCommitTS to update lastCommitTS only if the new commitTS is greater.
  • store/mvcc_store_prop_test.go
    • Updated a property test for Delete to ensure the deletion timestamp (delTS) is strictly after the actual put timestamp (actualPutTS).
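
Taken together, these coordinator changes describe a prewrite, commit-primary, commit-secondaries pipeline. The self-contained sketch below shows that ordering in the abstract; every name in it is an illustrative stand-in, not the PR's actual dispatchTxn code.

package kv

import (
    "context"
    "fmt"
)

type phaseFn func(ctx context.Context) error

// runTwoPhaseCommit mirrors the ordering the changelog describes: all shards
// prewrite first, the primary key commits next (the atomic commit point), and
// secondaries commit last. A secondary that fails here is not fatal, because
// readers can resolve its lock later from the primary's commit record.
func runTwoPhaseCommit(ctx context.Context, prewrites []phaseFn, commitPrimary phaseFn, commitSecondaries []phaseFn, abort phaseFn) error {
    for _, prewrite := range prewrites {
        if err := prewrite(ctx); err != nil {
            _ = abort(ctx) // best-effort rollback of already-prepared groups
            return fmt.Errorf("prewrite failed: %w", err)
        }
    }
    if err := commitPrimary(ctx); err != nil {
        _ = abort(ctx) // nothing is durable until the primary commits
        return fmt.Errorf("primary commit failed: %w", err)
    }
    for _, commit := range commitSecondaries {
        _ = commit(ctx) // stragglers are resolved lazily at read time
    }
    return nil
}

The key design property is that the primary's commit record is the single source of truth: once it exists, the transaction is logically committed even if secondary locks linger.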

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a major and impressive refactoring to support distributed transactions using a 2PC-like protocol. The changes are extensive, touching coordination, state machine logic, and storage layers. The introduction of explicit transaction metadata, locks, and intents is well-structured into new txn_* files. The overall design moves timestamp management to the coordinator level, which is a solid choice for a distributed system.

I've found a few areas for improvement:

  • Some code duplication that can be refactored.
  • A couple of performance issues, one of which is significant for scans.
  • A potential bug in primary key selection logic.

Addressing these points will make the new transaction implementation more robust and performant. Great work on this complex feature.

Copilot AI (Contributor) left a comment


Pull request overview

This PR refactors transaction timestamp handling and adds MVCC-level plumbing for transactional metadata/locks so commits/aborts can carry an explicit commit timestamp and reads/scans can resolve transactional locks.

Changes:

  • Remove “auto-bump” commit timestamp alignment in MVCC stores and keep MVCC versions ordered by timestamp.
  • Introduce transaction internal keys/codecs (meta/lock/intent/commit/rollback) and propagate commit timestamps through coordinator/FSM paths.
  • Add shard-store read/scan lock-resolution behavior and update integration/property tests for the new semantics (a resolution sketch follows this list).
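
The read-time lock resolution in that last bullet boils down to a three-way decision against the primary's records. A minimal sketch, assuming variant names (the PR's txnStatus values may be spelled differently):

type txnStatus int

const (
    txnStatusPending txnStatus = iota
    txnStatusCommitted
    txnStatusRolledBack
)

// describeResolution sketches the three outcomes a reader acts on after
// checking the primary key's commit/rollback records.
func describeResolution(status txnStatus) string {
    switch status {
    case txnStatusCommitted:
        // Roll the intent forward at the primary's commit timestamp.
        return "commit"
    case txnStatusRolledBack:
        // Discard the intent and clear the lock.
        return "rollback"
    default:
        // Surface ErrTxnLocked, or best-effort abort the primary if the
        // lock's TTL has expired.
        return "pending"
    }
}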

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Summary per file:

  • store/mvcc_store_prop_test.go: Ensures delete timestamps are strictly after puts in property tests.
  • store/mvcc_store.go: Keeps per-key version lists sorted; adjusts commitTS alignment behavior.
  • store/lsm_store.go: Removes commitTS auto-bump; relies on max-tracking for LastCommitTS.
  • kv/txn_keys.go: Adds txn internal key prefixes, key builders, and routing extraction.
  • kv/txn_errors.go: Adds typed txn errors for meta/locking/commitTS validation.
  • kv/txn_consts.go: Adds txn constants (TTL default, mutation sizing factors).
  • kv/txn_codec.go: Adds encoding/decoding for txn meta, locks, intents, and records.
  • kv/transaction.go: Updates commit/abort behavior and txn meta extraction for cleanup.
  • kv/sharded_integration_test.go: Updates cross-shard txn expectations (now succeeds) but needs renaming.
  • kv/sharded_coordinator.go: Implements distributed txn dispatch (prewrite/commit/abort helpers).
  • kv/shard_store.go: Adds read-time lock resolution and filters txn-internal keys from scans.
  • kv/shard_key.go: Extends routeKey() to route txn-internal keys by embedded user key.
  • kv/hlc_wall.go: Adds wall-time helpers for TTL/cleanup timestamp derivation.
  • kv/fsm_occ_test.go: Updates OCC conflict test to match prepare-phase conflicts.
  • kv/fsm.go: Implements prepare/commit/abort txn phases with locks/intents and commitTS enforcement.
  • kv/coordinator.go: Refactors txn request construction to include meta + explicit commitTS.
  • adapter/internal.go: Centralizes timestamp stamping for forwarded raw/txn requests and fills commitTS.


Comment on lines 295 to 299
func (s *ShardStore) maybeResolveTxnLock(ctx context.Context, g *ShardGroup, key []byte, readTS uint64) error {
    // Only consider locks visible at the read timestamp.
    lockBytes, err := g.Store.GetAt(ctx, txnLockKey(key), readTS)
    if err != nil {
        if errors.Is(err, store.ErrKeyNotFound) {

Copilot AI Feb 15, 2026


The new lock-resolution path introduced here (maybeResolveTxnLock and related helpers) changes observable read semantics (auto-resolving committed/rolled-back txns, returning ErrTxnLocked, filtering txn-internal keys on scans), but there are no focused tests exercising these behaviors. Please add unit/integration coverage for pending-lock reads, commit-driven resolution, and scan filtering of !txn|... keys.
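
For illustration, a focused test of the pending-lock read path might look like the following sketch; newTestShardStore and putPendingLock are hypothetical scaffolding, not helpers from this PR.

func TestGetAt_PendingLockReturnsErrTxnLocked(t *testing.T) {
    s, g := newTestShardStore(t)                  // hypothetical test fixture
    putPendingLock(t, g, []byte("k"), 10 /* startTS */) // hypothetical helper

    _, err := s.GetAt(context.Background(), []byte("k"), 20 /* readTS */)
    if !errors.Is(err, ErrTxnLocked) {
        t.Fatalf("expected ErrTxnLocked for a pending lock, got %v", err)
    }
}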

@bootjp bootjp requested a review from Copilot February 15, 2026 15:54

bootjp commented Feb 15, 2026

/gemini review

bootjp changed the title from "Refactor transaction timestamp management" to "Refactor transaction timestamp management for cross shard data" on Feb 15, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a major refactoring to implement a distributed transaction protocol, likely based on Percolator. The changes are extensive, touching coordination, state machine logic, storage, and adapters. The introduction of 2-phase commit (prewrite/commit), transaction metadata, lock resolution on read, and sharded transaction coordination is a significant step towards a robust distributed KV store. The code is well-structured with new concepts cleanly separated into new files (e.g., txn_codec.go, txn_keys.go). My review includes one suggestion for a minor performance improvement in the primary key selection logic.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.




func (c *ShardedCoordinator) nextTxnTSAfter(startTS uint64) uint64 {
    if c.clock == nil {
        return startTS + 1

Copilot AI Feb 15, 2026


The function nextTxnTSAfter doesn't check for overflow when c.clock is nil. If startTS equals ^uint64(0) (the maximum uint64 value), then startTS + 1 will overflow to 0. This could cause issues when comparing timestamps. Consider adding an overflow check similar to the one in abortTSFrom or documenting that timestamps near uint64 max are not supported.

Suggested change:

-    return startTS + 1
+    nextTS := startTS + 1
+    if nextTS == 0 {
+        return startTS
+    }
+    return nextTS

}

func (i *Internal) fillForwardedTxnCommitTS(reqs []*pb.Request, startTS uint64) {
    const metaPrefix = "!txn|meta|"

Copilot AI Feb 15, 2026


The transaction metadata prefix is duplicated here as a local constant. This constant is already defined in kv/txn_keys.go as txnMetaPrefix = "!txn|meta|". Consider importing and using the constant from kv package instead of duplicating it to maintain consistency and avoid potential issues if the prefix changes.


bootjp commented Feb 15, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a major refactoring of the transaction handling mechanism, moving from a simple optimistic concurrency control model to a full-fledged two-phase commit (2PC) protocol to support cross-shard transactions. This is a significant and complex change that touches many parts of the system, including the coordinator, the Raft FSM, the storage layer, and the Redis adapter.

The implementation appears to be well-structured and correct, introducing new concepts like transaction metadata, locks, intents, and commit/rollback records. The logic for the 2PC coordinator, read-time lock resolution, and timestamp management seems sound. The code is well-modularized with new files for codecs, key formats, and constants, which improves maintainability.

I have one suggestion regarding a potential performance issue in the scan lock resolution logic, which could lead to an N+1 query problem. Otherwise, this is an excellent piece of work.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.



Comment on lines 253 to 258
func abortTSFrom(commitTS uint64) uint64 {
    abortTS := commitTS + 1
    if abortTS == 0 {
        return commitTS
    }
    return abortTS

Copilot AI Feb 15, 2026


abortTSFrom returns commitTS when commitTS+1 overflows, which can yield abortTS == commitTS and potentially <= startTS. That makes ABORT requests invalid (kvFSM requires abortTS > startTS) and can strand locks. Prefer to detect overflow and return an explicit error (or choose a different cleanup timestamp strategy) instead of returning an invalid/ambiguous abort timestamp.
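
One way to make the overflow explicit, as this comment suggests; a sketch with an invented name, not the PR's code:

func abortTSFromChecked(commitTS uint64) (uint64, error) {
    abortTS := commitTS + 1
    if abortTS == 0 { // commitTS == ^uint64(0): the increment wrapped around
        return 0, errors.New("abort timestamp overflow")
    }
    return abortTS, nil
}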

    var requests []*pb.Request
    if reqs.IsTxn {
        requests = txnRequests(reqs.StartTS, reqs.Elems)
        primary := primaryKeyForElems(reqs.Elems)

Copilot AI Feb 15, 2026


In the follower redirect path, a txn with all-empty keys will generate a TxnMeta with an empty PrimaryKey and forward it to the leader. The leader will later fail with ErrTxnPrimaryKeyRequired, but the leader path returns that error earlier. To keep behavior consistent (and avoid forwarding obviously-invalid txns), validate that primaryKeyForElems(reqs.Elems) is non-empty here and return ErrTxnPrimaryKeyRequired/ErrInvalidRequest before calling txnRequests.

Suggested change:

     primary := primaryKeyForElems(reqs.Elems)
+    if len(primary) == 0 {
+        return nil, errors.WithStack(ErrTxnPrimaryKeyRequired)
+    }

@bootjp bootjp requested a review from Copilot February 16, 2026 18:59

bootjp commented Feb 16, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment


Code Review

The pull request refactors transaction timestamp management to support cross-shard data using a 2PC-like approach with locks and intents (Percolator-style). This is a significant improvement over the previous OCC-only model, providing better consistency guarantees for distributed operations. The implementation of lock resolution at read time and the distributed coordinator logic are well-structured. However, there are some opportunities for optimization in the lock resolution path during scans and a potential correctness issue regarding ts=0 handling in proxied reads that should be addressed to ensure reliability on follower nodes.

Comment on lines 360 to 368
    locked, err := scanKeyLockedAt(ctx, g, kvp.Key, ts)
    if err != nil {
        return nil, false, err
    }
    if !locked {
        return kvp, false, nil
    }

    v, err := s.leaderGetAt(ctx, g, kvp.Key, ts)
gemini-code-assist bot commented (severity: medium)

There is a redundant storage lookup here. scanKeyLockedAt performs a GetAt on the transaction lock key. If a lock is found, leaderGetAt is called, which eventually calls maybeResolveTxnLock, which performs the exact same GetAt on the lock key to retrieve the lock metadata. This doubles the number of storage lookups for every locked key encountered during a scan. Consider refactoring scanKeyLockedAt to return the lock data if present, or merging the logic into resolveScanKVP.
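
A sketch of the suggested refactor; loadLockAt is an invented name, and the call shapes are assumed from the snippet above:

// loadLockAt fetches the lock record once; callers hand lockBytes straight to
// the resolver instead of re-reading the same key in maybeResolveTxnLock.
func loadLockAt(ctx context.Context, g *ShardGroup, key []byte, ts uint64) ([]byte, bool, error) {
    lockBytes, err := g.Store.GetAt(ctx, txnLockKey(key), ts)
    if errors.Is(err, store.ErrKeyNotFound) {
        return nil, false, nil // no lock visible at this read timestamp
    }
    if err != nil {
        return nil, false, err
    }
    return lockBytes, true, nil
}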

Comment on lines 486 to 505
     primaryKey := primaryKeyFromMutations(muts)
     if len(primaryKey) == 0 {
         return nil, errors.WithStack(ErrTxnPrimaryKeyRequired)
     }
     logs = append(logs,
-        &pb.Request{IsTxn: true, Phase: pb.Phase_PREPARE, Ts: startTS, Mutations: muts},
-        &pb.Request{IsTxn: true, Phase: pb.Phase_COMMIT, Ts: startTS, Mutations: muts},
+        &pb.Request{
+            IsTxn: true,
+            Phase: pb.Phase_PREPARE,
+            Ts:    startTS,
+            Mutations: append([]*pb.Mutation{
+                {Op: pb.Op_PUT, Key: []byte(txnMetaPrefix), Value: EncodeTxnMeta(TxnMeta{PrimaryKey: primaryKey, LockTTLms: defaultTxnLockTTLms, CommitTS: 0})},
+            }, muts...),
+        },
+        &pb.Request{
+            IsTxn: true,
+            Phase: pb.Phase_COMMIT,
+            Ts:    startTS,
+            Mutations: append([]*pb.Mutation{
+                {Op: pb.Op_PUT, Key: []byte(txnMetaPrefix), Value: EncodeTxnMeta(TxnMeta{PrimaryKey: primaryKey, LockTTLms: 0, CommitTS: commitTS})},
+            }, keyMutations(muts)...),
gemini-code-assist bot commented (severity: medium)

In buildTxnLogs, the code performs two separate passes over the mutations: one in primaryKeyFromMutations and another in keyMutations. Both functions use a map to deduplicate keys. These can be combined into a single pass to reduce overhead, especially for large transaction batches.
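
A single-pass version might look like the sketch below. The primary-key policy shown (first unique key wins) is an assumption; the PR's own selection logic may differ.

func primaryKeyAndKeyMutationsSketch(muts []*pb.Mutation) ([]byte, []*pb.Mutation) {
    seen := make(map[string]struct{}, len(muts))
    var primary []byte
    out := make([]*pb.Mutation, 0, len(muts))
    for _, m := range muts {
        if m == nil || len(m.Key) == 0 {
            continue
        }
        if _, ok := seen[string(m.Key)]; ok {
            continue
        }
        seen[string(m.Key)] = struct{}{}
        if primary == nil {
            primary = m.Key // assumed policy: first unique key is the primary
        }
        out = append(out, &pb.Mutation{Op: pb.Op_PUT, Key: m.Key})
    }
    return primary, out
}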

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.



Comment on lines 159 to 171
    seen := make(map[string]struct{}, len(muts))
    out := make([]*pb.Mutation, 0, len(muts))
    for _, mut := range muts {
        if mut == nil || len(mut.Key) == 0 {
            return nil, errors.WithStack(ErrInvalidRequest)
        }
        keyStr := string(mut.Key)
        if _, ok := seen[keyStr]; ok {
            continue
        }
        seen[keyStr] = struct{}{}
        out = append(out, mut)
    }

Copilot AI Feb 16, 2026


uniqueMutations silently drops subsequent mutations for the same key (it keeps the first occurrence). If a transactional batch contains multiple operations on the same key (e.g., Put then Del), the later operation will be ignored, leading to incorrect committed state. Consider either rejecting duplicate keys in a txn with a clear error, or preserving the last mutation per key ("last write wins") when deduplicating.
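
A last-write-wins variant could look like this sketch (same validation as the original; only the deduplication policy changes):

func uniqueMutationsLastWins(muts []*pb.Mutation) ([]*pb.Mutation, error) {
    last := make(map[string]int, len(muts))
    out := make([]*pb.Mutation, 0, len(muts))
    for _, mut := range muts {
        if mut == nil || len(mut.Key) == 0 {
            return nil, errors.WithStack(ErrInvalidRequest)
        }
        if i, ok := last[string(mut.Key)]; ok {
            out[i] = mut // overwrite the earlier op for this key in place
            continue
        }
        last[string(mut.Key)] = len(out)
        out = append(out, mut)
    }
    return out, nil
}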

        // Already resolved (committed/rolled back).
        return nil, nil
    }
    if lock.StartTS != startTS {

Copilot AI Feb 16, 2026


commitTxnKeyMutations only validates lock.StartTS but ignores lock.PrimaryKey, so a COMMIT request could accidentally (or via malformed input) commit/clean up a lock that belongs to a different primary as long as startTS matches. Since the lock encoding includes PrimaryKey, consider passing the expected primary key into commitTxnKeyMutations (and similarly for abort cleanup) and verifying it matches before applying/cleaning up the lock/intent.

Suggested change:

-    if lock.StartTS != startTS {
+    if lock.StartTS != startTS || !bytes.Equal(lock.PrimaryKey, key) {

@bootjp bootjp requested a review from Copilot February 16, 2026 19:49
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request is a significant and impressive refactoring of the transaction system, moving from a simple single-node OCC model to a full-fledged, sharded 2PC (Percolator-style) distributed transaction protocol. The changes are extensive, touching everything from the storage layer and FSM to the coordinator and client-facing adapters.

The new implementation appears robust and well-thought-out, correctly handling timestamp management with HLC, the two-phase commit protocol, and read-time lock resolution for failed or pending transactions. The addition of detailed tests for the new complex logic is also commendable.

I have one suggestion for improving robustness in the single-node coordinator's timestamp logic, which aligns with the principles of MVCC for consistent snapshot views. Overall, this is an excellent piece of work that greatly enhances the system's capabilities.

Comment on lines +110 to +114
    commitTS := c.clock.Next()
    if commitTS <= startTS {
        c.clock.Observe(startTS)
        commitTS = c.clock.Next()
    }
gemini-code-assist bot commented (severity: medium)

The logic to determine the commitTS is mostly correct, ensuring it's greater than startTS. However, there's a rare edge case where c.clock.Next() after c.clock.Observe(startTS) might still not produce a timestamp greater than startTS (e.g., if the logical component of the HLC overflows).

The sharded coordinator (kv/sharded_coordinator.go) includes a final defensive check for this. For consistency and improved robustness, I suggest adding a similar check here.

Suggested change:

     commitTS := c.clock.Next()
     if commitTS <= startTS {
         c.clock.Observe(startTS)
         commitTS = c.clock.Next()
     }
+    if commitTS <= startTS {
+        return nil, errors.WithStack(ErrTxnCommitTSRequired)
+    }
References
  1. In an MVCC system, all operations within a single transaction must share the same start timestamp to present a consistent snapshot. The proposed commitTS > startTS check preserves correct timestamp ordering within that model, which is what guarantees atomicity of the commit.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.



Comment on lines +415 to +419
func (s *ShardStore) planScanLockItem(ctx context.Context, g *ShardGroup, ts uint64, plan *scanLockPlan, kvp *store.KVPair) error {
    if kvp == nil || isTxnInternalKey(kvp.Key) {
        plan.items = append(plan.items, scanItem{skip: true})
        return nil
    }

Copilot AI Feb 16, 2026


ScanAt lock handling skips all txn-internal keys from the underlying store scan. That means pending locks for keys with no previously committed user-key version can be missed (the scan may return only the !txn|lock|... entry, which gets skipped), so ScanAt may incorrectly return no error / incomplete results instead of ErrTxnLocked or resolving the lock. Consider detecting txn lock keys in the scan results (e.g., handle the !txn|lock| prefix by decoding the lock and mapping it back to the embedded user key) or performing a separate lock-key scan over the same range before filtering internal keys.
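
A sketch of the alternative described here, assuming lock keys simply append the user key to the prefix (as the routing code suggests); the function name is invented:

// userKeyFromScanEntry recognizes a txn lock key in raw scan results and
// recovers the embedded user key, so a pending lock is visible even when the
// user key has no committed version in the scanned range.
func userKeyFromScanEntry(key []byte) (userKey []byte, isLock bool) {
    if bytes.HasPrefix(key, []byte(txnLockPrefix)) {
        return key[len(txnLockPrefix):], true
    }
    return nil, false
}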

adapter/redis.go (Outdated)
Comment on lines 1178 to 1180
    if resp.Value == nil {
        return nil, errors.WithStack(store.ErrKeyNotFound)
    }

Copilot AI Feb 16, 2026


RawGet/tryLeaderGetAt currently interprets a nil resp.Value as "key not found". In proto3, an explicitly-set empty byte slice and an unset bytes field are indistinguishable on the wire, so this can make it impossible to reliably store/read empty values across gRPC (an empty value may round-trip as nil and be treated as not found). If empty values should be supported, consider adding an explicit "exists" (or "found") flag to RawGetResponse / GetResponse or returning a NotFound gRPC status instead of encoding non-existence as a nil bytes field.

Suggested change (delete these lines):

-    if resp.Value == nil {
-        return nil, errors.WithStack(store.ErrKeyNotFound)
-    }

@bootjp bootjp requested a review from Copilot February 16, 2026 20:28

bootjp commented Feb 16, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a major refactoring of the transaction management system, moving from a simple optimistic concurrency control (OCC) model to a more robust two-phase commit (2PC) protocol. This change is crucial for supporting cross-shard transactions and ensuring data consistency in a distributed environment.

The changes are extensive, touching coordination, state machine logic, and storage adapters, and they introduce new concepts like transaction metadata, locks, and intents. The new implementation appears well thought out, with careful handling of timestamps (using an HLC), lock resolution on read, and proper routing of transactional operations. New tests cover the new 2PC logic, including overflow conditions and multi-key transaction semantics.

My main feedback is a minor suggestion to refactor some duplicated code for improved maintainability. Overall, this is a significant and well-executed improvement to the system's transactional capabilities.

Comment on lines +541 to 556
func keyMutations(muts []*pb.Mutation) []*pb.Mutation {
    out := make([]*pb.Mutation, 0, len(muts))
    seen := map[string]struct{}{}
    for _, m := range muts {
        if m == nil || len(m.Key) == 0 {
            continue
        }
        k := string(m.Key)
        if _, ok := seen[k]; ok {
            continue
        }
        seen[k] = struct{}{}
        out = append(out, &pb.Mutation{Op: pb.Op_PUT, Key: m.Key})
    }
    return out
}
gemini-code-assist bot commented (severity: medium)

The logic in this function for iterating through mutations and collecting unique keys is very similar to the logic in primaryKeyAndKeyMutations. This code duplication could be reduced to improve maintainability. Consider extracting a common helper function or refactoring one function to call the other.
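
One shape for the suggested deduplication; a sketch with invented helper names, not the PR's code:

// forEachUniqueKey runs fn once per unique, non-empty key, in input order.
// Both keyMutations and primaryKeyAndKeyMutations could build on it.
func forEachUniqueKey(muts []*pb.Mutation, fn func(m *pb.Mutation)) {
    seen := map[string]struct{}{}
    for _, m := range muts {
        if m == nil || len(m.Key) == 0 {
            continue
        }
        if _, ok := seen[string(m.Key)]; ok {
            continue
        }
        seen[string(m.Key)] = struct{}{}
        fn(m)
    }
}

// keyMutations then reduces to a thin wrapper.
func keyMutationsShared(muts []*pb.Mutation) []*pb.Mutation {
    out := make([]*pb.Mutation, 0, len(muts))
    forEachUniqueKey(muts, func(m *pb.Mutation) {
        out = append(out, &pb.Mutation{Op: pb.Op_PUT, Key: m.Key})
    })
    return out
}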

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.



Comment on lines +157 to +179
     if g.Raft == nil {
         kvs, err := g.Store.ScanAt(ctx, start, end, limit, ts)
         if err != nil {
             return nil, errors.WithStack(err)
         }
-        return kvs, nil
+        // Keep ScanAt behavior consistent even when running without raft.
+        return filterTxnInternalKVs(kvs), nil
     }

     // Reads should come from the shard's leader to avoid returning stale or
     // incomplete results when this node is a follower for a given shard.
-    if g.Raft.State() == raft.Leader {
-        if err := g.Raft.VerifyLeader().Error(); err == nil {
-            kvs, err := g.Store.ScanAt(ctx, start, end, limit, ts)
-            if err != nil {
-                return nil, errors.WithStack(err)
-            }
-            return kvs, nil
-        }
-    }
+    if isVerifiedRaftLeader(g.Raft) {
+        return s.scanRouteAtLeader(ctx, g, start, end, limit, ts)
+    }

     kvs, err := s.proxyRawScanAt(ctx, g, start, end, limit, ts)
     if err != nil {
         return nil, err
     }
-    return kvs, nil
+    // The leader's RawScanAt is expected to perform lock resolution and filtering
+    // via ShardStore.ScanAt, so avoid N+1 proxy gets here.
+    return filterTxnInternalKVs(kvs), nil
 }

Copilot AI Feb 16, 2026


ScanAt filters txn-internal keys after calling the underlying store's ScanAt with the user-provided limit. Because the underlying ScanAt counts internal keys toward the limit, filtering afterward can return fewer than limit user keys (or even an empty result if the first N keys are internal). Consider over-fetching (loop/paginate until you collect limit non-internal keys or exhaust the range) or moving the filtering into the store scan so internal keys don't consume the limit.
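
An over-fetching loop along these lines could address it. A sketch only: scanUserKeys is an invented name, limit is assumed to be an int, and the cursor trick assumes inclusive-start scans.

func scanUserKeys(ctx context.Context, g *ShardGroup, start, end []byte, limit int, ts uint64) ([]*store.KVPair, error) {
    out := make([]*store.KVPair, 0, limit)
    cursor := start
    for len(out) < limit {
        batch, err := g.Store.ScanAt(ctx, cursor, end, limit, ts)
        if err != nil {
            return nil, err
        }
        if len(batch) == 0 {
            break // range exhausted
        }
        for _, kvp := range batch {
            if !isTxnInternalKey(kvp.Key) {
                out = append(out, kvp)
                if len(out) == limit {
                    break
                }
            }
        }
        // Resume just past the last key seen; appending a zero byte makes the
        // next inclusive-start scan strictly greater than that key.
        cursor = append(append([]byte{}, batch[len(batch)-1].Key...), 0)
    }
    return out, nil
}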

@bootjp bootjp requested a review from Copilot February 16, 2026 21:13

bootjp commented Feb 16, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment


Code Review

This is a substantial pull request that introduces distributed transactions across shards using a 2PC-like protocol. The changes are extensive and touch many parts of the system, from the gRPC layer to the storage FSM. The implementation of the 2PC protocol, including prewrite/commit phases and read-time lock resolution, appears well-thought-out and is a significant step forward for the system's capabilities. The addition of comprehensive tests for the new transaction logic is also commendable.

My review focuses on a couple of areas for potential improvement regarding code duplication and the retry strategy for secondary commits. Overall, this is a very impressive piece of work.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 1 comment.



Comment on lines 862 to 868
func cleanupTS(startTS uint64) uint64 {
    now := hlcWallNow()
    next := startTS + 1
    if now > next {
        return now
    }
    return next

Copilot AI Feb 16, 2026


cleanupTS can overflow when startTS==^uint64(0), returning a value (hlcWallNow) that may be <= startTS. That makes ABORT resolution requests invalid (kvFSM requires resolveTS > startTS), potentially leaving locks permanently unresolved. Consider handling the overflow explicitly (e.g., detect startTS==max and return 0/error, or reuse abortTSFrom to compute a strictly-greater timestamp) so lock cleanup never emits a non-monotonic resolve timestamp.

bootjp (Owner, Author) commented:

@copilot open a new pull request to apply changes based on this feedback


Copilot AI commented Feb 16, 2026

@bootjp I've opened a new pull request, #314, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 7 commits February 16, 2026 21:25
  • Replace cleanupTS with abortTSFrom to handle overflow when startTS==^uint64(0). Add explicit overflow checks to prevent invalid abort timestamps that violate the FSM requirement that resolveTS > startTS, which could leave locks permanently unresolved.
  • Pass commitTS from primaryTxnStatus to abortTSFrom for semantic clarity and consistency with usage elsewhere in the codebase.
  • Explain why startTS is passed twice to abortTSFrom: no commitTS is available when aborting an expired lock with no commit record.
  • Use 0 instead of startTS as the second parameter to abortTSFrom to explicitly indicate that no commitTS is available in this context.
  • Remove extra blank lines left after deleting the cleanupTS function to fix the golangci-lint gci formatting check.
  • Clarify that while uint64 overflow is not expected in normal operation, the overflow checks are defensive programming for the edge case where startTS==^uint64(0) or a bug causes overflow, preventing violation of the FSM invariant resolveTS > startTS.
  • Fix timestamp overflow in lock cleanup causing permanent lock leaks.

(Each commit co-authored by: bootjp <1306365+bootjp@users.noreply.github.com>)
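
Reconstructing from these commit messages alone (the merged code may differ), the reworked helper plausibly takes both timestamps and refuses to wrap around:

// abortTSFromSketch derives an abort timestamp strictly greater than both
// startTS and commitTS; commitTS == 0 means "no commit record available".
// The overflow check is defensive, preserving the FSM invariant
// resolveTS > startTS even when base == ^uint64(0).
func abortTSFromSketch(startTS, commitTS uint64) (uint64, error) {
    base := startTS
    if commitTS > base {
        base = commitTS
    }
    abortTS := base + 1
    if abortTS == 0 {
        return 0, errors.New("abort timestamp overflow: base is uint64 max")
    }
    return abortTS, nil
}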
@bootjp bootjp enabled auto-merge February 16, 2026 21:43
@bootjp bootjp merged commit 8d7753a into main Feb 16, 2026
7 checks passed
@bootjp bootjp deleted the feature/multi-shard-transaction branch February 16, 2026 21:45
