
Release 2026.4.2: ant-node 0.11.0#79

Merged
jacderida merged 9 commits into main from rc-2026.4.2 on Apr 24, 2026

Conversation

@jacderida
Collaborator

Summary

  • Promote RC branch rc-2026.4.2 to main
  • Bump ant-node from 0.11.0-rc.2 to 0.11.0
  • Upgrade saorsa-core from the RC branch git dependency back to the published crates.io 0.24.0 release (which pulls in saorsa-transport 0.33.0)
  • Refresh Cargo.lock

🤖 Generated with Claude Code

grumbach and others added 9 commits April 22, 2026 16:57
Merkle verification never tied the 16 candidate pub_keys to the pool
midpoint's actual close group on the network. An attacker could generate
16 ML-DSA keypairs locally, point every reward_address at a self-owned
wallet, fund the on-chain merkle payment, and receive their own rewards
back.

Add a closeness defence: for each MerklePaymentProof, the verifier
queries the DHT for the closest peers to the winner_pool's midpoint
address and requires a majority of the 16 candidate pub_keys to hash to
PeerIds that appear in that returned set. The check is hooked in between
the cheap signature check and the on-chain RPC, so a forged pool is
rejected without paying for an on-chain round-trip. Every storing node
that receives the proof runs the same check against the single winner_pool.
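
In outline, the check is an intersection count. A minimal sketch with placeholder types (the real `PeerId` and the Kademlia lookup come from the node's P2P layer, and the threshold is shown as a parameter):

```rust
use std::collections::HashSet;

// Placeholder: the real PeerId comes from the P2P layer; each candidate
// pub_key is hashed down to this form before comparison.
type PeerId = [u8; 32];

/// Sketch of the closeness defence: count how many of the 16 candidate
/// PeerIds appear among the peers the DHT returned as closest to the pool
/// midpoint, and require at least `required` matches.
fn candidates_are_close(
    candidate_peer_ids: &[PeerId],
    closest_from_dht: &[PeerId],
    required: usize,
) -> bool {
    let network_set: HashSet<&PeerId> = closest_from_dht.iter().collect();
    candidate_peer_ids
        .iter()
        .filter(|id| network_set.contains(*id))
        .count()
        >= required
}
```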

PaymentVerifier grows an optional Arc<P2PNode> attached at startup via
attach_p2p_node; node.rs, devnet.rs, and the e2e testnet harness all
wire it after P2PNode construction. Unit tests that don't exercise
merkle verification keep a None handle and log a warning.
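
A minimal sketch of that wiring, treating P2PNode as an opaque handle and leaving aside whether the real field uses interior mutability:

```rust
use std::sync::Arc;

// Opaque stand-in for the node's P2P/DHT handle.
struct P2PNode;

struct PaymentVerifier {
    // None until a runtime path wires in the live handle; without it the
    // merkle closeness check cannot run a real Kademlia lookup.
    p2p_node: Option<Arc<P2PNode>>,
    // ... other verifier state elided ...
}

impl PaymentVerifier {
    // Called from node.rs, devnet.rs, and the e2e testnet harness right
    // after the P2PNode has been constructed.
    fn attach_p2p_node(&mut self, node: Arc<P2PNode>) {
        self.p2p_node = Some(node);
    }
}
```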

New e2e test test_attack_merkle_pay_yourself_fabricated_pool fails on
main (verifier accepts 16 attacker-generated keys all rewarding the
payer) and passes with the fix.
Security review flagged three issues with the initial closeness check:

1. **Fail-open on missing P2PNode** was unsafe in production: any startup
   path that forgot to call attach_p2p_node would silently disable the
   defence and only log a warning. Fail CLOSED in release builds: the
   warn->error and Ok->Err flip surfaces the wiring bug as a rejected
   PUT rather than a silent accept. Test-utils builds keep fail-open
   so unit tests don't need a real DHT (see the sketch after this list).

2. **9/16 majority was too permissive**: with only 9 of the 16 candidates
   required to match, an attacker controlling 9 neighbourhood peers could
   fabricate the remaining 7 and still pass. Raise to 13/16, which
   tolerates normal routing-table skew (1-3 peers) without letting an
   attacker plant 7 fabricated candidates.

3. **Per-chunk Kademlia lookups were a DoS amplifier**: a 256-chunk
   batch sharing one winner_pool would trigger 256 iterative lookups,
   and N concurrent forged-pool PUTs for the same hash would trigger N
   parallel lookups. Add:
     - closeness_pass_cache: pool_hash -> () so within-batch repeats
       and cross-connection replays are free after the first check.
     - inflight_closeness: pool_hash -> Notify so concurrent PUTs for
       the same pool coalesce to a single lookup behind a drop-guarded
       waker.
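
A minimal sketch of the fail-mode decision from point 1, assuming the `test-utils` cargo feature and reducing everything else to a plain function; the real logic sits inside PaymentVerifier's closeness check and uses its own error and logging types:

```rust
/// What happens when the closeness check runs but no P2PNode was attached.
fn closeness_without_p2p() -> Result<(), String> {
    if cfg!(feature = "test-utils") {
        // Fail-open: unit tests that never attach a DHT handle log a
        // warning and let verification proceed.
        eprintln!("warning: no P2PNode attached; merkle closeness check skipped");
        Ok(())
    } else {
        // Fail-closed in release builds: a startup path that forgot to call
        // attach_p2p_node surfaces as a rejected PUT, not a silent accept.
        Err("merkle closeness check unavailable: no P2PNode attached".into())
    }
}
```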

Also bump the per-lookup timeout from 10s to 60s. Iterative Kademlia
lookups with dial cascades of 20-30s per unresponsive peer were timing
out under normal churn at 10s — false-rejecting honest proofs. 60s
keeps DoS bounded while being wide enough for real-world convergence.

Document the known Sybil-grinding limitation: midpoint_proof.address()
is BLAKE3 of attacker-controllable inputs, so an attacker who also
runs Sybil DHT nodes can grind it to land in their own neighbourhood.
Closing that gap needs a Sybil-resistance layer or an
attacker-uncontrolled midpoint binding (on-chain VRF / block hash).
… single-flight

Second concurrency review flagged three real problems with the
single-flight closeness check:

1. **Lost-wakeup race**. `notify.notified().await` creates the Notified
   future AFTER the inflight lock is released. If the leader calls
   `notify_waiters()` between the waiter's lock drop and the future's
   first poll, the notification is lost and the waiter parks forever —
   there was no outer timeout around the wait. Fixed by switching to
   `Arc<Notify>::notified_owned()`, which snapshots the
   `notify_waiters` counter at call time; the counter check on the
   first poll resolves the future immediately even if the leader
   already fired.

2. **LRU eviction orphans waiters**. When `DEFAULT_POOL_CACHE_CAPACITY`
   (1000) unique pool_hashes are in flight, LruCache evicts the
   oldest `Notify` entry. The leader's drop guard did
   `slot.pop(&pool_hash)`, which for an evicted entry popped nothing,
   so no waiter was ever notified. Worse, a subsequent
   leader for the same pool_hash inserts a fresh `Notify`, which the
   original leader then erroneously pops. Both failure modes orphan
   waiters. Fixed by having the leader own an `Arc<ClosenessSlot>`
   independent of the cache and calling `notify_waiters` on THAT;
   the pop only runs if `Arc::ptr_eq` confirms the cache still holds
   our slot.

3. **Failure-path waiters each re-run the lookup**. On leader failure
   the pass cache wasn't written, waiters woke, saw a cache miss, and
   all became new leaders running their own Kademlia lookups. That
   defeats the documented DoS bound of "N forged PUTs cost at most
   one lookup". Fixed by sharing the leader's `Result<(), String>`
   through an `Arc<ClosenessSlot>` with a `OnceLock`; waiters read
   the stored result directly and skip re-verification on both
   success AND failure (see the sketch after this list).
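
A minimal sketch of the slot design from points 1 and 3, assuming a tokio version that exposes `Notify::notified_owned` as described above; the names are illustrative stand-ins for the real types in src/payment/verifier.rs:

```rust
use std::sync::{Arc, OnceLock};
use tokio::sync::Notify;

/// Shared per-pool_hash slot: the leader publishes its outcome exactly once
/// and wakes everyone; waiters create their owned Notified future *before*
/// checking for a published result, so a notify_waiters() racing with the
/// check is not lost.
struct ClosenessSlot {
    result: OnceLock<Result<(), String>>,
    notify: Arc<Notify>,
}

impl ClosenessSlot {
    fn new() -> Self {
        Self { result: OnceLock::new(), notify: Arc::new(Notify::new()) }
    }

    /// Leader path: store the outcome (success or failure) and wake waiters.
    fn publish(&self, outcome: Result<(), String>) {
        let _ = self.result.set(outcome);
        self.notify.notify_waiters();
    }

    /// Waiter path: read the leader's published result instead of re-running
    /// the lookup, on success and failure alike.
    async fn wait(&self) -> Result<(), String> {
        loop {
            // Created before the result check, per the counter-snapshot
            // behaviour described in point 1.
            let notified = Arc::clone(&self.notify).notified_owned();
            if let Some(outcome) = self.result.get() {
                return outcome.clone();
            }
            notified.await;
        }
    }
}
```

The eviction hazard from point 2 lives outside the slot itself: the leader keeps its own `Arc<ClosenessSlot>`, publishes on that, and only pops the cache entry when `Arc::ptr_eq` confirms the cache still holds the same slot.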

Other changes:

- Loud startup error log if a binary is accidentally built with the
  `test-utils` feature, since that flips the closeness defence to
  fail-open when P2PNode isn't attached.
- Two new unit tests exercise the cache short-circuit and the
  concurrent-waiter single-flight path.
Third review pass found two remaining issues:

1. **Unbounded retry loop on leader cancellation cascade**. The loop
   retries forever if every leader gets cancelled before publishing.
   Add `MAX_LEADER_RETRIES = 4` so waiters return a visible error
   rather than spinning through indefinite cancellation cycles
   (sketched after this list).

2. **Single-flight test didn't actually exercise the waiter path**.
   The existing concurrent-readers test relies on timing and might
   run both calls serially through the leader path. Add
   `closeness_waiter_reads_leaders_published_failure`, which seeds
   the inflight slot, spawns a waiter task, yields until the waiter
   is parked on `notified_owned().await`, then externally publishes
   a failure + notifies — proving the waiter reads the leader's
   published result directly without running its own inner check.
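
A minimal sketch of the retry bound from point 1, with the per-round leader/waiter logic abstracted into a caller-supplied future; the constant value matches the commit text, the rest is illustrative:

```rust
const MAX_LEADER_RETRIES: usize = 4;

/// Each round either observes a published outcome (Some) or finds that the
/// leader was cancelled before publishing (None). After MAX_LEADER_RETRIES
/// empty rounds the waiter returns a visible error instead of spinning.
async fn wait_with_bounded_retries<F, Fut>(mut one_round: F) -> Result<(), String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Option<Result<(), String>>>,
{
    for _ in 0..MAX_LEADER_RETRIES {
        if let Some(outcome) = one_round().await {
            return outcome;
        }
        // Leader cancelled before publishing its result; contend again.
    }
    Err("closeness check: leader cancelled repeatedly, giving up".into())
}
```
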
Point saorsa-core at the saorsa-labs/saorsa-core rc-2026.4.2 branch
and refresh Cargo.lock to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(payment): block merkle pay-yourself via DHT closeness check
The previous bump set the version to 0.24.0-rc.1 by mistake; this
restores the intended 0.11.0 RC series for rc-2026.4.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up the latest saorsa-core commit on the rc-2026.4.2 branch
and refresh Cargo.lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch saorsa-core from the RC branch git reference back to the
published 0.24.0 release on crates.io (which pulls in the published
saorsa-transport 0.33.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:56

Copilot AI left a comment


Pull request overview

Release promotion to ant-node 0.11.0, switching saorsa-core back to the crates.io 0.24.0 release and refreshing the lockfile, while adding a merkle-payment “pay-yourself” defence that verifies candidate closeness against the live DHT (and wiring P2P into the verifier across runtime paths).

Changes:

  • Bump ant-node to 0.11.0, update saorsa-core to 0.24.0 (crates.io), and refresh Cargo.lock.
  • Add merkle candidate “closeness” verification in PaymentVerifier with caching/single-flight to reduce lookup cost.
  • Wire P2PNode into the payment verifier in production/devnet/e2e startup and add an e2e test for the fabricated-candidate-pool (“pay-yourself”) attack.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/payment/verifier.rs | Implements DHT-based merkle candidate closeness checks plus caching/single-flight and associated unit tests. |
| src/node.rs | Attaches the live P2PNode to the payment verifier during node startup. |
| src/devnet.rs | Attaches P2PNode to the payment verifier for devnet nodes. |
| tests/e2e/testnet.rs | Attaches P2PNode to the payment verifier in the e2e test harness startup. |
| tests/e2e/merkle_payment.rs | Adds an e2e attack test for fabricated merkle candidate pools ("pay-yourself"). |
| Cargo.toml | Bumps the crate version and switches saorsa-core from a git dependency to 0.24.0 on crates.io. |
| Cargo.lock | Lockfile refresh consistent with the dependency/version updates. |


Comment thread src/payment/verifier.rs
Comment on lines +863 to +866
// list counts as unmatched.
let network_set: std::collections::HashSet<PeerId> =
    network_peers.iter().map(|n| n.peer_id).collect();
let matched = candidate_peer_ids

Copilot AI Apr 24, 2026


The closeness match count can be artificially inflated by duplicate candidates. matched is computed by iterating all candidate PeerIds and counting set membership, but it does not enforce that candidate pub_key/PeerId values are unique. An attacker who runs (or grinds) a single real close peer could repeat that same pub_key across many of the 16 candidates and satisfy the CANDIDATE_CLOSENESS_REQUIRED threshold without having 13 distinct close peers. Consider rejecting pools with duplicate derived PeerIds and/or counting unique matches (e.g., intersect the set of candidate PeerIds with the network set).

Suggested change
-// list counts as unmatched.
-let network_set: std::collections::HashSet<PeerId> =
-    network_peers.iter().map(|n| n.peer_id).collect();
-let matched = candidate_peer_ids
+// list counts as unmatched. Candidate pub_keys / PeerIds must also be
+// unique, otherwise duplicate entries could artificially inflate the
+// closeness match count.
+let network_set: std::collections::HashSet<PeerId> =
+    network_peers.iter().map(|n| n.peer_id).collect();
+let candidate_set: std::collections::HashSet<PeerId> =
+    candidate_peer_ids.iter().copied().collect();
+if candidate_set.len() != candidate_peer_ids.len() {
+    debug!(
+        "Merkle closeness rejected: duplicate candidate PeerIds for pool midpoint {}",
+        hex::encode(pool_address.0),
+    );
+    return Err(Error::Payment(
+        "Merkle candidate pool rejected: candidate pub_keys must derive to distinct \
+         PeerIds. Duplicate candidates are not allowed."
+            .into(),
+    ));
+}
+let matched = candidate_set

Comment thread src/payment/verifier.rs
Comment on lines +128 to +133
/// In-flight closeness lookups, keyed by pool hash. Lets concurrent PUTs
/// for the same pool coalesce onto a single Kademlia lookup AND share
/// its result — on both success and failure — which bounds `DoS`
/// amplification to one lookup per unique `pool_hash` regardless of
/// concurrency.
inflight_closeness: Mutex<LruCache<PoolHash, Arc<ClosenessSlot>>>,

Copilot AI Apr 24, 2026


inflight_closeness is stored in an LruCache, which can evict entries under pressure. If an inflight slot for a given pool_hash is evicted before the leader finishes, new callers won’t find it and will start a second lookup for the same pool_hash, breaking the single-flight/DoS-amplification guarantee described in the comment. Using a non-evicting map for inflight state (and removing on completion), or a dedicated structure sized by actual concurrency, would preserve single-flight semantics for a pool_hash.

Suggested change
-/// In-flight closeness lookups, keyed by pool hash. Lets concurrent PUTs
-/// for the same pool coalesce onto a single Kademlia lookup AND share
-/// its result — on both success and failure — which bounds `DoS`
-/// amplification to one lookup per unique `pool_hash` regardless of
-/// concurrency.
-inflight_closeness: Mutex<LruCache<PoolHash, Arc<ClosenessSlot>>>,
+/// In-flight closeness lookups, keyed by pool hash. This must be a
+/// non-evicting map so an active leader's slot cannot disappear before it
+/// publishes its result; entries are removed explicitly on completion.
+/// Lets concurrent PUTs for the same pool coalesce onto a single Kademlia
+/// lookup AND share its result — on both success and failure — which
+/// bounds `DoS` amplification to one lookup per unique `pool_hash`
+/// regardless of concurrency.
+inflight_closeness:
+    Mutex<std::collections::HashMap<PoolHash, Arc<ClosenessSlot>>>,

Comment thread src/payment/verifier.rs
Comment on lines +1908 to +1935
async fn closeness_single_flight_concurrent_readers_share_one_verification() {
    // Two concurrent callers for the same pool_hash should produce the
    // same outcome, and the cache should end up populated exactly once.
    // We use the test-utils fail-open path to short-circuit the inner
    // DHT lookup; the purpose of this test is the single-flight
    // plumbing, not the lookup itself.
    let verifier = Arc::new(create_test_verifier());
    let pool_hash = [0x77u8; 32];
    let pool = MerklePaymentCandidatePool {
        midpoint_proof: fake_midpoint_proof(),
        candidate_nodes: make_candidate_nodes(1_700_000_000),
    };

    let v1 = Arc::clone(&verifier);
    let p1 = pool.clone();
    let v2 = Arc::clone(&verifier);
    let p2 = pool.clone();

    let (r1, r2) = tokio::join!(
        async move { v1.verify_merkle_candidate_closeness(&p1, pool_hash).await },
        async move { v2.verify_merkle_candidate_closeness(&p2, pool_hash).await },
    );

    assert_eq!(r1.is_ok(), r2.is_ok(), "concurrent callers must agree");
    assert!(
        r1.is_ok(),
        "both callers must succeed on the test-utils path"
    );

Copilot AI Apr 24, 2026


This test is intended to validate the single-flight behavior under concurrency, but on the current code path the inner check can complete immediately (fail-open under cfg(test) when no P2PNode is attached). In that case tokio::join! may not actually overlap the two calls, so the test can pass even if coalescing is broken. Consider adding an explicit synchronization point (e.g., barrier/notify) so one task becomes the leader and blocks before publishing, while the second task is forced into the waiter path and must observe the leader’s published result.

// 4-leaf tree → depth 2 (log2(4)). Prices are all 1024 in
// build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.
let depth: u8 = 2;
let per_node = Amount::from(4096u64);

Copilot AI Apr 24, 2026


The comment says the per-node payment should be 1024 * 2^2 / 2 = 2048, but per_node is set to 4096. Overpaying may still make the proof pass (since verification checks >= expected_per_node), but the mismatch makes the test harder to reason about and can hide regressions in the payment formula. Consider either setting per_node to 2048 or updating the comment to match the intended overpayment.

Suggested change
-let per_node = Amount::from(4096u64);
+let per_node = Amount::from(2048u64);

jacderida merged commit 9d5cb51 into main on Apr 24, 2026
13 of 15 checks passed
jacderida deleted the rc-2026.4.2 branch on April 24, 2026 21:18
