Conversation
Merkle verification never tied the 16 candidate pub_keys to the pool midpoint's actual close group on the network. An attacker could generate 16 ML-DSA keypairs locally, point every reward_address at a self-owned wallet, fund the on-chain merkle payment, and receive their own rewards back.

Add a closeness defence: for each MerklePaymentProof, the verifier queries the DHT for the closest peers to the winner_pool's midpoint address and requires a majority of the 16 candidate pub_keys to hash to PeerIds that appear in that returned set. The check hooks in between the cheap signature check and the on-chain RPC, so a forged pool is rejected without paying for an on-chain round-trip. Every storing node that receives the proof runs the same check against the single winner_pool.

PaymentVerifier grows an optional Arc<P2PNode>, attached at startup via attach_p2p_node; node.rs, devnet.rs, and the e2e testnet harness all wire it after P2PNode construction. Unit tests that don't exercise merkle verification keep a None handle and log a warning.

New e2e test test_attack_merkle_pay_yourself_fabricated_pool fails on main (the verifier accepts 16 attacker-generated keys that all reward the payer) and passes with the fix.
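A minimal sketch of the membership count at the heart of that check. Everything here is illustrative: the real verifier works with the project's PeerId and DHT types rather than raw byte arrays, and the threshold is supplied by the caller.

```rust
use std::collections::HashSet;

/// Stand-in for a PeerId: the real check derives PeerIds by hashing the
/// 16 ML-DSA candidate pub_keys carried in the MerklePaymentProof.
type PeerIdBytes = [u8; 32];

/// `network_close_peers` is what the DHT returned for the winner_pool's
/// midpoint address; `candidate_peer_ids` are derived from the candidate
/// pub_keys. The pool only passes if enough candidates are peers the
/// network itself places in that neighbourhood.
fn candidates_are_close(
    network_close_peers: &HashSet<PeerIdBytes>,
    candidate_peer_ids: &[PeerIdBytes; 16],
    required_matches: usize,
) -> bool {
    let matched = candidate_peer_ids
        .iter()
        .filter(|id| network_close_peers.contains(*id))
        .count();
    matched >= required_matches
}
```

An attacker who fabricates all 16 keys has no influence over what the DHT returns for the midpoint, so the count stays near zero unless the attacker also occupies that neighbourhood (the Sybil caveat documented further down).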
Security review flagged three issues with the initial closeness check:
1. **Fail-open on missing P2PNode** was unsafe in production: any startup
path that forgot to call attach_p2p_node would silently disable the
defence and only log a warning. Fail CLOSED in release builds — the
warn->error and Ok->Err flip surfaces the wiring bug as a rejected
PUT rather than a silent accept. Test-utils builds keep fail-open
so unit tests don't need a real DHT.
2. **9/16 majority was too permissive**: an attacker controlling 9
   neighbourhood peers could fabricate the remaining 7 and still pass.
   Raise to 13/16 — that tolerates normal routing-table skew (1-3 peers)
   while capping fabricated candidates at 3.
3. **Per-chunk Kademlia lookups were a DoS amplifier**: a 256-chunk
batch sharing one winner_pool would trigger 256 iterative lookups,
and N concurrent forged-pool PUTs for the same hash would trigger N
parallel lookups. Add:
- closeness_pass_cache: pool_hash -> () so within-batch repeats
and cross-connection replays are free after the first check.
- inflight_closeness: pool_hash -> Notify so concurrent PUTs for
the same pool coalesce to a single lookup behind a drop-guarded
waker.
Also bump the per-lookup timeout from 10s to 60s. Iterative Kademlia
lookups with dial cascades of 20-30s per unresponsive peer were timing
out under normal churn at 10s — false-rejecting honest proofs. 60s
keeps DoS bounded while being wide enough for real-world convergence.
Document the known Sybil-grinding limitation: midpoint_proof.address()
is BLAKE3 of attacker-controllable inputs, so an attacker who also
runs Sybil DHT nodes can grind it to land in their own neighbourhood.
Closing that gap needs a Sybil-resistance layer or an
attacker-uncontrolled midpoint binding (on-chain VRF / block hash).
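A condensed sketch of how the fail-open/fail-closed flip and the thresholds above might look. The constant names follow the review text (CANDIDATE_CLOSENESS_REQUIRED also appears in the review comments further down); the gate function, its signature, and the use of cfg! instead of #[cfg] blocks are illustrative simplifications, not the verifier's actual API.

```rust
/// 13 of the 16 candidate pub_keys must map onto peers the DHT itself
/// returned for the pool midpoint.
const CANDIDATE_CLOSENESS_REQUIRED: usize = 13;

/// Per-lookup budget: wide enough for iterative Kademlia convergence under
/// churn, narrow enough to keep forged-pool DoS bounded.
const CLOSENESS_LOOKUP_TIMEOUT_SECS: u64 = 60;

/// `dht` is whatever attach_p2p_node installed; None means the wiring step
/// was skipped somewhere in startup.
fn closeness_gate<T>(dht: Option<T>) -> Result<Option<T>, String> {
    match dht {
        Some(handle) => Ok(Some(handle)),
        // Test-utils / unit-test builds keep the old fail-open behaviour:
        // skip the check and warn, so tests don't need a live DHT.
        None if cfg!(any(test, feature = "test-utils")) => {
            eprintln!("warning: no P2PNode attached; skipping merkle closeness check");
            Ok(None)
        }
        // Release builds fail CLOSED: a forgotten attach_p2p_node call shows
        // up as a rejected PUT instead of a silently disabled defence.
        None => Err("merkle closeness defence is not wired (no P2PNode attached)".into()),
    }
}
```

The two caches from item 3 (closeness_pass_cache and inflight_closeness) sit behind this gate; their single-flight slot is sketched after the next review note.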
… single-flight

Second concurrency review flagged three real problems with the single-flight closeness check:

1. **Lost-wakeup race**. `notify.notified().await` creates the Notified future AFTER the inflight lock is released. If the leader calls `notify_waiters()` between the waiter's lock drop and the future's first poll, the notification is lost and the waiter parks forever — there was no outer timeout around the wait. Fixed by switching to `Arc<Notify>::notified_owned()`, which snapshots the `notify_waiters` counter at call time; the counter check on the first poll resolves the future immediately even if the leader already fired.

2. **LRU eviction orphans waiters**. When `DEFAULT_POOL_CACHE_CAPACITY` (1000) unique pool_hashes are in flight, LruCache evicts the oldest `Notify` entry. The leader's drop guard did `slot.pop(&pool_hash)`, which, for an evicted entry, popped nothing and called `notify_waiters` on — nothing. Worse, a subsequent leader for the same pool_hash inserts a fresh `Notify`, which the original leader then erroneously pops. Both failure modes orphan waiters. Fixed by having the leader own an `Arc<ClosenessSlot>` independent of the cache and calling `notify_waiters` on THAT; the pop only runs if `Arc::ptr_eq` confirms the cache still holds our slot.

3. **Failure-path waiters each re-run the lookup**. On leader failure the pass cache wasn't written, waiters woke, saw a cache miss, and all became new leaders running their own Kademlia lookups. That defeats the documented DoS bound of "N forged PUTs cost at most one lookup". Fixed by sharing the leader's `Result<(), String>` through an `Arc<ClosenessSlot>` with a `OnceLock`; waiters read the stored result directly and skip re-verification on both success AND failure.

Other changes:
- Loud startup error log if a binary is accidentally built with the `test-utils` feature, since that flips the closeness defence to fail-open when P2PNode isn't attached.
- Two new unit tests exercise the cache short-circuit and the concurrent-waiter single-flight path.
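A minimal illustration of the slot those three fixes converge on, using only std and tokio primitives. ClosenessSlot here is a stand-in, not the real struct: the Arc::ptr_eq-guarded cache pop is only described in a comment, and the waiter uses a plain notified()-then-recheck ordering, which gives the same lost-wakeup guarantee as the notified_owned() call the commit describes.

```rust
use std::sync::{Arc, OnceLock};
use tokio::sync::Notify;

/// Illustrative single-flight slot: the leader publishes exactly one result
/// and every waiter reads it, on success AND on failure, so a pool_hash
/// costs at most one Kademlia lookup regardless of concurrency.
#[derive(Default)]
struct ClosenessSlot {
    result: OnceLock<Result<(), String>>,
    notify: Notify,
}

impl ClosenessSlot {
    /// Leader side: store the outcome, then wake every waiter parked on THIS
    /// slot. The leader holds its own Arc to the slot, so this still works if
    /// the LRU map has evicted or replaced the entry in the meantime; the
    /// real code additionally pops the cache entry only when Arc::ptr_eq
    /// confirms the cached slot is still ours.
    fn publish(&self, outcome: Result<(), String>) {
        let _ = self.result.set(outcome);
        self.notify.notify_waiters();
    }

    /// Waiter side: create the Notified future FIRST, then re-check the
    /// published result, then await. A publish landing between the re-check
    /// and the await still wakes us, because the future already existed when
    /// notify_waiters() ran. None means the leader went away without
    /// publishing (cancellation); the caller decides whether to retry.
    async fn wait(&self) -> Option<Result<(), String>> {
        let parked = self.notify.notified();
        if let Some(early) = self.result.get() {
            return Some(early.clone());
        }
        parked.await;
        self.result.get().cloned()
    }
}
```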
Third review pass found two remaining issues:

1. **Unbounded retry loop on leader cancellation cascade**. The loop retries forever if every leader gets cancelled before publishing. Add `MAX_LEADER_RETRIES = 4` so waiters return a visible error rather than spinning through indefinite cancellation cycles.

2. **Single-flight test didn't actually exercise the waiter path**. The existing concurrent-readers test relies on timing and might run both calls serially through the leader path. Add `closeness_waiter_reads_leaders_published_failure`, which seeds the inflight slot, spawns a waiter task, yields until the waiter is parked on `notified_owned().await`, then externally publishes a failure + notifies — proving the waiter reads the leader's published result directly without running its own inner check.
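The retry cap itself is just a bounded loop around the leader-or-waiter choice; the subtler part is making the waiter path deterministic in a test. A sketch of that shape, written against the illustrative ClosenessSlot above rather than the verifier's real inflight map and helpers:

```rust
/// Illustrative cap on how many times a caller will re-enter the
/// leader-election loop after a leader is cancelled without publishing.
const MAX_LEADER_RETRIES: usize = 4;

#[tokio::test]
async fn waiter_reads_externally_published_failure_sketch() {
    use std::sync::Arc;

    // Park a waiter on a seeded slot, then publish a failure from "outside"
    // and check the waiter reports that failure without running any lookup
    // of its own.
    let slot = Arc::new(ClosenessSlot::default());

    let waiter = {
        let slot = Arc::clone(&slot);
        tokio::spawn(async move { slot.wait().await })
    };
    // On the default current-thread test runtime this lets the waiter run up
    // to its notified().await before we publish. Even if it didn't, the
    // result re-check inside wait() makes the outcome the same.
    tokio::task::yield_now().await;

    slot.publish(Err("lookup failed".to_string()));

    let seen = waiter.await.expect("waiter task panicked");
    assert_eq!(seen, Some(Err("lookup failed".to_string())));
}
```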
Point saorsa-core at the saorsa-labs/saorsa-core rc-2026.4.2 branch and refresh Cargo.lock to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(payment): block merkle pay-yourself via DHT closeness check
The previous bump set the version to 0.24.0-rc.1 by mistake; this restores the intended 0.11.0 RC series for rc-2026.4.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up the latest saorsa-core commit on the rc-2026.4.2 branch and refresh Cargo.lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch saorsa-core from the RC branch git reference back to the published 0.24.0 release on crates.io (which pulls in the published saorsa-transport 0.33.0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Release promotion to ant-node 0.11.0: this PR switches saorsa-core back to the crates.io 0.24.0 release, refreshes the lockfile, and adds a merkle-payment “pay-yourself” defence that verifies candidate closeness against the live DHT, wiring P2PNode into the verifier across the runtime startup paths.
Changes:
- Bump ant-node to 0.11.0, update saorsa-core to 0.24.0 (crates.io), and refresh Cargo.lock.
- Add merkle candidate “closeness” verification in PaymentVerifier with caching/single-flight to reduce lookup cost.
- Wire P2PNode into the payment verifier in production/devnet/e2e startup and add an e2e test for the fabricated-candidate-pool (“pay-yourself”) attack.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/payment/verifier.rs | Implements DHT-based merkle candidate closeness checks plus caching/single-flight and associated unit tests. |
| src/node.rs | Attaches the live P2PNode to the payment verifier during node startup. |
| src/devnet.rs | Attaches P2PNode to the payment verifier for devnet nodes. |
| tests/e2e/testnet.rs | Attaches P2PNode to the payment verifier in the e2e test harness startup. |
| tests/e2e/merkle_payment.rs | Adds an e2e attack test for fabricated merkle candidate pools (“pay-yourself”). |
| Cargo.toml | Bumps the crate version and switches saorsa-core from a git dependency to 0.24.0 on crates.io. |
| Cargo.lock | Lockfile refresh consistent with the dependency/version updates. |
    // list counts as unmatched.
    let network_set: std::collections::HashSet<PeerId> =
        network_peers.iter().map(|n| n.peer_id).collect();
    let matched = candidate_peer_ids
The closeness match count can be artificially inflated by duplicate candidates. matched is computed by iterating all candidate PeerIds and counting set membership, but it does not enforce that candidate pub_key/PeerId values are unique. An attacker who runs (or grinds) a single real close peer could repeat that same pub_key across many of the 16 candidates and satisfy the CANDIDATE_CLOSENESS_REQUIRED threshold without having 13 distinct close peers. Consider rejecting pools with duplicate derived PeerIds and/or counting unique matches (e.g., intersect the set of candidate PeerIds with the network set).
Suggested change (replace the lines above with):

    // list counts as unmatched. Candidate pub_keys / PeerIds must also be
    // unique, otherwise duplicate entries could artificially inflate the
    // closeness match count.
    let network_set: std::collections::HashSet<PeerId> =
        network_peers.iter().map(|n| n.peer_id).collect();
    let candidate_set: std::collections::HashSet<PeerId> =
        candidate_peer_ids.iter().copied().collect();
    if candidate_set.len() != candidate_peer_ids.len() {
        debug!(
            "Merkle closeness rejected: duplicate candidate PeerIds for pool midpoint {}",
            hex::encode(pool_address.0),
        );
        return Err(Error::Payment(
            "Merkle candidate pool rejected: candidate pub_keys must derive to distinct \
             PeerIds. Duplicate candidates are not allowed."
                .into(),
        ));
    }
    let matched = candidate_set
    /// In-flight closeness lookups, keyed by pool hash. Lets concurrent PUTs
    /// for the same pool coalesce onto a single Kademlia lookup AND share
    /// its result — on both success and failure — which bounds `DoS`
    /// amplification to one lookup per unique `pool_hash` regardless of
    /// concurrency.
    inflight_closeness: Mutex<LruCache<PoolHash, Arc<ClosenessSlot>>>,
inflight_closeness is stored in an LruCache, which can evict entries under pressure. If an inflight slot for a given pool_hash is evicted before the leader finishes, new callers won’t find it and will start a second lookup for the same pool_hash, breaking the single-flight/DoS-amplification guarantee described in the comment. Using a non-evicting map for inflight state (and removing on completion), or a dedicated structure sized by actual concurrency, would preserve single-flight semantics for a pool_hash.
Suggested change (replace the lines above with):

    /// In-flight closeness lookups, keyed by pool hash. This must be a
    /// non-evicting map so an active leader's slot cannot disappear before it
    /// publishes its result; entries are removed explicitly on completion.
    /// Lets concurrent PUTs for the same pool coalesce onto a single Kademlia
    /// lookup AND share its result — on both success and failure — which
    /// bounds `DoS` amplification to one lookup per unique `pool_hash`
    /// regardless of concurrency.
    inflight_closeness:
        Mutex<std::collections::HashMap<PoolHash, Arc<ClosenessSlot>>>,
    async fn closeness_single_flight_concurrent_readers_share_one_verification() {
        // Two concurrent callers for the same pool_hash should produce the
        // same outcome, and the cache should end up populated exactly once.
        // We use the test-utils fail-open path to short-circuit the inner
        // DHT lookup; the purpose of this test is the single-flight
        // plumbing, not the lookup itself.
        let verifier = Arc::new(create_test_verifier());
        let pool_hash = [0x77u8; 32];
        let pool = MerklePaymentCandidatePool {
            midpoint_proof: fake_midpoint_proof(),
            candidate_nodes: make_candidate_nodes(1_700_000_000),
        };

        let v1 = Arc::clone(&verifier);
        let p1 = pool.clone();
        let v2 = Arc::clone(&verifier);
        let p2 = pool.clone();

        let (r1, r2) = tokio::join!(
            async move { v1.verify_merkle_candidate_closeness(&p1, pool_hash).await },
            async move { v2.verify_merkle_candidate_closeness(&p2, pool_hash).await },
        );

        assert_eq!(r1.is_ok(), r2.is_ok(), "concurrent callers must agree");
        assert!(
            r1.is_ok(),
            "both callers must succeed on the test-utils path"
        );
This test is intended to validate the single-flight behavior under concurrency, but on the current code path the inner check can complete immediately (fail-open under cfg(test) when no P2PNode is attached). In that case tokio::join! may not actually overlap the two calls, so the test can pass even if coalescing is broken. Consider adding an explicit synchronization point (e.g., barrier/notify) so one task becomes the leader and blocks before publishing, while the second task is forced into the waiter path and must observe the leader’s published result.
    // 4-leaf tree → depth 2 (log2(4)). Prices are all 1024 in
    // build_candidate_nodes, so per-node payment = 1024 * 2^2 / 2 = 2048.
    let depth: u8 = 2;
    let per_node = Amount::from(4096u64);
The comment says the per-node payment should be 1024 * 2^2 / 2 = 2048, but per_node is set to 4096. Overpaying may still make the proof pass (since verification checks >= expected_per_node), but the mismatch makes the test harder to reason about and can hide regressions in the payment formula. Consider either setting per_node to 2048 or updating the comment to match the intended overpayment.
Suggested change (replace the last line above with):

    let per_node = Amount::from(2048u64);
Summary
- Promote rc-2026.4.2 to main
- Bump ant-node from 0.11.0-rc.2 to 0.11.0
- Switch saorsa-core from the RC branch git dependency back to the published crates.io 0.24.0 release (which pulls in saorsa-transport 0.33.0)
- Refresh Cargo.lock

🤖 Generated with Claude Code