feat: avoid clustered restarts during upgrades by jacderida · Pull Request #49 · WithAutonomi/ant-node

jacderida · 2026-03-30T17:12:04Z

Summary

Increase the default staged rollout window from 1 hour to 24 hours, giving nodes more room to spread out their restarts
When a pending upgrade is detected, sleep for exactly the remaining rollout delay instead of waiting for the next check interval tick — this eliminates restart clustering caused by quantization to the check interval
Skip crates.io publish for pre-release versions (RC, alpha, beta) in the release workflow
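The first two points can be illustrated with a minimal std-only sketch. The PR text does not show how a node picks its slot in the rollout window, so `rollout_delay` (hashing a node identifier) is an assumption made for illustration, as are the names `remaining_delay` and `quantized_delay`. The sketch contrasts sleeping for the exact remaining delay with rounding the wake-up to the next check-interval tick, which is the quantization that caused clustering.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

/// Hypothetical: derive a stable per-node delay inside the rollout window.
/// The real node may derive its slot differently; this is only an illustration.
/// `DefaultHasher::new()` is deterministic within one build of the program,
/// but its algorithm is not guaranteed to stay the same across Rust releases.
fn rollout_delay(node_id: &str, window: Duration) -> Duration {
    let mut hasher = DefaultHasher::new();
    node_id.hash(&mut hasher);
    // Guard against a zero-length window to avoid a modulo-by-zero panic.
    let window_secs = window.as_secs().max(1);
    Duration::from_secs(hasher.finish() % window_secs)
}

/// Time still to wait, given the scheduled delay and the time already elapsed
/// since the upgrade was detected. Saturates at zero once the slot has passed.
fn remaining_delay(scheduled: Duration, elapsed: Duration) -> Duration {
    scheduled.saturating_sub(elapsed)
}

/// Effect of waking only on check-interval ticks: the scheduled delay is
/// rounded up to a multiple of the interval, so many nodes share a wake-up time.
fn quantized_delay(scheduled: Duration, check_interval: Duration) -> Duration {
    let tick = check_interval.as_secs().max(1);
    let ticks = (scheduled.as_secs() + tick - 1) / tick;
    Duration::from_secs(ticks * tick)
}
```

With a one-hour check interval, every node scheduled anywhere in the second hour of the window would wake at the two-hour mark under `quantized_delay`, whereas `remaining_delay` preserves each node's own slot.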

Test plan

Validated with 100-node testnet using 2-hour rollout window
Scheduled restart times are uniformly distributed across the rollout window
Actual restart times match scheduled times (no burst clustering)
All existing unit tests pass
Clippy and cargo fmt pass

Test results

Auto-Upgrade Test Results: DEV-01

All 91 nodes (90 regular + 1 genesis) successfully upgraded from v0.7.0 to v0.7.10-rc.1.

Check	Result
All nodes upgrade to v0.7.10-rc.1	PASS (91/91)
Binary downloaded once per host	PASS (1 download per VM, rest cached)
Release info fetched once per host	PASS (2-3 fetches due to cache TTL over 2hr window, 34-53 cache hits each)
No upgrade errors	PASS (zero errors found)
Peer ID retention	PASS (identical before/after)
Port retention	N/A (nodes use random ports by design with 0.0.0.0:0)
Restart time distribution	PASS (spread across full 2-hour window, 2-11 per 10-min bucket, no clustering)
Graceful shutdown logged	PASS (91/91)
NRestarts = 1	PASS (91/91)
No /proc/PID/exe (deleted)	PASS (91/91 - no stale binaries)
systemd stop/start cycle	PASS
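The restart-distribution check in the table counts restarts per 10-minute bucket. A minimal sketch of that bucketing, assuming restart times are available as second offsets from the start of the rollout window; the function name `bucket_counts` and the input shape are assumptions, not part of the test tooling used for this PR.

```rust
/// Count restarts per fixed-width time bucket.
/// `offsets_secs` holds each restart time as seconds since the rollout window began.
/// Offsets at or beyond the window end are counted in the last bucket.
fn bucket_counts(offsets_secs: &[u64], window_secs: u64, bucket_secs: u64) -> Vec<usize> {
    assert!(bucket_secs > 0, "bucket width must be non-zero");
    // Round up so a partial trailing bucket is still represented.
    let bucket_total = ((window_secs + bucket_secs - 1) / bucket_secs).max(1) as usize;
    let mut counts = vec![0usize; bucket_total];
    for &offset in offsets_secs {
        let index = ((offset / bucket_secs) as usize).min(bucket_total - 1);
        counts[index] += 1;
    }
    counts
}

/// Largest and smallest bucket counts, for a quick "no clustering" check.
fn min_max(counts: &[usize]) -> Option<(usize, usize)> {
    let min = counts.iter().copied().min()?;
    let max = counts.iter().copied().max()?;
    Some((min, max))
}
```

For a 2-hour window (7200 seconds) and 10-minute buckets (600 seconds), this yields 12 buckets; a clustered rollout would show up as one or two buckets holding most of the restarts.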

Key findings:

Binary caching works - only 1 download per host, subsequent nodes detect the binary was already replaced
Restart distribution is even - no clustered bursts, scheduled-to-actual accuracy within 5-11 seconds
No stale binaries - the (deleted) issue from previous tests is fully resolved
All process restarts verified via systemd NRestarts=1 and journalctl stop/start cycles

🤖 Generated with Claude Code

Prevents clustered restarts when a new release is published by spreading node upgrades evenly across a 24-hour window instead of 1 hour.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When an upgrade is pending, the monitor task now sleeps for precisely the remaining rollout delay rather than waiting for the next check interval tick. This eliminates restart clustering caused by quantization to the check interval.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pre-release versions (alpha, beta, rc) should not be published to crates.io. Also removes publish-crate from the release job dependency chain so pre-release GitHub releases aren't blocked.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
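The pre-release rule itself lives in the GitHub Actions workflow, whose condition is not shown here. As a sketch of the same decision in Rust, assuming tags follow Semantic Versioning with an optional leading `v`, a version counts as a pre-release when its core part carries a hyphenated suffix such as `-rc.1`. The function name `is_prerelease` is hypothetical.

```rust
/// Returns true when a tag such as "v0.7.10-rc.1" denotes a pre-release.
/// Under Semantic Versioning, the pre-release part follows a hyphen after
/// MAJOR.MINOR.PATCH, and build metadata follows a plus sign.
fn is_prerelease(tag: &str) -> bool {
    // Accept tags with or without a leading "v".
    let version = tag.strip_prefix('v').unwrap_or(tag);
    // Build metadata may itself contain hyphens, so drop it before checking.
    let without_build = version.split('+').next().unwrap_or(version);
    without_build.contains('-')
}
```

Under this rule `v0.7.10-rc.1` is skipped for crates.io publishing, while `v0.7.0` is published.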

Copilot

Pull request overview

Reduces upgrade-induced restart clustering by widening the staged rollout window and aligning sleep timing with each node’s scheduled upgrade time; also updates the release workflow to skip crates.io publishing for prereleases.

Changes:

Increase default staged rollout window from 1 hour to 24 hours.
When an upgrade is pending, sleep for the exact remaining rollout delay rather than waiting for the next check interval tick.
Skip crates.io publish for prerelease tags in the GitHub Actions release workflow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`src/node.rs`	Adjusts upgrade-monitor loop to sleep until scheduled upgrade time to avoid quantization/clustering.
`src/config.rs`	Changes default staged rollout window from 1h to 24h.
`.github/workflows/release.yml`	Skips crates.io publishing for prereleases and adjusts job dependencies for release creation.


- Add backoff when rollout delay has elapsed but upgrade failed, to prevent a tight retry loop on Duration::ZERO
- Wrap upgrade sleep in tokio::select! with shutdown.cancelled() so shutdown can interrupt long rollout delay sleeps
- Restore publish-crate dependency on release job with conditional logic: release proceeds if publish-crate succeeds or was skipped (pre-release), but blocks if it fails (stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
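The two monitor-loop fixes in this commit can be sketched without an async runtime. The commit uses `tokio::select!` with `shutdown.cancelled()`; the sketch below swaps in a std-only analogue, `Receiver::recv_timeout`, to show the same shape: a long sleep that a shutdown signal can cut short. The backoff value `RETRY_BACKOFF` and the names `next_sleep`, `wait_or_shutdown`, and `WaitOutcome` are assumptions for illustration, not the node's actual code.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical backoff applied after a failed upgrade attempt.
const RETRY_BACKOFF: Duration = Duration::from_secs(60);

/// Choose how long to sleep before the next upgrade attempt.
/// Once the rollout delay has elapsed the remaining time is zero, so a failed
/// attempt would otherwise be retried immediately in a tight loop.
fn next_sleep(remaining: Duration, last_attempt_failed: bool) -> Duration {
    if remaining.is_zero() && last_attempt_failed {
        RETRY_BACKOFF
    } else {
        remaining
    }
}

#[derive(Debug, PartialEq, Eq)]
enum WaitOutcome {
    /// The full delay passed; the upgrade can proceed.
    DelayElapsed,
    /// Shutdown was requested before the delay ended.
    ShutdownRequested,
}

/// Sleep for `delay`, returning early if a shutdown signal arrives.
/// A dropped sender is treated as a shutdown request (a design choice here),
/// so the loop cannot outlive the component that owns it.
fn wait_or_shutdown(delay: Duration, shutdown: &Receiver<()>) -> WaitOutcome {
    match shutdown.recv_timeout(delay) {
        Ok(()) | Err(RecvTimeoutError::Disconnected) => WaitOutcome::ShutdownRequested,
        Err(RecvTimeoutError::Timeout) => WaitOutcome::DelayElapsed,
    }
}
```

With a 24-hour rollout window the sleep can last many hours, so making it interruptible matters far more than it did with per-tick waits: without it, a shutdown request could not end the sleep until the rollout delay expired.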

Add unit and e2e tests covering the remaining Section 18 scenarios:

Unit tests (32 new):
- Quorum: #4 fail→abandoned, #16 timeout→inconclusive, #27 single-round dual-evidence, #28 dynamic threshold undersized, #33 batched per-key, #34 partial response unresolved, #42 quorum-derived paid-list auth
- Admission: #5 unauthorized peer, #7 out-of-range rejected
- Config: #18 invalid config rejected, #26 dynamic paid threshold
- Scheduling: #8 dedup safety, #8 replica/paid collapse
- Neighbor sync: #35 round-robin cooldown skip, #36 cycle completion, #38 snapshot stability mid-join, #39 unreachable removal + slot fill, #40 cooldown peer removed, #41 cycle termination guarantee, consecutive rounds, cycle preserves sync times
- Pruning: #50 hysteresis prevents premature delete, #51 timestamp reset on heal, #52 paid/record timestamps independent, #23 entry removal
- Audit: #19/#53 partial failure mixed responsibility, #54 all pass, #55 empty failure discard, #56 repair opportunity filter, response count validation, digest uses full record bytes
- Types: #13 bootstrap drain, repair opportunity edge cases, terminal state variants
- Bootstrap claims: #46 first-seen recorded, #49 cleared on normal

E2e tests (4 new):
- #2 fresh offer with empty PoP rejected
- #5/#37 neighbor sync request returns response
- #11 audit challenge multi-key (present + absent)
- Fetch not-found for non-existent key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jacderida and others added 3 commits March 29, 2026 16:16

feat: increase staged rollout window from 1 hour to 24 hours

37e00f7


Copilot AI review requested due to automatic review settings March 30, 2026 17:12

Copilot started reviewing on behalf of jacderida March 30, 2026 17:12

Copilot AI reviewed Mar 30, 2026


Comment thread src/node.rs Outdated

Comment thread src/node.rs Outdated

Comment thread .github/workflows/release.yml Outdated

mickvandijke approved these changes Mar 30, 2026


jacderida merged commit 8495d16 into main Mar 30, 2026
17 checks passed

jacderida deleted the feat-avoid_clustered_restarts branch March 30, 2026 17:32

jacderida merged 4 commits into main from feat-avoid_clustered_restarts
