fix(deploy,sync): ensure canister is Running before asset sync by lwshang · Pull Request #614 · dfinity/icp-cli

lwshang · 2026-06-18T15:31:50Z

Problem

icp deploy/icp sync against the asset canister intermittently failed at the sync step, when the sync plugin's first canister call reached the target while it was not serving:

... Canister <id> is stopped and therefore does not have a CallContextManager, error code Some("IC0508")

Investigation found two distinct failure modes:

Mode A — canister genuinely not Running before sync. install_code is status-preserving, so a canister that enters the command already Stopped/Stopping (handed out Stopped from a canister pool, or left Stopped by a deploy that died in the stop→install→start window) is installed but never started, and the plugin's first call hits IC0508. Persistent — reruns don't help. (Why developer-docs preview deploys failed repeatedly.)
Mode B — query path transiently stale right after a restart. When the canister was Running, deploy correctly does stop→upgrade→start, but start_canister only makes it Running in the subnet's certified state. IC query calls are eventually-consistent reads served by a single replica that may still lag the restart's commit height; the plugin's first calls are queries, so they can momentarily observe the just-vacated Stopped state. Transient — a rerun "fixes" it. (Why the always-running docs canister failed intermittently.)

Fix

Mode A — commit e0797fe3:

icp deploy owns the lifecycle (it just installed): after install and before sync it starts every sync-target canister, then leaves it Running. start_canister is idempotent and synchronous per the IC spec, so no status poll is needed. Scoped to canisters with sync steps.
icp sync does NOT own the lifecycle: it prechecks run status and aborts with an actionable error if not Running, replacing the cryptic IC0508. It does not auto-start.
install.rs is deliberately unchanged — it stays status-preserving.
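
The split of responsibilities above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Lifecycle` trait, `FakeMgmt`, `start_sync_targets`, `precheck_sync`, and the error wording are all hypothetical names introduced here, and the in-memory fake simplifies a start to an immediate transition to Running.

```rust
use std::collections::HashMap;

/// Run states named in the PR description.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical stand-in for the management-canister calls (not the icp-cli API).
trait Lifecycle {
    fn status(&self, canister: &str) -> Status;
    fn start(&mut self, canister: &str);
}

/// In-memory fake so the sketch runs without a replica.
#[derive(Default)]
struct FakeMgmt(HashMap<String, Status>);

impl Lifecycle for FakeMgmt {
    fn status(&self, canister: &str) -> Status {
        // Unknown canisters are treated as Stopped in this fake.
        self.0.get(canister).copied().unwrap_or(Status::Stopped)
    }

    fn start(&mut self, canister: &str) {
        // Idempotent: starting a Running canister leaves it Running.
        self.0.insert(canister.to_string(), Status::Running);
    }
}

/// `icp deploy` side: it owns the lifecycle, so it starts every sync target
/// after install and before sync, without polling the status first.
fn start_sync_targets<L: Lifecycle>(mgmt: &mut L, sync_targets: &[&str]) {
    for canister in sync_targets {
        mgmt.start(canister);
    }
}

/// `icp sync` side: it does not own the lifecycle, so it only checks and
/// reports. The error wording is illustrative, not the PR's message.
fn precheck_sync<L: Lifecycle>(mgmt: &L, canister: &str) -> Result<(), String> {
    match mgmt.status(canister) {
        Status::Running => Ok(()),
        other => Err(format!(
            "canister {canister} is {other:?}, not Running; start it and rerun sync"
        )),
    }
}
```

Keeping the precheck free of side effects matches the stated intent that `icp sync` never auto-starts a canister the user may have stopped deliberately.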

Mode B — commit eb367aa2:

In icp deploy, after starting the canister and before sync, wait until the query path consistently observes the canister Running. Probe with a query for a method no canister exports and classify the reject: "is stopped/stopping" (IC0508/IC0509) means the replica still lags; any other reject (e.g. "no query method") or a reply means it is serving. Require a few consecutive ready observations, spaced so they may land on different replicas.
Confidence-raising, not a hard guarantee — query reads are per-node and boundary nodes load-balance across replicas, an inherent property of the decentralized read path — but it makes the post-restart race rare.
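
The "few consecutive ready observations" rule can be sketched as a streak counter that resets whenever a probe still sees the canister as stopped. This is a synchronous, std-only illustration under assumptions: `Observation`, `wait_for_consistent_ready`, and its parameters are hypothetical names, the probe itself is passed in as a closure, and the real streak length and spacing used by the PR are not stated in the description.

```rust
use std::thread::sleep;
use std::time::Duration;

/// What one probe observed through the query path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Observation {
    /// A reply, or a reject other than stopped/stopping: the replica is serving.
    Ready,
    /// A stopped/stopping reject (IC0508/IC0509): the replica still lags.
    Lagging,
}

/// Returns true once `required_streak` probes in a row observed `Ready`,
/// or false if `max_attempts` probes were used up first.
fn wait_for_consistent_ready<F>(
    mut probe: F,
    required_streak: u32,
    max_attempts: u32,
    spacing: Duration,
) -> bool
where
    F: FnMut() -> Observation,
{
    let mut streak = 0;
    for _ in 0..max_attempts {
        match probe() {
            Observation::Ready => {
                streak += 1;
                if streak >= required_streak {
                    return true;
                }
            }
            // One lagging replica invalidates the evidence gathered so far.
            Observation::Lagging => streak = 0,
        }
        // Spacing gives later probes a chance to land on a different replica.
        sleep(spacing);
    }
    false
}
```

Because each probe may be routed to a different replica, a streak only raises confidence; it cannot prove that every replica has caught up.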

Tests

deploy_starts_stopped_canister_before_sync — deploy, stop, redeploy; deploy starts the canister before the plugin runs and ends Running with synced data. Reproduces Mode A pre-fix.
sync_aborts_when_canister_not_running — icp sync against a stopped canister aborts with the actionable message; the sync step never runs.

Mode B is a timing/replica-lag race that a single-node local replica doesn't exhibit, so it has no e2e test. cargo build, cargo clippy --tests, cargo fmt --check clean.

Out of scope (follow-ups)

Clearer plugin-side error than raw IC0508.
Whether resolve_install_mode_and_status should repair a Stopping canister.

🤖 Generated with Claude Code

Asset sync makes update calls to the target canister, which fail with IC0508 ("canister is stopped ... does not have a CallContextManager") when the canister is Stopped/Stopping. install_code is status-preserving, so a canister that entered the command non-Running (handed out Stopped from a canister pool, or left Stopped by an earlier interrupted deploy) stays non-Running through install and hits IC0508 in the sync plugin's first call.

icp deploy owns the lifecycle: after install and before sync it now starts every canister it is about to sync (idempotent; start_canister is synchronous per the IC spec, so no status poll is needed) and leaves it Running.

icp sync does not own the lifecycle (the user may have stopped the canister deliberately): it now prechecks the run status up front and aborts with an actionable error instead of letting the plugin fail with a cryptic IC0508.

install.rs is deliberately unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After start_canister the canister is Running in the subnet's certified state, but IC query calls are eventually-consistent reads served by a single replica that may still lag the restart's commit height. The sync plugin's first calls are queries, so right after a restart they can hit a lagging replica and fail with a transient IC0508 ("canister is stopped").

Before handing off to sync, poll each sync-target canister with a query for a method no canister exports, classifying the reject: "is stopped" (IC0508/IC0509) means the replica still lags; any other reject or a reply means it observes the canister Running. Require a few consecutive ready observations. Confidence-raising, not a hard guarantee (query reads are per-node) but it makes the post-restart race rare.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The previous loop capped the attempt count (60) but not total wall-clock: with a 10s per-probe timeout, an unresponsive canister could stall the deploy for ~10.5 minutes before failing. Wrap the poll loop in a single 30s timeout (the hard cap) and drop the per-probe timeout to 2s so retries keep flowing within the budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
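
The arithmetic behind that figure, and the shape of a wall-clock budget, can be sketched as below. The commit message gives 60 attempts and a 10s per-probe timeout, which alone account for 600s; the 0.5s spacing between probes used in the example is an assumption chosen because it reproduces the quoted ~10.5 minutes, not a value stated in the PR. `worst_case_stall` and `poll_with_budget` are hypothetical names, and this std-only version is synchronous, so the probe closure is trusted to honor the timeout it is given.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Worst case for an attempt-capped loop: every probe runs to its timeout,
/// and every attempt is followed by the inter-probe spacing.
fn worst_case_stall(attempts: u32, per_probe_timeout: Duration, spacing: Duration) -> Duration {
    (per_probe_timeout + spacing) * attempts
}

/// Poll until `probe` reports success or the overall wall-clock budget is
/// spent. `probe` receives the timeout it is allowed to use for this attempt.
fn poll_with_budget<F>(
    mut probe: F,
    budget: Duration,
    per_probe_timeout: Duration,
    spacing: Duration,
) -> bool
where
    F: FnMut(Duration) -> bool,
{
    let deadline = Instant::now() + budget;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return false; // hard cap reached
        }
        // A single probe never gets more time than the budget has left.
        if probe(per_probe_timeout.min(remaining)) {
            return true;
        }
        // Do not sleep past the deadline either.
        sleep(spacing.min(deadline.saturating_duration_since(Instant::now())));
    }
}
```

With the assumed spacing, `worst_case_stall(60, 10s, 0.5s)` comes to 630s, that is 10.5 minutes, whereas a budgeted loop gives up once the budget (30s in the commit) is spent, however many probes fit inside it.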

Copilot

Pull request overview

This PR hardens icp deploy/icp sync asset synchronization by ensuring the target canister is Running (and query-serving) before sync operations begin, preventing intermittent IC0508 failures during the sync plugin’s first calls.

Changes:

icp sync: prechecks canister status and aborts early with an actionable error if the canister is not Running.
icp deploy: starts any sync-target canisters post-install and waits until the query path consistently observes them as Running before running sync steps.
Adds e2e regression tests covering deploy-starts-before-sync and sync-aborts-when-stopped.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `crates/icp-cli/src/commands/sync.rs` | Adds a canister-status precheck to abort sync early when the canister isn’t Running. |
| `crates/icp-cli/src/commands/deploy.rs` | Starts sync-target canisters after install and adds a query-path readiness wait to reduce post-restart IC0508 races. |
| `crates/icp-cli/tests/sync_tests.rs` | Adds a test ensuring `icp sync` aborts (and does not run sync steps) when the canister is stopped. |
| `crates/icp-cli/tests/deploy_tests.rs` | Adds a regression test ensuring `icp deploy` starts a stopped canister before running plugin-based sync. |

The probe classified any non-stopped query error as "ready", including transport/HTTP/timeout AgentError variants — contradicting the doc comment and letting deploy proceed to sync on a transient network blip. Treat only a genuine replica reject (e.g. method-not-found) that is not stopped/stopping as ready; every non-reject error is inconclusive and triggers a retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
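
The corrected classification is three-way rather than two-way, which can be sketched as follows. `QueryOutcome`, `Readiness`, and `classify` are hypothetical names introduced for illustration; the real code works with the agent's error type, whose variants are deliberately not reproduced here. Only the IC0508/IC0509 codes come from the PR text.

```rust
/// Simplified view of what a probe query can return (hypothetical type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum QueryOutcome {
    /// The canister answered the query.
    Reply,
    /// A replica processed the call and rejected it.
    Reject { error_code: Option<String> },
    /// The call never produced a replica verdict (transport, HTTP, timeout).
    Transport(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Readiness {
    Ready,
    Lagging,
    Inconclusive,
}

fn classify(outcome: &QueryOutcome) -> Readiness {
    match outcome {
        QueryOutcome::Reply => Readiness::Ready,
        QueryOutcome::Reject { error_code } => match error_code.as_deref() {
            // Stopped/stopping: this replica has not caught up with the restart.
            Some("IC0508") | Some("IC0509") => Readiness::Lagging,
            // Any other genuine reject (e.g. method not found) means the
            // replica sees a Running canister.
            _ => Readiness::Ready,
        },
        // A network blip says nothing about canister state: retry instead.
        QueryOutcome::Transport(_) => Readiness::Inconclusive,
    }
}
```

The point of the third variant is that a failed request carries no evidence in either direction, so it should neither count toward readiness nor be reported as a stopped canister.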

The "Setup image (Linux)" step intermittently hangs for many minutes (observed 45s-7m40s for the same step within a single run). Timestamped logs show individual small-package downloads from azure.archive.ubuntu.com stalling 1-2 minutes each, with apt eventually emitting `Ign:` and retrying the mirror -- flaky Azure-hosted Ubuntu mirror connections, not a declared GitHub/Azure incident.

- Bound each fetch with Acquire::http::Timeout=30 and Acquire::Retries=3 so a stalled connection fails fast and retries instead of hanging.
- Install softhsm2/pipx with --no-install-recommends: those pulled in an unused doc toolchain (mkdocs, sphinx, tornado, livereload, several libjs-*) -- exactly the packages seen stalling. Hard deps still install; python3-venv (needed by pipx) is already on the runner image.

Left the build script's dbus install with recommends intact, since the dbus daemon is used by the keyring tests at runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lwshang and others added 3 commits June 18, 2026 11:31

chore: fmt

110f830

raymondk reviewed Jun 18, 2026

Comment thread crates/icp-cli/src/commands/deploy.rs Outdated

lwshang requested a review from Copilot June 18, 2026 20:02

Copilot started reviewing on behalf of lwshang June 18, 2026 20:02

Copilot AI reviewed Jun 18, 2026

Comment thread crates/icp-cli/src/commands/deploy.rs

lwshang and others added 2 commits June 18, 2026 16:08

lwshang marked this pull request as ready for review June 18, 2026 21:33

lwshang requested a review from a team as a code owner June 18, 2026 21:33

raymondk approved these changes Jun 18, 2026

lwshang merged commit 1662a43 into main Jun 19, 2026
155 of 157 checks passed

lwshang deleted the lwshang/ensure-canister-running-before-sync branch June 19, 2026 00:28

lwshang merged 6 commits into main from lwshang/ensure-canister-running-before-sync
