fix(deploy,sync): ensure canister is Running before asset sync #614
Merged
Asset sync makes update calls to the target canister, which fail with
IC0508 ("canister is stopped ... does not have a CallContextManager")
when the canister is Stopped/Stopping. install_code is status-preserving,
so a canister that entered the command non-Running (handed out Stopped
from a canister pool, or left Stopped by an earlier interrupted deploy)
stays non-Running through install and hits IC0508 in the sync plugin's
first call.
icp deploy owns the lifecycle: after install and before sync it now
starts every canister it is about to sync (idempotent; start_canister is
synchronous per the IC spec, so no status poll is needed) and leaves it
Running.
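The ordering described above can be illustrated with a minimal in-memory sketch. This is not the icp-cli implementation: `MockCanister`, `RunStatus`, and `deploy` are hypothetical names, and the mock models only the two properties this message relies on (install_code preserves the run status; start_canister is idempotent).

```rust
/// Run status of a canister, mirroring the three states named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunStatus {
    Running,
    Stopping,
    Stopped,
}

/// Hypothetical in-memory stand-in for a canister; records the calls it receives.
pub struct MockCanister {
    pub status: RunStatus,
    pub log: Vec<&'static str>,
}

impl MockCanister {
    /// Status-preserving: installing code never changes `status`.
    pub fn install_code(&mut self) {
        self.log.push("install");
    }

    /// Idempotent: starting a Running canister is a no-op on `status`.
    pub fn start_canister(&mut self) {
        self.status = RunStatus::Running;
        self.log.push("start");
    }

    /// Stands in for the sync plugin's first call, which fails unless Running.
    pub fn sync(&mut self) -> Result<(), String> {
        if self.status != RunStatus::Running {
            return Err("IC0508: canister is stopped".to_string());
        }
        self.log.push("sync");
        Ok(())
    }
}

/// Deploy order sketched from the commit message: install, start, then sync.
/// The start is scoped to canisters that actually have sync steps.
pub fn deploy(canister: &mut MockCanister, has_sync_steps: bool) -> Result<(), String> {
    canister.install_code();
    if has_sync_steps {
        canister.start_canister();
        canister.sync()?;
    }
    Ok(())
}
```

In this sketch, a canister that enters `deploy` as Stopped ends as Running with the calls recorded in the order install, start, sync. Removing the `start_canister` call reproduces the IC0508 failure in the mock.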
icp sync does not own the lifecycle (the user may have stopped the
canister deliberately): it now prechecks the run status up front and
aborts with an actionable error instead of letting the plugin fail with
a cryptic IC0508. install.rs is deliberately unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
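The icp sync precheck from the commit message above could look roughly like the following sketch. The function name, the `RunStatus` enum, and the error wording are assumptions made for illustration; the PR does not quote the actual message text.

```rust
/// Run status of a canister, mirroring the three states named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunStatus {
    Running,
    Stopping,
    Stopped,
}

/// Abort before any sync step runs if the canister is not Running.
/// The error text here is illustrative, not the message icp-cli prints.
pub fn precheck_running(canister: &str, status: RunStatus) -> Result<(), String> {
    match status {
        RunStatus::Running => Ok(()),
        other => Err(format!(
            "canister '{canister}' is {other:?}, not Running; start it and rerun sync"
        )),
    }
}
```

The design point is that the precheck only reports: it never starts the canister, because the user may have stopped it deliberately.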
After start_canister the canister is Running in the subnet's certified
state, but IC query calls are eventually-consistent reads served by a
single replica that may still lag the restart's commit height. The sync
plugin's first calls are queries, so right after a restart they can hit a
lagging replica and fail with a transient IC0508 ("canister is stopped").
Before handing off to sync, poll each sync-target canister with a query
for a method no canister exports, classifying the reject: "is stopped"
(IC0508/IC0509) means the replica still lags; any other reject or a reply
means it observes the canister Running. Require a few consecutive ready
observations. This raises confidence rather than giving a hard guarantee
(query reads are per-node), but it makes the post-restart race rare.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
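The "few consecutive ready observations" rule in the commit message above can be sketched as a streak counter. This is a simplified, synchronous illustration: `wait_for_consecutive_ready` is a hypothetical name, the probe is reduced to a closure returning `true` for "ready", and the pause between probes is omitted.

```rust
/// Poll `probe` until it reports ready `required` times in a row,
/// giving up after `max_attempts` probes. Any non-ready observation
/// resets the streak, since the observations must be consecutive.
pub fn wait_for_consecutive_ready<F>(mut probe: F, required: u32, max_attempts: u32) -> bool
where
    F: FnMut() -> bool,
{
    let mut streak = 0u32;
    for _ in 0..max_attempts {
        if probe() {
            streak += 1;
            if streak >= required {
                return true;
            }
        } else {
            // A lagging replica answered: start counting again.
            streak = 0;
        }
    }
    false
}
```

Requiring a streak rather than a single success matters because successive queries may be served by different replicas, so one ready answer says little about the others.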
raymondk
reviewed
Jun 18, 2026
The previous loop capped the attempt count (60) but not total wall-clock time: with a 10s per-probe timeout, an unresponsive canister could stall the deploy for ~10.5 minutes before failing. Wrap the poll loop in a single 30s timeout (the hard cap) and drop the per-probe timeout to 2s so retries keep flowing within the budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
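The arithmetic behind the ~10.5 minute figure, and the shape of a wall-clock budget, can be sketched as follows. Two assumptions are mine: the 0.5s pause between attempts is inferred from the stated total (60 × 10s = 600s, leaving about 30s unaccounted for), and the loop below is a synchronous stand-in written with `std::time`, not the code in deploy.rs. All function names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Worst case when only the attempt count is capped:
/// every probe runs to its timeout and every pause is taken.
pub fn worst_case_wall_clock(attempts: u32, per_probe_timeout: Duration, pause: Duration) -> Duration {
    (per_probe_timeout + pause) * attempts
}

/// Poll under a single overall deadline. `probe` receives the timeout it is
/// allowed to use and returns `true` when done; the hard cap only holds if
/// the probe honours that timeout.
pub fn poll_with_budget<F>(mut probe: F, budget: Duration, per_probe_cap: Duration, pause: Duration) -> bool
where
    F: FnMut(Duration) -> bool,
{
    let deadline = Instant::now() + budget;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return false; // budget exhausted
        }
        // Never give a probe more time than is left in the budget.
        if probe(remaining.min(per_probe_cap)) {
            return true;
        }
        let remaining = deadline.saturating_duration_since(Instant::now());
        std::thread::sleep(pause.min(remaining));
    }
}
```

With the assumed 0.5s pause, `worst_case_wall_clock(60, 10s, 0.5s)` comes to 630s, which matches the ~10.5 minutes quoted above. Under a 30s budget with a 2s per-probe cap, the total is bounded by the deadline regardless of how many attempts fit inside it.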
Pull request overview
This PR hardens icp deploy/icp sync asset synchronization by ensuring the target canister is Running (and query-serving) before sync operations begin, preventing intermittent IC0508 failures during the sync plugin’s first calls.
Changes:
- icp sync: prechecks canister status and aborts early with an actionable error if the canister is not Running.
- icp deploy: starts any sync-target canisters post-install and waits until the query path consistently observes them as Running before running sync steps.
- Adds e2e regression tests covering deploy-starts-before-sync and sync-aborts-when-stopped.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/icp-cli/src/commands/sync.rs | Adds a canister-status precheck to abort sync early when the canister isn’t Running. |
| crates/icp-cli/src/commands/deploy.rs | Starts sync-target canisters after install and adds a query-path readiness wait to reduce post-restart IC0508 races. |
| crates/icp-cli/tests/sync_tests.rs | Adds a test ensuring icp sync aborts (and does not run sync steps) when the canister is stopped. |
| crates/icp-cli/tests/deploy_tests.rs | Adds a regression test ensuring icp deploy starts a stopped canister before running plugin-based sync. |
The probe classified any non-stopped query error as "ready", including transport/HTTP/timeout AgentError variants, which contradicted the doc comment and let deploy proceed to sync on a transient network blip. Treat only a genuine replica reject (e.g. method-not-found) that is not stopped/stopping as ready; every non-reject error is inconclusive and triggers a retry.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
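The corrected classification described above has three outcomes rather than two. The sketch below shows that shape under stated assumptions: `QueryOutcome` is a hypothetical simplification of what a query can return (the real AgentError type has more variants), and matching on the reject text is an illustrative choice, since the PR does not show whether the code matches on message text or on error codes.

```rust
/// Hypothetical, simplified view of a probe query's result.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum QueryOutcome {
    /// The canister answered the query.
    Reply,
    /// The replica rejected the call and gave a reason.
    Reject { message: String },
    /// The request never produced a replica verdict (HTTP/transport failure).
    Transport(String),
    /// The probe timed out.
    Timeout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Readiness {
    /// The replica observes the canister Running.
    Ready,
    /// The replica still observes Stopped/Stopping (IC0508/IC0509).
    Lagging,
    /// No verdict from a replica; retry without concluding anything.
    Inconclusive,
}

pub fn classify(outcome: &QueryOutcome) -> Readiness {
    match outcome {
        QueryOutcome::Reply => Readiness::Ready,
        QueryOutcome::Reject { message } => {
            let m = message.to_ascii_lowercase();
            if m.contains("is stopped") || m.contains("is stopping") {
                Readiness::Lagging
            } else {
                // e.g. method-not-found: the replica ran the call against a Running canister.
                Readiness::Ready
            }
        }
        // Only a genuine replica reject counts as evidence of readiness.
        QueryOutcome::Transport(_) | QueryOutcome::Timeout => Readiness::Inconclusive,
    }
}
```

The bug this commit fixes corresponds to folding the last arm into `Ready`: a dropped connection would then count as proof that the canister is serving.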
The "Setup image (Linux)" step intermittently hangs for many minutes (observed 45s-7m40s for the same step within a single run). Timestamped logs show individual small-package downloads from azure.archive.ubuntu.com stalling 1-2 minutes each, with apt eventually emitting `Ign:` and retrying the mirror -- flaky Azure-hosted Ubuntu mirror connections, not a declared GitHub/Azure incident.
- Bound each fetch with Acquire::http::Timeout=30 and Acquire::Retries=3 so a stalled connection fails fast and retries instead of hanging.
- Install softhsm2/pipx with --no-install-recommends: those pulled in an unused doc toolchain (mkdocs, sphinx, tornado, livereload, several libjs-*) -- exactly the packages seen stalling. Hard deps still install; python3-venv (needed by pipx) is already on the runner image.
Left the build script's dbus install with recommends intact, since the dbus daemon is used by the keyring tests at runtime.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
raymondk
approved these changes
Jun 18, 2026
Problem
icp deploy / icp sync against the asset canister intermittently failed at the sync step, when the sync plugin's first canister call reached the target while it was not serving.
Investigation found two distinct failure modes:
- Mode A: install_code is status-preserving, so a canister that enters the command already Stopped/Stopping (handed out Stopped from a canister pool, or left Stopped by a deploy that died in the stop→install→start window) is installed but never started, and the plugin's first call hits IC0508. Persistent: reruns don't help. (Why developer-docs preview deploys failed repeatedly.)
- Mode B: start_canister only makes the canister Running in the subnet's certified state. IC query calls are eventually-consistent reads served by a single replica that may still lag the restart's commit height; the plugin's first calls are queries, so they can momentarily observe the just-vacated Stopped state. Transient: a rerun "fixes" it. (Why the always-running docs canister failed intermittently.)
Fix
Mode A, commit e0797fe3:
- icp deploy owns the lifecycle (it just installed): after install and before sync it starts every sync-target canister, then leaves it Running. start_canister is idempotent and synchronous per the IC spec, so no status poll is needed. Scoped to canisters with sync steps.
- icp sync does NOT own the lifecycle: it prechecks run status and aborts with an actionable error if not Running, replacing the cryptic IC0508. It does not auto-start.
- install.rs is deliberately unchanged: it stays status-preserving.
Mode B, commit eb367aa2:
- In icp deploy, after starting the canister and before sync, wait until the query path consistently observes the canister Running. Probe with a query for a method no canister exports and classify the reject: "is stopped/stopping" (IC0508/IC0509) means the replica still lags; any other reject (e.g. "no query method") or a reply means it is serving. Require a few consecutive ready observations, spaced so they may land on different replicas.
Tests
- deploy_starts_stopped_canister_before_sync: deploy, stop, redeploy; deploy starts the canister before the plugin runs and ends Running with synced data. Reproduces Mode A pre-fix.
- sync_aborts_when_canister_not_running: icp sync against a stopped canister aborts with the actionable message; the sync step never runs.
Mode B is a timing/replica-lag race that a single-node local replica doesn't exhibit, so it has no e2e test.
cargo build, cargo clippy --tests, cargo fmt --check clean.
Out of scope (follow-ups)
- resolve_install_mode_and_status should repair a Stopping canister.
🤖 Generated with Claude Code