fix(deploy,sync): ensure canister is Running before asset sync #614

Merged
lwshang merged 6 commits into main from lwshang/ensure-canister-running-before-sync
Jun 19, 2026

lwshang (Contributor) commented Jun 18, 2026

Problem

icp deploy/icp sync against the asset canister intermittently failed at the sync step, when the sync plugin's first canister call reached the target while it was not serving:

... Canister <id> is stopped and therefore does not have a CallContextManager, error code Some("IC0508")

Investigation found two distinct failure modes:

  • Mode A — canister genuinely not Running before sync. install_code is status-preserving, so a canister that enters the command already Stopped/Stopping (handed out Stopped from a canister pool, or left Stopped by a deploy that died in the stop→install→start window) is installed but never started, and the plugin's first call hits IC0508. Persistent — reruns don't help. (Why developer-docs preview deploys failed repeatedly.)
  • Mode B — query path transiently stale right after a restart. When the canister was Running, deploy correctly does stop→upgrade→start, but start_canister only makes it Running in the subnet's certified state. IC query calls are eventually-consistent reads served by a single replica that may still lag the restart's commit height; the plugin's first calls are queries, so they can momentarily observe the just-vacated Stopped state. Transient — a rerun "fixes" it. (Why the always-running docs canister failed intermittently.)

Fix

Mode A — commit e0797fe3:

  • icp deploy owns the lifecycle (it just installed): after install and before sync it starts every sync-target canister, then leaves it Running. start_canister is idempotent and synchronous per the IC spec, so no status poll is needed. Scoped to canisters with sync steps.
  • icp sync does NOT own the lifecycle: it prechecks run status and aborts with an actionable error if not Running, replacing the cryptic IC0508. It does not auto-start.
  • install.rs is deliberately unchanged — it stays status-preserving.
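
The split in lifecycle ownership described above can be sketched as follows. This is a minimal model under stated assumptions, not the icp-cli code: `RunStatus`, `Lifecycle`, `FakeLifecycle`, `prepare_sync_for_deploy`, and `precheck_for_sync` are hypothetical names, and the error text is a placeholder rather than the actual message.

```rust
use std::collections::HashMap;

// Hypothetical model: these types and functions are illustrative and are
// not the icp-cli API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunStatus {
    Running,
    Stopping,
    Stopped,
}

/// Minimal stand-in for the management calls the two commands rely on.
pub trait Lifecycle {
    fn status(&self, canister: &str) -> RunStatus;
    /// Assumed idempotent: starting a Running canister changes nothing.
    fn start(&mut self, canister: &str);
}

/// Deploy path: deploy owns the lifecycle, so it starts every sync target
/// unconditionally and does not poll the status afterwards.
pub fn prepare_sync_for_deploy<L: Lifecycle>(lc: &mut L, sync_targets: &[&str]) {
    for canister in sync_targets {
        lc.start(canister);
    }
}

/// Sync path: sync does not own the lifecycle, so it only checks and
/// reports. It never starts the canister on the user's behalf.
pub fn precheck_for_sync<L: Lifecycle>(lc: &L, canister: &str) -> Result<(), String> {
    match lc.status(canister) {
        RunStatus::Running => Ok(()),
        other => Err(format!(
            "canister {canister} is {other:?}, not Running; start it before syncing"
        )),
    }
}

/// In-memory fake used to exercise both paths without a replica.
pub struct FakeLifecycle {
    pub statuses: HashMap<String, RunStatus>,
}

impl Lifecycle for FakeLifecycle {
    fn status(&self, canister: &str) -> RunStatus {
        // Unknown canisters are treated as Stopped in this fake.
        self.statuses.get(canister).copied().unwrap_or(RunStatus::Stopped)
    }

    fn start(&mut self, canister: &str) {
        self.statuses.insert(canister.to_string(), RunStatus::Running);
    }
}
```

With a `FakeLifecycle` holding a Stopped canister, `precheck_for_sync` returns an error, while calling `prepare_sync_for_deploy` first leaves the canister Running so the same precheck passes.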

Mode B — commit eb367aa2:

  • In icp deploy, after starting the canister and before sync, wait until the query path consistently observes the canister Running. Probe with a query for a method no canister exports and classify the reject: "is stopped/stopping" (IC0508/IC0509) means the replica still lags; any other reject (e.g. "no query method") or a reply means it is serving. Require a few consecutive ready observations, spaced so they may land on different replicas.
  • Confidence-raising, not a hard guarantee — query reads are per-node and boundary nodes load-balance across replicas, an inherent property of the decentralized read path — but it makes the post-restart race rare.
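
The "few consecutive ready observations" rule can be sketched as a streak counter. This is an illustrative sketch only: `Observation` and `wait_until_ready` are hypothetical names, the probe is an injected closure standing in for the real query, and the spacing between probes is omitted.

```rust
/// What a single probe concluded. Illustrative type, not from icp-cli.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Observation {
    /// A reply, or a reject unrelated to the canister's run status.
    Ready,
    /// A reject saying the canister is stopped or stopping.
    NotReady,
}

/// Calls `probe` up to `max_attempts` times and returns true as soon as
/// `required` consecutive observations were Ready. Any NotReady observation
/// resets the streak, because it shows that at least one replica still lags.
pub fn wait_until_ready<F>(mut probe: F, required: u32, max_attempts: u32) -> bool
where
    F: FnMut() -> Observation,
{
    let mut streak = 0u32;
    for _ in 0..max_attempts {
        match probe() {
            Observation::Ready => {
                streak += 1;
                if streak >= required {
                    return true;
                }
            }
            Observation::NotReady => streak = 0,
        }
    }
    false
}
```

Requiring a streak rather than a single success is what raises confidence: one Ready answer may come from a replica that has caught up while another has not.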

Tests

  • deploy_starts_stopped_canister_before_sync: deploy, stop, redeploy; deploy starts the canister before the plugin runs and ends Running with synced data. Reproduces Mode A pre-fix.
  • sync_aborts_when_canister_not_running: icp sync against a stopped canister aborts with the actionable message; the sync step never runs.

Mode B is a timing/replica-lag race that a single-node local replica doesn't exhibit, so it has no e2e test. cargo build, cargo clippy --tests, and cargo fmt --check are all clean.

Out of scope (follow-ups)

  • Clearer plugin-side error than raw IC0508.
  • Whether resolve_install_mode_and_status should repair a Stopping canister.

🤖 Generated with Claude Code

lwshang and others added 3 commits June 18, 2026 11:31
Asset sync makes update calls to the target canister, which fail with
IC0508 ("canister is stopped ... does not have a CallContextManager")
when the canister is Stopped/Stopping. install_code is status-preserving,
so a canister that entered the command non-Running (handed out Stopped
from a canister pool, or left Stopped by an earlier interrupted deploy)
stays non-Running through install and hits IC0508 in the sync plugin's
first call.

icp deploy owns the lifecycle: after install and before sync it now
starts every canister it is about to sync (idempotent; start_canister is
synchronous per the IC spec, so no status poll is needed) and leaves it
Running.

icp sync does not own the lifecycle (the user may have stopped the
canister deliberately): it now prechecks the run status up front and
aborts with an actionable error instead of letting the plugin fail with
a cryptic IC0508. install.rs is deliberately unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After start_canister the canister is Running in the subnet's certified
state, but IC query calls are eventually-consistent reads served by a
single replica that may still lag the restart's commit height. The sync
plugin's first calls are queries, so right after a restart they can hit a
lagging replica and fail with a transient IC0508 ("canister is stopped").

Before handing off to sync, poll each sync-target canister with a query
for a method no canister exports, classifying the reject: "is stopped"
(IC0508/IC0509) means the replica still lags; any other reject or a reply
means it observes the canister Running. Require a few consecutive ready
observations. Confidence-raising, not a hard guarantee (query reads are
per-node) but it makes the post-restart race rare.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Comment thread crates/icp-cli/src/commands/deploy.rs Outdated

The previous loop capped the attempt count (60) but not total wall-clock:
with a 10s per-probe timeout, an unresponsive canister could stall the
deploy for ~10.5 minutes before failing. Wrap the poll loop in a single
30s timeout (the hard cap) and drop the per-probe timeout to 2s so
retries keep flowing within the budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
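
The arithmetic behind this change: 60 attempts at up to 10s each is already about 10 minutes before any pause between attempts is counted. A wall-clock budget bounds the total regardless of the attempt count. The sketch below is a synchronous analogue under stated assumptions: `poll_with_budget` is a hypothetical name, the probe is an injected closure that is trusted to honor the timeout it is handed, and the real change may well use an async timeout wrapper instead.

```rust
use std::time::{Duration, Instant};

/// Polls `probe` until it reports ready or `total_budget` of wall-clock time
/// has elapsed. Each probe is handed a timeout that never exceeds what is
/// left of the budget. Returns Ok(attempts) on success, Err(attempts) when
/// the budget ran out.
///
/// Assumption: `probe` honors the timeout it is given. A synchronous loop
/// cannot preempt a probe that ignores it.
pub fn poll_with_budget<F>(
    mut probe: F,
    total_budget: Duration,
    per_probe_timeout: Duration,
    pause: Duration,
) -> Result<u32, u32>
where
    F: FnMut(Duration) -> bool,
{
    let deadline = Instant::now() + total_budget;
    let mut attempts = 0u32;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(attempts);
        }
        attempts += 1;
        // Never let one probe outlive the overall budget.
        if probe(per_probe_timeout.min(remaining)) {
            return Ok(attempts);
        }
        // Pause between attempts, again clipped to the remaining budget.
        let remaining = deadline.saturating_duration_since(Instant::now());
        std::thread::sleep(pause.min(remaining));
    }
}
```

Called with a 30s budget and a 2s per-probe timeout, the loop fails after roughly 30s in the worst case, and a short per-probe timeout leaves room for many retries inside that window.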

Copilot AI left a comment

Pull request overview

This PR hardens icp deploy/icp sync asset synchronization by ensuring the target canister is Running (and query-serving) before sync operations begin, preventing intermittent IC0508 failures during the sync plugin’s first calls.

Changes:

  • icp sync: prechecks canister status and aborts early with an actionable error if the canister is not Running.
  • icp deploy: starts any sync-target canisters post-install and waits until the query path consistently observes them as Running before running sync steps.
  • Adds e2e regression tests covering deploy-starts-before-sync and sync-aborts-when-stopped.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| crates/icp-cli/src/commands/sync.rs | Adds a canister-status precheck to abort sync early when the canister isn’t Running. |
| crates/icp-cli/src/commands/deploy.rs | Starts sync-target canisters after install and adds a query-path readiness wait to reduce post-restart IC0508 races. |
| crates/icp-cli/tests/sync_tests.rs | Adds a test ensuring icp sync aborts (and does not run sync steps) when the canister is stopped. |
| crates/icp-cli/tests/deploy_tests.rs | Adds a regression test ensuring icp deploy starts a stopped canister before running plugin-based sync. |

Comment thread crates/icp-cli/src/commands/deploy.rs
lwshang and others added 2 commits June 18, 2026 16:08
The probe classified any non-stopped query error as "ready", including
transport/HTTP/timeout AgentError variants — contradicting the doc
comment and letting deploy proceed to sync on a transient network blip.
Treat only a genuine replica reject (e.g. method-not-found) that is not
stopped/stopping as ready; every non-reject error is inconclusive and
triggers a retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
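
The corrected classification has three outcomes rather than two. The sketch below illustrates it with a simplified error model: `ProbeError`, `Verdict`, and `classify_probe` are hypothetical names, the real agent error type has more variants, and treating a reject that carries no error code as ready is an assumption of this sketch.

```rust
/// Simplified error model for a probe query. Illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProbeError {
    /// The replica processed the query and rejected it.
    Reject { error_code: Option<String> },
    /// Transport, HTTP, or timeout failure: the replica's view is unknown.
    Transport(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    /// The replica answered and does not consider the canister stopped.
    Ready,
    /// The replica still observes the canister as stopped or stopping.
    Lagging,
    /// Nothing was learned about the replica's view, so retry.
    Inconclusive,
}

pub fn classify_probe(result: &Result<Vec<u8>, ProbeError>) -> Verdict {
    match result {
        // A reply means the canister is serving.
        Ok(_) => Verdict::Ready,
        Err(ProbeError::Reject { error_code }) => match error_code.as_deref() {
            // IC0508 / IC0509: canister is stopped / stopping.
            Some("IC0508") | Some("IC0509") => Verdict::Lagging,
            // Any other genuine reject (such as a missing method) still
            // proves the replica reached a Running canister.
            _ => Verdict::Ready,
        },
        // A network blip says nothing about readiness.
        Err(ProbeError::Transport(_)) => Verdict::Inconclusive,
    }
}
```

Keeping `Inconclusive` separate from `Ready` is the point of the fix: a timeout or HTTP failure no longer counts as evidence that the canister is serving.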

The "Setup image (Linux)" step intermittently hangs for many minutes
(observed 45s-7m40s for the same step within a single run). Timestamped
logs show individual small-package downloads from azure.archive.ubuntu.com
stalling 1-2 minutes each, with apt eventually emitting `Ign:` and retrying
the mirror -- flaky Azure-hosted Ubuntu mirror connections, not a declared
GitHub/Azure incident.

- Bound each fetch with Acquire::http::Timeout=30 and Acquire::Retries=3 so
  a stalled connection fails fast and retries instead of hanging.
- Install softhsm2/pipx with --no-install-recommends: those pulled in an
  unused doc toolchain (mkdocs, sphinx, tornado, livereload, several
  libjs-*) -- exactly the packages seen stalling. Hard deps still install;
  python3-venv (needed by pipx) is already on the runner image.

Left the build script's dbus install with recommends intact, since the
dbus daemon is used by the keyring tests at runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lwshang marked this pull request as ready for review June 18, 2026 21:33
lwshang requested a review from a team as a code owner June 18, 2026 21:33
lwshang merged commit 1662a43 into main Jun 19, 2026
155 of 157 checks passed
lwshang deleted the lwshang/ensure-canister-running-before-sync branch June 19, 2026 00:28

3 participants