Skip to content

fix(import): assign imported session to current project and directory#15826

Closed
Juhwa-Lee1023 wants to merge 3 commits into
anomalyco:devfrom
Juhwa-Lee1023:fix/import-session-project-assignment
Closed

fix(import): assign imported session to current project and directory#15826
Juhwa-Lee1023 wants to merge 3 commits into
anomalyco:devfrom
Juhwa-Lee1023:fix/import-session-project-assignment

Conversation

@Juhwa-Lee1023
Copy link
Copy Markdown

@Juhwa-Lee1023 Juhwa-Lee1023 commented Mar 3, 2026

Issue for this PR

Closes #15797

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

opencode import runs inside bootstrap(process.cwd(), ...), which sets up the Instance context for the current directory. However, the imported session row was built entirely from the exported session data via Session.toRow(exportData.info), which preserves the original session's directory. That means the session ends up pointing at wherever it was exported from, not the current project.

Two fixes in one line each:

  1. Override directory with Instance.worktree (the resolved worktree of the current project) so the imported session is associated with the correct project directory.
  2. Add directory to the onConflictDoUpdate set so that re-importing an already-present session also updates its directory, not just project_id.

How did you verify your code works?

  • Ran the existing unit tests: bun test test/cli/import.test.ts → 4 pass, 0 fail.
  • Traced the call path: bootstrap(process.cwd(), cb)Instance.provide({ directory: process.cwd(), ... }) → inside cb, Instance.worktree resolves to the current project's worktree. Confirmed Instance.worktree is the correct field to use (it is what Session.create uses when creating new sessions in the same context).

Screenshots / recordings

N/A — no UI changes.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

When importing a session, override both project_id and directory with
the current project context (Instance.project.id and Instance.worktree)
instead of preserving the original session's project binding.

Also include directory in onConflictDoUpdate so that re-importing a
session correctly updates the directory to the current project.

Fixes anomalyco#15797
@github-actions github-actions Bot added needs:compliance This means the issue will auto-close after 2 hours. and removed needs:compliance This means the issue will auto-close after 2 hours. labels Mar 3, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 3, 2026

Thanks for updating your PR! It now meets our contributing guidelines. 👍

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Closing this pull request because it has had no updates for more than 60 days. If you plan to continue working on it, feel free to reopen or open a new PR.

@github-actions github-actions Bot closed this May 5, 2026
Acksell added a commit to Acksell/clank that referenced this pull request May 14, 2026
* sync: session export/import substrate (push leg)

Adds the building blocks for syncing opencode sessions alongside code
checkpoints during `clank push --migrate`. Sessions are stored as
opaque opencode export blobs in the same S3 prefix as the code
bundles; one ownership-flip atomically covers both legs.

Empirically pinned `opencode import` semantics first (Step 0 in
TestOpenCodeImportSemantics) — confirmed §E path (a): import
preserves info.id and additive-merges messages by ID, so the
RegisterImportedSession upsert can key on host ULID + external_id.

New primitives:
- pkg/sync/checkpoint/sessionmanifest.go: SessionManifest schema +
  versioning, mirrors Manifest's discipline.
- pkg/sync/storage/sessions.go: KeyForSession +
  BlobSessionManifest, sibling layout under checkpoints/.
- internal/agent/opencode_export.go: thin wrappers around
  `opencode export` / `import` subprocesses with ANSI-stripped
  stdout parsing for the imported session ID.
- internal/host/sessions_export.go: Service.ExportSessions —
  enumerate sessions for a worktree, abort busy ones immediately
  (per user UX call: no idle wait), invoke opencode export per
  session. Claude-code sessions are skipped with a warning.
- internal/host/sessions_import.go: Service.RegisterImportedSession
  — idempotent upsert keyed on host ULID + worktreeID.
- internal/host/mux/sessions_sync.go: /sync/sessions/{build,upload,
  delete,apply-from-urls} handlers mirroring the existing code
  trio.
- pkg/sync/sessions.go: Server.PresignSessionPuts +
  DownloadSessionURLs.
- pkg/gateway/migrate_sessions.go: gateway orchestration
  (pushSessionsToSprite) — fetches the session manifest inline,
  mints per-session GET URLs, POSTs to the sprite's
  /sync/sessions/apply-from-urls. Wired into handleMigrateWorktree
  between code apply and TransferOwnership.

What's NOT in this commit (follow-up):
- Laptop-side push.go orchestration of session build + upload
  (gateway-facing HTTP route on the sync server + daemonclient/
  syncclient methods).
- Symmetric pull-back session leg (migrate_back.go +
  daemonclient.MaterializeResponse session URL fields + pull.go
  apply).
- The inject-"please continue" message after migrate (v1.1).
- Auto-sync on regular `clank push` (v1.1).
- Claude-code backend session sync (future).

Tests:
- internal/agent: TestOpenCodeImportSemantics (Step-0 diagnostic),
  TestOpenCodeExportImport_RoundTrip.
- internal/host: TestExportSessions_* (skip-claude, empty
  worktree, happy-path with real opencode binary),
  TestRegisterImportedSession_RoundTrip.
- internal/host/mux: TestSyncSessions{Build,Apply}_* input
  validation, malformed manifest rejection.
- pkg/sync: TestPresignSessionPuts_*, TestDownloadSessionURLs_*.
- pkg/sync/checkpoint: TestSessionManifest_* round-trip + version
  rejection.
- pkg/sync/storage: TestKeyForSession_* (valid, path-escape,
  cross-tenant isolation, shares-checkpoint-prefix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sync: wire session leg into clank push --migrate

End-to-end push flow now ships session blobs alongside code:

  laptop                       gateway                       sprite
  ──────                       ───────                       ──────
  clank push --migrate
   ├─ PushCheckpoint           (code → S3 via presign)
   ├─ pushSessionLeg
   │   ├─ BuildSessionCheckpoint  →  clank-host quiesces + exports
   │   ├─ RequestSessionUploadURLs   →  /v1/checkpoints/{id}/sessions
   │   └─ UploadSessionCheckpoint    →  clank-host PUTs blobs to S3
   └─ MigrateWorktree         →  handleMigrateWorktree
                                  ├─ pushCheckpointToSprite (code apply)
                                  ├─ pushSessionsToSprite   →  /sync/sessions/apply-from-urls
                                  │                            (sprite imports via RegisterImportedSession)
                                  └─ TransferOwnership      (atomic flip)

New pieces:
- pkg/sync/sessions_handler.go: POST /v1/checkpoints/{id}/sessions
  HTTP route on the sync server (tenancy + ownership gated).
- pkg/sync/client/sessions.go: Client.RequestSessionUploadURLs.
- internal/daemonclient/sessions_sync.go: BuildSessionCheckpoint +
  UploadSessionCheckpoint targeting clank-host via the local
  clankd proxy.
- internal/cli/clankcli/push_sessions.go: pushSessionLeg helper
  that orchestrates the three-step flow between PushCheckpoint
  and MigrateWorktree. Surfaces interrupted sessions and skipped
  Claude sessions to the user.

Tests:
- pkg/sync: TestSessionPresignHandler_{HappyPath,EmptySessions,
  UnknownCheckpointReturns404,WrongTenantForbidden}.

Still follow-up:
- Symmetric pull-back session leg (materialize → apply on the
  laptop side; needs MaterializeResponse field expansion +
  pull.go integration).
- The inject-"please continue" message after migration (v1.1).
- Auto-sync on regular push (v1.1).
- Claude-code backend session sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gateway: narrow laptop-gateway sync-route block to code leg only

The proxy guard in pkg/gateway/proxy.go was blocking the entire
/sync/* namespace on laptop gateways (when cfg.Sync == nil),
including the new /sync/sessions/* family from the session-sync
substrate. Result: `clank push --migrate` failed with a 404 from
the gateway itself before any of the new code ran.

The original block exists to keep arbitrary socket-holding clients
from writing into ~/work/<repo>/ via /sync/build, /sync/builds/*,
/sync/apply-from-urls. /sync/sessions/* does not touch ~/work/ —
its handlers call into the host Service to drive opencode export
/import + host.db upserts, confined to ~/.local/share/opencode/
and the host DB. Same socket-access requirement, different blast
radius.

Carve out /sync/sessions/* via isBlockedCodeSyncRoute. The
existing TestProxy_BlocksSyncPathsWhenSyncNil is extended to
cover /sync/build and /sync/builds/{id}/upload too, and a new
TestProxy_AllowsSessionSyncOnLaptopGateway pins the carve-out
end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: don't chdir into the source's projectDir on session import

RegisterImportedSession was passing entry.ProjectDir (the SOURCE
host's local path) as cmd.Dir to `opencode import` on the
DESTINATION. On the sprite this manifested as:

  apply sessions to sprite: sprite sessions/apply-from-urls
  (500, code=session_import_failed): register session 01K…:
  opencode import /tmp/clank-session-import-…/01K….json:
  chdir /Users/axelengstrom/github.com/acksell/mindmouth:
  no such file or directory

`opencode import` reads/writes its own storage (HOME-relative) and
ignores cwd — confirmed by TestOpenCodeImportSemantics. The
projectDir pass-through was speculative "future-proofing" with no
real benefit; remove it on the import path.

Adds TestRegisterImportedSession_IgnoresSourceProjectDir to pin the
regression: the destination MUST succeed even when entry.ProjectDir
points at a path that doesn't exist locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: rewrite info.directory in opencode blob to destination path

Imported sessions on a sprite were landing under opencode's
synthetic 'global' project because info.directory in the export
blob held the SOURCE host's local path (e.g. /Users/foo/repo on
the laptop), which doesn't resolve on the sprite.

Add a small rewrite shim in RegisterImportedSession: read the
blob, replace info.directory with workRoot/<worktreeID> and
clear info.projectID so opencode rederives it on import. Then
pass the rewritten file to opencode import.

This is a workaround for upstream opencode not supporting a
target-directory or path-rebase flag on import. Once that lands,
both the rewrite and its regression test can be removed in
favor of passing the flag through:
  - anomalyco/opencode#15797
  - anomalyco/opencode#15826

Tests:
  - TestRegisterImportedSession_RewritesDirectoryToDestination
    seeds a blob with a deliberately-fake source directory +
    projectID, imports via RegisterImportedSession, re-exports
    the imported session, and asserts (a) info.directory matches
    the destination's workRoot/<worktreeID>, (b) info.projectID
    is no longer the source's hash (proxy for opencode having
    rederived it).
  - buildSyntheticOCBlob is refactored to delegate to a new
    parameterized buildOCBlobWithDir so the new test can vary
    directory + projectID independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* store: host.db time columns → INTEGER unix millis (modernc bug fix)

modernc.org/sqlite v1.x serialises time.Time parameters via
t.String() (including the monotonic-clock suffix m=+... for
time.Now() values), then can't parse the result back. Reproduced
empirically on v1.50.1 (latest as of this commit). Symptom in
production:

  daemon: list sessions: sql: Scan error on column index 14,
  name "created_at": unsupported Scan, storing driver.Value
  type string into type *time.Time

The bug fires whenever any row written via time.Now() is read
back through ListSessions / GetSession. Some rows escape because
.UTC(), .Round(0), or JSON round-trips happen to strip monotonic.
This is why it's intermittent — every dataset has a mix of
parseable and unparseable timestamps.

Fix: store all time columns as INTEGER unix milliseconds, with
boundary conversion in store/sessions.go. modernc never gets to
"interpret" a time.Time at all, so the entire round-trip class
is gone.

Scope: host.db only (sessions + primary_agents). The provisioner-
side clank.db (hosts/worktrees/checkpoints) almost certainly has
the same problem and warrants its own migration, but it's not
firing today and the migration is its own concern.

Schema migration v3:
  - Build sessions_v3 / primary_agents_v3 with INTEGER time cols.
  - Read every row from the legacy table, scanning time columns
    as raw strings (CAST AS TEXT, to avoid the very bug we're
    fixing).
  - parseLegacyTimeMs (timeparse.go) converts each string to
    millis. Covers the four formats observed on real sprite DBs:
      * Go String() + monotonic:  "... +0000 UTC m=+..."
      * Go String() UTC named:    "... +0000 UTC"
      * Go String() fixed offset: "... +0200 +0200"
      * RFC3339Nano (JSON-derived)
    Last-ditch fallback: extract millis from the row's ULID
    (which encodes them directly), so a single unparseable row
    never blocks startup.
  - Drop legacy table, rename _v3 to the original name, recreate
    indexes. Whole thing runs in a transaction.

Boundary conversion at the store layer:
  - UpsertSession: time.Time → timeToMs(t) → int64
  - sessionFromRow: int64 → msToTime(ms) → time.Time (UTC)
  - SearchSessions: Since/Until filters pass int64 millis
  - LastReadAt: sql.NullInt64 mirrors the nullable column
  - UpsertPrimaryAgents: time.Now().UnixMilli()

Trade-off: lose sub-millisecond precision. For our use case (UI
sort order, inbox display, sweep timestamps) this is invisible.
Pinned by TestUpsertSession_TimeNowRoundTripsLosslessly, which
asserts at ms precision exactly because that's the new contract.

Tests:
  - timeparse_test.go: every observed legacy format parses to
    the expected millis; ULID fallback is deterministic; bogus
    input + bogus ULID falls back to time.Now().
  - TestUpsertSession_TimeNowRoundTripsLosslessly: the explicit
    regression test for the original symptom — time.Now() is
    inserted, ListSessions reads it back without driver errors.
  - TestOpen_MigrationV2ToV3_BackfillsTimeFormats: hand-crafts
    a v2 DB with one row per observed format, opens via store.
    Open (triggering v3 migration), asserts all six rows survive
    with the correct UnixMilli.
  - TestOpen_MigrationV2ToV3_UnparseableTimeFallsBackToULID:
    proves a single corrupt row doesn't block startup — its
    timestamp gets backfilled from the ULID prefix.
  - The existing persistence_e2e test that compared UpdatedAt
    via time.Equal is loosened to ms precision (the test's
    intent — "sweep didn't bump UpdatedAt" — is preserved).

Future direction: the user has expressed interest in migrating
the entire host store to entgo.io/ent, which handles all of this
natively. That's a separate, larger lift; this commit unblocks
the immediate failure mode without committing to that path yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sync: wire session leg into clank pull --migrate

Symmetric to the push side. After the gateway materializes the
sprite's code checkpoint, it now also asks the sprite to export
its sessions, mints presigned PUT URLs from the sync server, and
has the sprite upload session blobs + the session manifest to S3.
The materialize response gains session_manifest_url and
session_blob_urls; the laptop's pull.go forwards those to its own
clank-host's /sync/sessions/apply-from-urls before committing the
ownership flip.

Materialize flow (additions in CAPS):
  1. Sprite builds code bundles to local disk.
  2. Gateway creates checkpoint row + mints code PUT URLs.
  3. Sprite uploads code bundles via PUT URLs.
  4. Gateway commits the checkpoint.
  5. SPRITE BUILDS SESSION BLOBS TO LOCAL DISK.
  6. GATEWAY MINTS SESSION PUT URLs VIA PresignSessionPuts.
  7. SPRITE UPLOADS SESSION BLOBS + MANIFEST VIA PUT URLs.
  8. Gateway mints GET URLs (code + session) for the laptop.
  9. Return materializeResponse with everything.

Laptop flow (additions in CAPS):
  1. Materialize.
  2. Download + apply code checkpoint locally.
  3. CALL LOCAL CLANK-HOST /sync/sessions/apply-from-urls WITH
     THE SESSION URLs FROM THE RESPONSE.
  4. Commit ownership transfer.

If step 3 fails, commit is skipped — the worktree stays
sprite-owned so the user can retry without a half-migrated state
(code applied locally, sessions still on the sprite).

Empty session manifest (sprite had no opencode sessions in the
worktree) is a silent no-op throughout — no per-session PUT/GET
URLs are minted, and the laptop's apply-from-urls call is skipped
entirely via the empty-URL fast path in ApplySessionCheckpoint.

Files:
  - pkg/gateway/migrate_back.go: adds triggerSpriteSessionBuild,
    triggerSpriteSessionUpload, deleteSpriteSessionBuild (mirror
    the existing code-leg helpers); extends materializeResponse
    with session URL fields; threads the session leg through
    handleMigrateMaterialize between the code commit and the
    GET-URL mint.
  - pkg/gateway/migrate_test.go: extends the inline sprite stub
    with no-op /sync/sessions/{build,builds/{id}/upload,
    builds/{id}} handlers so TestMigrate_TwoPhaseRoundTrip's
    empty-sessions path passes.
  - internal/daemonclient/migrate.go: extends MaterializeResponse
    with SessionManifestURL + SessionBlobURLs (omitempty).
  - internal/daemonclient/sessions_sync.go: adds
    ApplySessionCheckpoint, a thin wrapper around clank-host's
    /sync/sessions/apply-from-urls. Empty manifest URL is a
    silent no-op so the pull-side caller doesn't need a special
    case.
  - internal/cli/clankcli/pull.go: after applyRemoteCheckpoint,
    invoke ApplySessionCheckpoint on a NewLocalClient before
    CommitMigration. Logs "imported N session(s) locally".

Tests:
  - TestApplySessionCheckpoint_EmptyManifestURLIsNoOp pins the
    fast-path that makes pull --migrate keep working when the
    sprite had no sessions.
  - TestApplySessionCheckpoint_ForwardsBodyToDaemon pins the
    wire shape against drift between the daemonclient wrapper
    and the host mux handler.
  - TestApplySessionCheckpoint_PropagatesDaemonError ensures a
    daemon-side failure aborts the migration before the
    ownership commit.
  - TestApplySessionCheckpoint_RejectsEmptyWorktreeID covers
    input validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: don't chdir into a stale LocalPath on session export

Mirror of the import-side fix in commit bd49c8b. ExportSessions
was passing info.GitRef.LocalPath to `opencode export` as cmd.Dir.
That field can hold a path that doesn't exist on the current host
— e.g. a session imported earlier from a laptop has the laptop's
filesystem path baked into project_dir. chdir fails before
opencode runs, surfaced on pull --migrate as:

  sprite session build: sprite sessions/build 500:
  {"code":"session_export_failed","error":"export sessions: 01K…:
   opencode export ses_…: chdir /Users/…/repo:
   no such file or directory"}

`opencode export` doesn't need cwd (verified by
TestOpenCodeImportSemantics), so passing "" is correct AND robust.

Regression: TestExportSessions_IgnoresStaleLocalPath stamps
host.db with a definitely-nonexistent SOURCE-style LocalPath,
then asserts ExportSessions still succeeds.

This pair (this commit + bd49c8b) closes the chdir-on-cross-
machine-path footgun in both directions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* daemonclient: stop silently truncating migrate responses past 4 KiB

MaterializeMigration / CommitMigration / MigrateWorktree all read
the response via:

  io.ReadAll(io.LimitReader(resp.Body, 4<<10))

io.ReadAll on a saturated LimitReader returns happily — there's no
"hit the cap" signal — so any response over 4 KiB gets silently
truncated and json.Unmarshal then fails with the unhelpful:

  decode materialize response: unexpected end of JSON input

Hit in production by `clank pull --migrate` after the session-leg
work landed: each session adds a presigned S3 URL (~600 bytes
signed) to materializeResponse.SessionBlobURLs, so a worktree with
~5 sessions easily clears 4 KiB.

Fixes:
  - Bump cap to 1 MiB. Generous; a worktree would need hundreds
    of sessions to approach it.
  - Replace io.ReadAll(LimitReader(N)) with readAllCapped(N) that
    reads N+1 bytes and ERRORS when actual size > N. Future
    overflows surface loudly with "exceeds N-byte cap (would
    truncate)" instead of corrupting JSON parsing.
  - Apply uniformly to all three migrate calls (CommitMigration
    and MigrateWorktree have small responses today but the
    silent-failure mode shouldn't be a footgun anywhere).

Tests:
  - TestMaterializeMigration_LargeResponseDoesNotSilentlyTruncate
    seeds an 8 KiB+ response (well past the old 4 KiB cap, well
    under the new 1 MiB cap) and asserts every field round-trips
    cleanly. Tightly-coupled to the regression — flips red the
    moment anyone re-introduces a sub-cap LimitReader.
  - TestMaterializeMigration_OversizedResponseErrorsLoudly streams
    1.25 MiB of bytes back and asserts the loud-error path fires
    instead of silent corruption. Documents the new contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: coerce undefined diff before/after to "" before opencode import

Second instance of opencode export emitting values its own import
validator rejects. Surfaced via pull --migrate on a real sprite
session:

  Invalid input: expected string, received undefined
  path: ["summary", "diffs", 0, "before"]
  path: ["summary", "diffs", 0, "after"]

Inside messages[*].info.summary.diffs[*], opencode export
sometimes leaves before/after undefined while file/additions/
deletions are populated. The import validator then refuses the
whole blob — opencode never even starts writing the session,
and our daemon's RegisterImportedSession surfaces the Zod error
verbatim.

Extend the rewriteImportBlob shim (renamed from
rewriteImportBlobDir) to walk every diff entry and coerce
non-string before/after to "". Diff entries that already have
strings are untouched.

Renamed because the helper now applies multiple opencode
export/import workarounds; the call site in
RegisterImportedSession enumerates them. Each workaround can be
deleted independently when opencode tightens its export schema
or relaxes its import validator. No upstream issue identified
yet for the diff-undefined case — would be worth filing.

Tests:
  - TestRegisterImportedSession_RepairsUndefinedDiffStrings
    seeds a blob with the production shape (file/additions/
    deletions set, before/after absent) and asserts import
    succeeds. Tightly mirrors the failing payload — flips red
    the moment the shim regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: revert diff-coercion shim, keep only the directory rebase

The diff-before/after coercion in rewriteImportBlob was paving
over an opencode export/import schema asymmetry: opencode export
emits messages[*].info.summary.diffs[*].{before,after} as
undefined, opencode import then rejects undefined where it
expects strings. Coercing to "" satisfied the validator but
misrepresented the data — we don't actually know what before/
after were on the source, and "" is a lie.

The directory rebase, by contrast, is justified: info.directory
in the source's blob is a filesystem path that genuinely doesn't
exist on the destination. Without rebasing, opencode files the
imported session under a synthetic "global" project. That shim
stays.

Going forward, validator failures from `opencode import` surface
verbatim. Future work in handleSyncSessionsApplyFromURLs can
choose to skip-and-warn rather than abort the whole migration —
that's the architecturally correct response, not "fabricate the
field opencode wanted to see".

Reverts the shim + regression test added in 73f4102. The
TestRegisterImportedSession_RewritesDirectoryToDestination case
covers the remaining (justified) blob rewrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* host: skip-and-warn on orphan host.db rows during export

Service.ExportSessions used to abort the entire migration on the
first 'opencode export <external_id>' failure, including the
'Session not found' case where opencode's storage no longer holds
a session that host.db still references. That made one orphan row
fatal to push --migrate / pull --migrate.

How an orphan happens: the user (or any tool) runs 'opencode
session delete' directly. opencode prunes its storage; clank-host
keeps watching its own host.db; the next migration hits the dead
external_id and the whole flow dies at the first row.

Surfaced today after my own diagnostic exec on a sprite
accidentally deleted a session; the user's next pull --migrate
failed with:

    sprite session build: sprite sessions/build 500:
    {"code":"session_export_failed","error":"… Session not found:
    ses_1dd78ccccffeN4I5i9kcgvmAkh"}

Fix: detect 'Session not found' from opencode's stderr and treat
it as a SkippedSession with a clear reason ("opencode storage
missing this session (host.db orphan; was it deleted via opencode
CLI?)") instead of returning an error from ExportSessions. Other
opencode-export failures (chdir errors, JSON corruption, etc.)
still abort the migration loudly — the skip is narrowly scoped to
"the row is provably orphaned".

Substring-matches on opencode's English error message because the
opencode CLI doesn't surface error codes to subprocess callers.
Documented; if opencode rephrases, tests catch it and we update
the matcher.

We deliberately do NOT auto-delete the host.db row. Leaving it in
place lets the user see what was lost (via the Skipped slice
surfaced through the CLI) before deciding whether to manually
clean up. Auto-self-heal is a reasonable follow-up but it's its
own design choice that deserves its own PR.

Test: TestExportSessions_SkipsMissingOpencodeSession seeds two
host.db rows — one mapping to a real opencode session, one with
a fabricated external_id opencode has never seen — and asserts
ExportSessions returns 1 entry + 1 Skipped, no error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* clankcli: refuse migrations on incompatible opencode versions

Production case that motivated this: laptop on opencode 1.3.15,
sprite on 1.14.48. The 1.14 export schema replaced before/after
diff strings with patch/status — a forward-incompatible schema
change between minor versions. clank pull --migrate would happily
upload code, mint URLs, materialize on the laptop, then fail at
opencode import with an opaque Zod validator error and a
half-migrated state.

This commit catches the mismatch before any work happens.

Policy (in agent.AssertOpencodeVersionsCompatible):
  - exact match              → silent OK
  - patch-only diff          → log a one-line warning, proceed
  - minor or major diff      → refuse with upgrade hint
  - either side empty/bogus  → refuse (can't reason about compat)

Architecture:
  - agent.OpenCodeVersion: runs `opencode --version` on the
    local host, returns the bare version string.
  - clank-host mux gains GET /opencode-version, returning
    {"version": "1.14.48"}. Reachable through both the local
    clankd proxy AND the cloud gateway proxy without any
    new firewall changes (it's not under /sync/*).
  - daemonclient.OpenCodeVersion: thin wrapper around the new
    endpoint. Symmetric — same client targets both ends.
  - clankcli.assertOpencodeCompatible: queries both versions,
    runs the policy check, formats the user-facing error with
    upgrade instructions.
  - push.go and pull.go call it ONCE up-front (before
    PushCheckpoint / MaterializeMigration), so we fail before
    any expensive S3 / sprite work. Fails with:

      opencode version mismatch (local=1.3.15, remote=1.14.48):
      minor version differs — opencode has shipped breaking
      schema changes between minors before; upgrade one side
      to match the other before retrying

      Upgrade either side to match the other:
        laptop:  `opencode upgrade`
        remote:  the sprite picks up a fresh opencode whenever
                 it's reprovisioned; ask your admin or rebuild
                 the sprite image

Tests:
  - TestAssertOpencodeVersionsCompatible: 9-case table covering
    exact match, patch drift, minor drift (the production
    repro), major drift, empty/garbage inputs, and the
    two-segment "1.14" parses-as-(1,14,0) edge case.
  - TestOpencodeVersionEndpoint_ReturnsBareVersion: smoke-tests
    the mux handler against a real opencode binary.

Refactor: pull.go now constructs localCli once at the top and
reuses it both for the version check and the later
ApplySessionCheckpoint call (was a redundant NewLocalClient
inside the session-apply block).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flyio: pin opencode version on the sprite to agent.PinnedOpencodeVersion

Closes the door on the version-skew bug class at the source. The
sprite-side install used to do `bun install -g opencode-ai`, which
always picks up the latest npm version — meaning every fresh
sprite gets whatever opencode happened to be released the day it
was provisioned. Combined with laptops that drift behind on
manual installs, you get the 1.3 ↔ 1.14 schema mismatch we lived
through.

This commit makes the sprite always run a deliberately-chosen
opencode version (currently 1.14.49, defined in
agent.PinnedOpencodeVersion), with three layered defenses:

  1. Probe-and-upgrade (not just probe-and-skip).
     ensureOpenCodeInstalled now captures the installed version
     and only fast-paths if it exactly matches the pin. A drift
     (older opencode lingering on a long-lived sprite) triggers
     reinstall instead of being silently accepted.

  2. Install at the pinned version explicitly.
     `bun install -g opencode-ai@<pinned>` (or npm fallback). The
     pin is interpolated via __PINNED_VERSION__ template
     substitution, gated by validatePinForShell which rejects
     anything outside [A-Za-z0-9.-] — defense against a future
     pin constant that would shell-inject.

  3. Post-install version verification.
     The script's tail now compares `opencode --version` to the
     pin and exits 1 if they disagree. Catches the (rare but
     possible) case where bun's registry routing serves a
     different version than requested. Without this, we'd
     silently lie to the caller about install success.

Companion change in clankcli: when the laptop-side version-skew
check refuses a migration (commit 7768138), the error message
now suggests the EXACT pinned version to upgrade to, not just
"match the other side":

    clank pins opencode at version 1.14.49. Upgrade your laptop
    to match:
      opencode upgrade --version 1.14.49

Why pinning at the clank level + runtime check is belt-and-
suspenders, not redundancy:
  - Pinning makes the common case impossible (sprite always at
    the right version on next EnsureHost).
  - The runtime check catches what pinning can't: laptops are
    outside clank's control. A user on a 5-year-old opencode
    that ignores `opencode upgrade` would still be caught.
  - Together they give "boring, predictable opencode versions
    everywhere" as the default state.

Bumping the pin is a deliberate, reviewable change:
  1. Update agent.PinnedOpencodeVersion.
  2. `make install` — laptop binaries learn the new pin.
  3. Sprites probe-and-reinstall on next EnsureHost (~30-90s
     one-shot cost per sprite).
  4. Laptop users `opencode upgrade --version <new>` — runtime
     check refuses migrations until they do.

Tests:
  - TestValidatePinForShell_Allows / _Rejects: shell-injection
    boundary (only [A-Za-z0-9.-] passes).
  - TestPinnedOpencodeVersion_PassesShellSafetyCheck: meta-test
    that the constant clank actually ships with isn't shaped in
    a way the safety check rejects.
  - TestOpencodeInstallScript_HasPinnedSubstitution: regression
    against a future refactor that removes the
    __PINNED_VERSION__ placeholder while leaving the
    substitution call (which would silently install
    opencode-ai@__PINNED_VERSION__, a nonexistent package).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flyio: drop the validatePinForShell theatre

The pin is a Go constant compiled into clank's binary. It's not
configurable at runtime, not read from a file, not derived from
user input. Defending it against shell-injection assumes a threat
actor that doesn't exist: any clank developer who wanted to run
arbitrary shell on a sprite has direct access to the install
script three lines above.

Delete validatePinForShell, the call to it in
ensureOpenCodeInstalled, and the entire opencode_pin_test.go
file (-108 LOC).

Fat-finger protection comes for free: a malformed pin produces
`bun install -g opencode-ai@<bogus>` which npm rejects loudly
("no such version"), failing the install before anything
dangerous executes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* clankcli: --timing flag printing per-phase breakdown of push/pull --migrate

Refactor-prep instrumentation. Adds opt-in latency measurement
across the laptop-side phases of clank push --migrate and
clank pull --migrate so we can see where wall-clock time
actually goes before deciding what to optimize.

Enable with --timing or CLANK_TIMING=1. Output (sorted largest-
first, with percentage of total):

  pushed checkpoint 01K... (HEAD aca7ca8d)
  pushed 5 session(s) for checkpoint 01K...
  migrated worktree 01K... → remote/01K...

    timing (sorted, largest first):
      migrate worktree (gateway)        9.7s   44%
      build session export              8.3s   38%
      upload sessions                   2.1s   10%
      push checkpoint                   1.4s    6%
      version compatibility check     400ms    2%
      request session upload URLs      150ms    1%
      ──────────────────────────────────────────
      total                            22.1s

Phases instrumented:
  push --migrate
    register worktree                (first push only)
    push checkpoint                  (git bundle build + S3 upload + commit)
    version compatibility check      (laptop + sprite version round-trip)
    build session export             (clank-host enumerate + opencode export per session)
    request session upload URLs      (gateway presign call)
    upload sessions                  (clank-host PUTs blobs to S3)
    migrate worktree (gateway)       (EnsureHost + sprite apply for code+sessions + ownership flip)

  pull --migrate
    version compatibility check
    materialize (gateway)            (sprite build + presign + GETs minted)
    apply code locally               (download bundles + checkpoint.Apply)
    apply sessions locally           (clank-host /sync/sessions/apply-from-urls)
    commit migration                 (ownership flip)

Design choices:
  - Disabled by default; zero-overhead when off (one extra
    method dispatch per Start, immeasurable at ms granularity).
  - Sorted largest-first because that's how you read it for
    "what to optimize next." Percentages because absolute
    numbers don't tell you ratios at a glance.
  - Closure-on-call shape (done := timer.Start(...); ...; done())
    instead of wrapping every operation in a func — keeps the
    call sites readable when most phases produce a value the
    next line uses.
  - humanDuration formats at the right granularity (µs / ms /
    s.x) instead of d.String()'s default "1.234567s" noise.

What's NOT measured here:
  - Server-side (gateway) sub-phases of "migrate worktree"
    and "materialize". These are single round trips from the
    laptop's perspective; subdividing them needs a Server-Timing
    response header, which is a follow-up once the laptop
    numbers say "we always lose 8s in materialize, let's
    decompose it."
  - Sub-phases of cli.PushCheckpoint (bundle build vs S3
    upload vs commit). One call in pkg/sync/client today;
    breaking it apart adds API surface — defer until the
    higher-level numbers say it's worth it.
  - Distributed tracing. Way too heavy for "I want to know
    where 22 seconds go."

Tests:
  - TestPhaseTimer_DisabledIsNoOp: zero-overhead when off.
  - TestPhaseTimer_RecordsAndSorts: sort order and total
    percentages.
  - TestPhaseTimer_StartActuallyTimes: end-to-end timing.
  - TestHumanDuration: formatting boundaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* clankcli: run version check BEFORE PushCheckpoint on --migrate

Wiring bug from the previous commit: the version-compatibility
check was inside the `if alsoMig {}` block but AFTER
cli.PushCheckpoint, so a sprite that's unreachable or runs the
wrong opencode version still cost the user a full code upload
before the migration aborted.

Move the pre-flight (NewRemoteClient + NewLocalClient + version
check) ahead of PushCheckpoint when --migrate is set. Both
clients are reused downstream for the session leg + migrate call;
no extra round trips.

Also improve the user-facing error when the version check fails
on `host unavailable` (the gateway's signal that EnsureHost
failed — sprite suspended, provisioner creds expired, etc.).
Previously the error read like a version mismatch:

  Error: read remote opencode version: opencode-version:
  daemon returned status 502: host unavailable

Now it tells the user what actually went wrong and where to look:

  Error: can't reach the remote sprite: ...host unavailable

  The gateway returned 'host unavailable', which means its
  provisioner.EnsureHost failed. Check the gateway's logs for
  the underlying reason. For the dev stack:
    make docker-logs
  Look for a line like 'gateway: ensure host for user …: …'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* clank-host: generalise /opencode-version → /software-manifest, cache it,
            asymmetric timeouts in compat check

Two changes wrapped into one because they share a stage:

1) Replace the narrow GET /opencode-version endpoint with a general
   GET /software-manifest. The wire format is forward-extensible —
   today populated only with opencode's version, but the struct
   tag layout reserves room for claude, clank-host's own version,
   and future tools without breaking the protocol. The CLI's
   compat check now reads .OpenCode.Version off the manifest
   instead of a top-level scalar.

   Replacement is a clean swap: the only consumer of the old
   endpoint was clankcli's compat check, which this commit moves
   atomically. -113 LOC of one-off plumbing files, +123 LOC of
   reusable manifest plumbing. Net wash on size, but the new
   endpoint is what we'd want to evolve toward anyway.

2) Cache the manifest in-process on clank-host. First call probes
   `opencode --version`, subsequent calls serve from memory in
   nanoseconds. Eliminates ~300-500ms of JS startup cost from
   every migration's compat-check phase — which timing
   instrumentation just showed was the single largest phase even
   when both ends were warm (1.1-1.6s consistently).

   Cache invalidation: only on clank-host restart. Safe because
   clank-host restarts whenever its binary changes (sidecar-hash
   mechanic in the flyio provisioner), and the only reason
   opencode's version would change under a running clank-host
   is an out-of-band manual upgrade — which the runtime compat
   check would catch on the laptop side immediately anyway.

Asymmetric timeouts in assertOpencodeCompatible:

  localCompatCheckTimeout  = 5 * time.Second
  remoteCompatCheckTimeout = 2 * time.Minute

The previous symmetric 10s timeout caused spurious "context
deadline exceeded" errors on first push to a fresh sprite —
EnsureHost on a cold sprite legitimately takes 30-90s for the
embedded clank-host install + opencode install. Local clank-host
serves /software-manifest from cache in milliseconds; remote
clank-host may cold-start. Independent budgets reflect that.

Tests:
  - TestSoftwareManifestEndpoint_ReturnsOpenCodeVersion: smoke-tests
    the new endpoint against a real opencode binary.
  - TestSoftwareManifestEndpoint_SubsequentCallsAreFast: pins
    the caching contract — second call must be <50ms (catches
    the regression where someone removes the cache and falls
    back to re-shelling opencode per request, which would be
    ~200ms+ minimum from JS startup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mux: trim verbose handleSoftwareManifest comment

* make docker-down: remove volumes too

* fix: address coderabbit correctness findings on PR #18

- opencode_compat: parse before comparing, so unparseable-but-equal
  inputs still refuse, and 1.14 ≡ 1.14.0 (semantic patch equality)
- software_manifest: detach sync.Once probe from the caller's ctx so
  a canceled first request can't poison the process-global cache
- timeparse: pass pure-digit input through as int64 millis to make
  parseLegacyTimeMs idempotent — a half-committed v3 migration can
  re-enter without rewriting timestamps via fallbackTimeMs/now()
- sessions.ListSessionsByWorktree: return []SessionInfo{} (not nil)
  for empty worktreeID to honor the docstring
- clankcli: error message says 'opencode upgrade v<version>'
  (positional), not the wrong --version flag

Regression tests added for each, including a half-committed v3
re-entry case (sessions table already INTEGER, user_version
hand-rewound to 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gateway: surface S3/sprite error bodies on the session-migrate path

- fetchURL non-200 path includes a 4 KiB body preview so SigV4 /
  object-not-found XML errors stop showing up as opaque "GET 403"
- session retry loop captures the last sprite error and wraps it
  into the "exhausted retries" message so the underlying cause
  isn't silently dropped after url_expired ping-pong

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test/doc: opt the opencode-import semantics test into t.Parallel and
retract the unverified phaseTimer concurrency claim

- TestOpenCodeImportSemantics is hermetic (per-test TempDir + per-cmd
  env array), so add t.Parallel per CLAUDE.md's "use t.Parallel where
  safe" rule
- phaseTimer's doc claimed nested/concurrent Start calls were fine, but
  every caller (push/pull/push_sessions) drives migration phases
  sequentially. Drop the aspirational sentence rather than add a
  synchronization that nothing exercises — keeps the contract honest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gateway: TODO(coderabbit) marker for the duplicated sprite-request blocks

Six near-identical NewRequestWithContext+Authorization+Transport blocks
across the code and session legs of migrate_back.go are real
duplication, but collapsing them is out of scope for the
import-export-sessions PR (+5936 already). Anchor the deferred work at
the first occurrence so a future contributor editing any of the six
sites lands on the link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gateway: commit checkpoint AFTER session upload in materialize handler

CodeRabbit caught a real ordering bug. handleMigrateMaterialize used to
run g.cfg.Sync.CommitCheckpoint — which advances
latest_synced_checkpoint via UpdateWorktreePointer — BEFORE the session
leg. If session-build or session-upload failed afterwards, the response
was a 502 to the laptop but the worktree pointer had already moved to a
checkpoint whose session blobs were missing from storage. Subsequent
push/pull operations that read LatestSyncedCheckpoint would then 404 on
session-manifest.json and silently skip sessions.

Re-ordered: session build + session upload come first; CommitCheckpoint
only runs once both code AND session uploads succeeded. The session
DownloadSessionURLs call moves after the commit (it gates on
UploadedAt). The "no sessions in this worktree" path is unchanged.

Regression test pins the invariant: with a sprite mock that succeeds
on code-leg and returns 500 on session-upload, the worktree's
LatestSyncedCheckpoint must remain at the pre-materialize value. Test
fails without the fix (pointer advances to the new checkpoint id),
passes with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] opencode import always assigns session to global project, ignoring current working directory

1 participant