code-mode: make create and observe retry-safe#29397

Closed
cconger wants to merge 1 commit into cconger/code-mode-runtime-compact-03g-create-observe-api from cconger/code-mode-runtime-compact-03g1-retry-safe-operations

Conversation

@cconger cconger commented Jun 22, 2026

Contributor

Why

The session protocol crosses a cancelable IPC boundary. A lost create or observe response must be retryable without starting another cell or consuming output that Core never received.

What

  • Add caller-provided idempotency_key fields to the wire CreateCellRequest and ObserveRequest types.
  • Reuse the admitted cell for a repeated matching create key.
  • Detach host-side observation work from the lifetime of the caller's response future.
  • Cache and replay the same ObserveOutcome wire value when an observation key is retried.
  • Reject reuse of a key for different request contents.
  • Leave the operation-specific outcome unions from #29291 ("code-mode: expose create and observe operations") unchanged.
  • Add canceled-create, canceled-observe, and replay contract tests.
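
For illustration, the create-side rules in the list above (reuse the admitted cell for a matching key, reject the key for different contents) can be sketched as follows. This is a minimal sketch and not the code in this PR: the `Host` type, the `source` field, the `CreateError` enum, and the in-memory map are all assumptions made for the example; only `CreateCellRequest`, `CellId`, and `idempotency_key` come from the description.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the wire type. Only `idempotency_key` is
/// named in the PR description; `source` is an assumed payload field.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CreateCellRequest {
    pub idempotency_key: String,
    pub source: String,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CellId(pub u64);

/// Assumed error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    /// The key was already used with different request contents.
    KeyConflict,
}

/// Assumed host-side bookkeeping: key -> (original request, admitted cell).
#[derive(Default)]
pub struct Host {
    next_cell: u64,
    admitted: HashMap<String, (CreateCellRequest, CellId)>,
}

impl Host {
    pub fn create(&mut self, req: CreateCellRequest) -> Result<CellId, CreateError> {
        if let Some((original, cell_id)) = self.admitted.get(&req.idempotency_key) {
            // A repeated matching request reuses the admitted cell;
            // the same key with different contents is rejected.
            return if *original == req {
                Ok(*cell_id)
            } else {
                Err(CreateError::KeyConflict)
            };
        }
        // First time this key is seen: admit exactly one new cell.
        let cell_id = CellId(self.next_cell);
        self.next_cell += 1;
        self.admitted
            .insert(req.idempotency_key.clone(), (req, cell_id));
        Ok(cell_id)
    }

    /// Number of cells admitted so far.
    pub fn cell_count(&self) -> usize {
        self.admitted.len()
    }
}
```

With this shape, Core can resend the identical request after a lost response and receive the same `CellId`, while the host still admits only one cell.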

Host/Core wire sequence

sequenceDiagram
    participant Core
    participant Host
    Core->>Host: CreateCellRequest { idempotency_key, ... }
    Host-->>Core: CellId
    opt create response is lost
        Core->>Host: same CreateCellRequest
        Host-->>Core: same CellId
    end
    Core->>Host: ObserveRequest { idempotency_key, cell_id, yield_time_ms }
    Host->>Host: finish observation independently
    opt observe response is lost
        Core->>Host: same ObserveRequest
        Host-->>Core: same ObserveOutcome
    end
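
The observe half of the sequence can be sketched the same way. The sketch below assumes a single cell, a simple queue of pending output, and a `KeyReused` error; the `ObservationLog` type and the `output` field of `ObserveOutcome` are hypothetical. It shows only the replay property from the diagram: a retried `ObserveRequest` returns the cached outcome and does not consume output that arrived later.

```rust
use std::collections::{HashMap, VecDeque};

/// Field names follow the sequence diagram.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ObserveRequest {
    pub idempotency_key: String,
    pub cell_id: u64,
    pub yield_time_ms: u64,
}

/// Hypothetical outcome shape; the real union comes from #29291.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ObserveOutcome {
    pub output: Vec<String>,
}

/// Assumed error for reusing a key with different request contents.
#[derive(Debug, PartialEq, Eq)]
pub struct KeyReused;

/// Assumed per-cell state: unconsumed output plus a replay cache.
#[derive(Default)]
pub struct ObservationLog {
    pending: VecDeque<String>,
    replay: HashMap<String, (ObserveRequest, ObserveOutcome)>,
}

impl ObservationLog {
    /// Record output produced by the cell.
    pub fn push_output(&mut self, chunk: &str) {
        self.pending.push_back(chunk.to_string());
    }

    pub fn observe(&mut self, req: ObserveRequest) -> Result<ObserveOutcome, KeyReused> {
        if let Some((original, outcome)) = self.replay.get(&req.idempotency_key) {
            // Retry: replay the stored outcome without touching `pending`.
            return if *original == req {
                Ok(outcome.clone())
            } else {
                Err(KeyReused)
            };
        }
        // First observation for this key: consume pending output once
        // and remember the outcome for later retries.
        let outcome = ObserveOutcome {
            output: self.pending.drain(..).collect(),
        };
        self.replay
            .insert(req.idempotency_key.clone(), (req, outcome.clone()));
        Ok(outcome)
    }
}
```

In this sketch the cache grows with every new key, which matches the session-lifetime retention described under "Stack boundary".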

Stack boundary

Observation records are keyed independently and retained for the session lifetime in this PR. #29398 replaces arbitrary observation keys with per-cell generations and bounds retention to the latest observation.
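
To make the contrast concrete, here is one way bounded, per-cell retention could look. This is a guess at the general shape only and is not taken from #29398: the `LatestOnly` type, the `u64` generation counter, the `StaleGeneration` error, and the `produce` callback are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical outcome payload for the sketch.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Outcome(pub String);

#[derive(Debug, PartialEq, Eq)]
pub enum ObserveError {
    /// The caller asked for a generation older than the retained one.
    StaleGeneration,
}

/// Keeps at most one (generation, outcome) pair per cell.
#[derive(Default)]
pub struct LatestOnly {
    latest: HashMap<u64, (u64, Outcome)>,
}

impl LatestOnly {
    pub fn observe(
        &mut self,
        cell_id: u64,
        generation: u64,
        produce: impl FnOnce() -> Outcome,
    ) -> Result<Outcome, ObserveError> {
        if let Some((retained, outcome)) = self.latest.get(&cell_id) {
            if *retained == generation {
                // Retry of the latest observation: replay it.
                return Ok(outcome.clone());
            }
            if generation < *retained {
                // Older records were dropped, so they cannot be replayed.
                return Err(ObserveError::StaleGeneration);
            }
        }
        // New generation: run the observation and overwrite the old record.
        let outcome = produce();
        self.latest.insert(cell_id, (generation, outcome.clone()));
        Ok(outcome)
    }

    /// Number of retained records (at most one per cell).
    pub fn retained(&self) -> usize {
        self.latest.len()
    }
}
```

The trade-off in this sketch is that only the most recent observation per cell can be retried, in exchange for memory use that no longer grows with the number of observations.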

Validation

  • just test -p codex-code-mode
  • just test -p codex-code-mode-protocol
  • Contract coverage for canceled and repeated create/observe operations.

Stack parent: #29291.

@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch from c79e006 to 4a76c0a Compare June 22, 2026 06:48
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g-create-observe-api branch 2 times, most recently from 4232322 to 5cc94cd Compare June 22, 2026 06:53
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch from 4a76c0a to 457f6f2 Compare June 22, 2026 06:53
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g-create-observe-api branch from 5cc94cd to e647c2f Compare June 22, 2026 07:02
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch 2 times, most recently from 1152f3f to 5df525b Compare June 22, 2026 07:09
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g-create-observe-api branch from e647c2f to b3ba157 Compare June 22, 2026 07:09
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch from 5df525b to a6be94f Compare June 22, 2026 07:22
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g-create-observe-api branch from b3ba157 to a9ed5af Compare June 22, 2026 07:22
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g-create-observe-api branch from a9ed5af to 96581e9 Compare June 22, 2026 08:37
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch from a6be94f to 3081aba Compare June 22, 2026 08:37
@cconger cconger force-pushed the cconger/code-mode-runtime-compact-03g1-retry-safe-operations branch from 3081aba to 098e9f0 Compare June 22, 2026 08:56

@jif-oai jif-oai left a comment

Collaborator

The deadlock is a major blocker

    if tool_name.namespace.is_none() && tool_name.name.as_str() == WAIT_TOOL_NAME =>
    {
        let args: ExecWaitArgs = parse_arguments(&arguments)?;
        let idempotency_key = format!("{}:{call_id}", session.thread_id());

We generate a retry key for wait, but the terminate=true branch drops it.
Termination claims the cell before awaiting its outcome, so after a lost IPC response the retry returns "already terminating" or "missing" instead of the original terminal output.

        },
    );
    let runtime = Arc::clone(&self.runtime);
    tokio::spawn(async move {

This task keeps the runtime alive even after the service and all replay receivers are gone, so dropping the session no longer cancels the cell; the cell keeps running until the observation deadline expires.
This is a standard ownership deadlock. It needs a proper teardown path or a weak hold on the runtime.
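
A minimal sketch of the weak-hold option, using only the standard library. The `Runtime` struct and `poll_once` function are hypothetical and stand in for whatever the detached task does on each step; the point is that the task stores a `Weak<Runtime>` rather than an `Arc::clone`, so it cannot extend the runtime's lifetime.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Weak};

/// Hypothetical runtime; `polls` only exists to make the effect visible.
pub struct Runtime {
    pub polls: AtomicUsize,
}

/// One step of detached observation work.
/// Returns `false` once every strong owner of the runtime is gone,
/// which is the signal for the background task to stop.
pub fn poll_once(runtime: &Weak<Runtime>) -> bool {
    match runtime.upgrade() {
        Some(rt) => {
            // The temporary strong reference is dropped at the end of
            // this arm, so the task never holds the runtime between steps.
            rt.polls.fetch_add(1, Ordering::SeqCst);
            true
        }
        None => false,
    }
}
```

A spawned task would call `Arc::downgrade(&self.runtime)` before the spawn and loop on `poll_once` (or its async equivalent), exiting when it returns `false`. An explicit teardown signal, such as a cancellation flag checked on drop of the session, is the other option mentioned above and is not shown here.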

@cconger cconger closed this Jun 23, 2026

2 participants