
fix(remote-control): avoid server token refresh retry storms #30201

Merged
apanasenko-oai merged 5 commits into main from fix-remote-control-refresh-retry-storms on Jun 27, 2026


@apanasenko-oai

@apanasenko-oai apanasenko-oai commented Jun 26, 2026

Collaborator

Why

Remote-control websocket reconnects and pairing requests proactively refresh their server token. Previously, when /server/refresh returned a transient error such as 502, the caller stopped using the still-valid token, which caused reconnect failures and repeated refresh attempts that could amplify an upstream incident.

What Changed

  • Start proactive refresh five minutes before token expiry and distinguish it from a required refresh for missing or expired tokens.
  • Continue websocket and pairing operations with the existing valid token after 429, 5xx, or timeout failures.
  • Share an in-memory next_refresh_at throttle across websocket and pairing callers, honoring both Retry-After formats and otherwise using a jittered 24–36 second delay.
  • Keep required refreshes strict, preserve 404 enrollment replacement, and clear token/throttle state for 401 and 403 auth recovery.
  • Preserve refresh response metadata internally and add focused wire-level and integration coverage.
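The policy in the list above can be sketched as a small decision table. This is a minimal illustration, not the code in the PR: the type and function names (`RefreshKind`, `FailureAction`, `refresh_kind`, `on_refresh_failure`) are hypothetical, a timeout is modeled as a missing status code, and the handling of other 4xx responses as a plain failure is an assumption.

```rust
use std::time::Duration;

/// Why a refresh is being attempted (hypothetical names, not from the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RefreshKind {
    /// The token is still valid but expires within the proactive window.
    Proactive,
    /// The token is missing or already expired.
    Required,
}

/// What the caller does after a failed refresh attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureAction {
    /// Transient failure during a proactive refresh: keep using the valid token.
    UseExistingToken,
    /// Transient failure during a required refresh: do not connect or pair.
    FailClosed,
    /// 404: the enrollment is stale and gets replaced.
    ReplaceEnrollment,
    /// 401 or 403: clear token and throttle state for auth recovery.
    ClearAuthState,
    /// Any other response (assumption: treated as a plain failure).
    Fail,
}

/// Proactive refresh starts five minutes before expiry.
const PROACTIVE_WINDOW: Duration = Duration::from_secs(5 * 60);

/// `remaining` is `None` for a missing token, otherwise the time left before expiry.
fn refresh_kind(remaining: Option<Duration>) -> Option<RefreshKind> {
    match remaining {
        None => Some(RefreshKind::Required),
        Some(left) if left.is_zero() => Some(RefreshKind::Required),
        Some(left) if left <= PROACTIVE_WINDOW => Some(RefreshKind::Proactive),
        Some(_) => None, // token is fresh, no refresh needed
    }
}

/// `status` is `None` when the request or the response body timed out.
fn on_refresh_failure(kind: RefreshKind, status: Option<u16>) -> FailureAction {
    match status {
        Some(401) | Some(403) => FailureAction::ClearAuthState,
        Some(404) => FailureAction::ReplaceEnrollment,
        None | Some(429) | Some(500..=599) => match kind {
            RefreshKind::Proactive => FailureAction::UseExistingToken,
            RefreshKind::Required => FailureAction::FailClosed,
        },
        Some(_) => FailureAction::Fail,
    }
}
```

For example, `on_refresh_failure(RefreshKind::Proactive, Some(502))` yields `UseExistingToken`, while the same 502 during a `Required` refresh yields `FailClosed`.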

Verification

Added behavioral coverage proving that:

  • a valid near-expiry token still completes websocket and pairing requests after transient refresh failures;
  • Retry-After suppresses a subsequent refresh across websocket and pairing callers;
  • request and response-body timeouts are classified as transient;
  • an expired token, including one that expires during refresh, cannot proceed to websocket connection;
  • auth failures clear the attempted token without overwriting a concurrently rotated token.
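The last point, clearing the attempted token without overwriting a concurrently rotated one, is essentially a compare-and-clear under a lock. The sketch below shows one way that could look; `SharedAuth`, `AuthState`, and `clear_after_auth_failure` are hypothetical names chosen for illustration, and the real state in the PR may be shaped differently.

```rust
use std::sync::Mutex;
use std::time::Instant;

/// Token and throttle state guarded by one lock (hypothetical layout).
#[derive(Debug, Default)]
struct AuthState {
    token: Option<String>,
    next_refresh_at: Option<Instant>,
}

#[derive(Debug, Default)]
struct SharedAuth {
    state: Mutex<AuthState>,
}

impl SharedAuth {
    /// Install a new token, for example after a successful refresh or rotation.
    fn set_token(&self, token: &str) {
        self.state.lock().unwrap().token = Some(token.to_owned());
    }

    fn token(&self) -> Option<String> {
        self.state.lock().unwrap().token.clone()
    }

    /// Clear token and throttle state after a 401/403, but only if the stored
    /// token is still the one the failed request used. Returns whether the
    /// state was cleared.
    fn clear_after_auth_failure(&self, attempted: &str) -> bool {
        let mut state = self.state.lock().unwrap();
        if state.token.as_deref() != Some(attempted) {
            // Another caller rotated the token in the meantime; leave it alone.
            return false;
        }
        *state = AuthState::default();
        true
    }
}
```

A caller that got a 401 with token `"old"` after another task installed `"rotated"` would see `clear_after_auth_failure("old")` return `false`, and the rotated token stays in place.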

@apanasenko-oai apanasenko-oai changed the title from "Protect remote control refresh from retry storms" to "fix(remote-control): avoid server token refresh retry storms" Jun 26, 2026
@apanasenko-oai apanasenko-oai marked this pull request as ready for review June 26, 2026 07:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Contributor

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bac2ddd2c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread codex-rs/Cargo.toml
apanasenko-oai added a commit that referenced this pull request Jun 26, 2026
apanasenko-oai added a commit that referenced this pull request Jun 26, 2026
@apanasenko-oai

Collaborator Author

@codex

@owenlin0

Collaborator

One issue to consider: required refreshes still ignore Retry-After.

In codex-rs/app-server-transport/src/transport/remote_control/server_api.rs:150, deferral only happens for Proactive refreshes. Once the token is missing or expired, a 429/5xx response with Retry-After falls through as an error, and the websocket loop retries using its generic backoff. That means the refresh endpoint can still receive repeated attempts while unavailable or rate-limiting—the retry-storm scenario this PR is trying to prevent.

Could we preserve a refresh retry deadline for required refreshes too, while still blocking websocket/pairing use until refresh succeeds?

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Contributor

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66750d558b


Comment thread codex-rs/app-server-transport/Cargo.toml
@apanasenko-oai

Collaborator Author

@owenlin0 Addressed in 485865b894. Required transient refresh failures now preserve the same Retry-After or fallback deadline used by proactive refreshes. Until that deadline passes, websocket reconnect and pairing both remain blocked and return locally without issuing another /server/refresh request; they still cannot proceed until a required refresh succeeds.
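A rough sketch of the shared deadline described here, for readers following along. It assumes a clonable handle around one `next_refresh_at` slot and takes `now` and the jitter sample as parameters so it stays deterministic; `RefreshThrottle`, `Gate`, and `fallback_delay` are hypothetical names, and only the 24–36 second range and the `next_refresh_at` name come from the PR description.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Result of asking the throttle whether a refresh may be sent now.
#[derive(Debug, PartialEq, Eq)]
enum Gate {
    /// No deadline is pending: the caller may call /server/refresh.
    Attempt,
    /// A deadline is pending: return locally without an upstream request.
    Deferred { until: Instant },
}

/// One throttle record shared by websocket and pairing callers via `clone()`.
#[derive(Clone, Default)]
struct RefreshThrottle {
    next_refresh_at: Arc<Mutex<Option<Instant>>>,
}

impl RefreshThrottle {
    fn check(&self, now: Instant) -> Gate {
        match *self.next_refresh_at.lock().unwrap() {
            Some(until) if now < until => Gate::Deferred { until },
            _ => Gate::Attempt,
        }
    }

    /// Record a deadline after a transient failure, for proactive and required
    /// refreshes alike. `jitter` is a sample in [0, 1] supplied by the caller.
    fn record_transient_failure(&self, now: Instant, retry_after: Option<Duration>, jitter: f64) {
        let delay = retry_after.unwrap_or_else(|| fallback_delay(jitter));
        *self.next_refresh_at.lock().unwrap() = Some(now + delay);
    }

    /// Drop the deadline, for example after a successful refresh.
    fn clear(&self) {
        *self.next_refresh_at.lock().unwrap() = None;
    }
}

/// Fallback used when no Retry-After is present: 24 s plus up to 12 s of jitter.
fn fallback_delay(jitter: f64) -> Duration {
    Duration::from_secs_f64(24.0 + 12.0 * jitter.clamp(0.0, 1.0))
}
```

In this sketch the difference between the two refresh kinds lives in the caller: on `Gate::Deferred`, a proactive caller proceeds with its still-valid token, while a required caller returns an error locally and stays blocked.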

Added focused websocket and pairing coverage, and updated the stale-enrollment integration test to verify fallback deadline preservation. Validation: just test -p codex-app-server-transport (137/137), just fix -p codex-app-server-transport, just fmt, and git diff --check.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

@apanasenko-oai

Collaborator Author

@codex

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Contributor

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 485865b894


Comment thread codex-rs/app-server-transport/Cargo.toml
Comment thread codex-rs/app-server-transport/src/transport/remote_control/mod.rs Outdated
Comment thread codex-rs/app-server-transport/src/transport/remote_control/enroll.rs Outdated
@apanasenko-oai

Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Contributor

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e3fa17653


Comment thread codex-rs/app-server-transport/Cargo.toml
@apanasenko-oai apanasenko-oai force-pushed the fix-remote-control-refresh-retry-storms branch from 2559139 to e470e94 June 27, 2026 00:02
@rpelevin


I would make the acceptance boundary the shared refresh lease, not just the successful reconnect path.

The failure mode this closes is subtle: a still-valid token is allowed to carry work while proactive refresh is unavailable, but a required refresh must fail closed without turning every caller into a separate retry loop. That means websocket reconnect and pairing need to observe the same throttle record, including Retry-After, timeout classification, and auth recovery behavior.

The regression shape I would want is:

  1. with a near-expiry but still-valid token, a transient proactive refresh failure records a retry deadline and both websocket and pairing continue with the existing token;
  2. a second caller before that deadline returns locally without issuing a new refresh;
  3. with a missing or expired token, the same transient failure records the deadline but blocks connection and pairing until refresh succeeds;
  4. Retry-After absolute date and delta seconds produce the same shared lease semantics;
  5. auth failures clear attempted auth state without clobbering a concurrently rotated token;
  6. a successful refresh after the lease expires clears the throttle and attaches response metadata to the new authority record.

The useful invariant is that refresh pressure is governed per server authority, not per caller or per transport path. Proactive refresh can degrade to the still-valid token, but required refresh can only degrade to a local terminal wait state, never to an upstream retry storm.
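Item 4 in the list above depends on both Retry-After forms collapsing to the same delay before they reach the shared lease. A minimal std-only sketch of that normalization follows. It is an illustration under stated assumptions, not the parser used in the PR: `parse_retry_after` and its helpers are hypothetical, only the IMF-fixdate form of an HTTP-date is handled, field ranges are not validated, and the current time is passed in as Unix seconds.

```rust
use std::time::Duration;

const MONTHS: [&str; 12] = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
];

/// Days since 1970-01-01 for a proleptic Gregorian date (standard civil-date arithmetic).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let shifted_month = if month > 2 { month - 3 } else { month + 9 };
    let doy = (153 * shifted_month + 2) / 5 + day - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse an IMF-fixdate such as "Sun, 06 Nov 1994 08:49:37 GMT" into Unix seconds.
fn parse_http_date(value: &str) -> Option<u64> {
    let parts: Vec<&str> = value.split_whitespace().collect();
    if parts.len() != 6 || parts[5] != "GMT" {
        return None;
    }
    let day = parts[1].parse::<i64>().ok()?;
    let month = MONTHS.iter().position(|m| *m == parts[2])? as i64 + 1;
    let year = parts[3].parse::<i64>().ok()?;
    let mut hms = parts[4].split(':').map(|p| p.parse::<u64>().ok());
    let (hours, minutes, seconds) = (hms.next()??, hms.next()??, hms.next()??);
    let days = days_from_civil(year, month, day);
    if days < 0 {
        return None; // dates before the Unix epoch are not useful here
    }
    Some(days as u64 * 86_400 + hours * 3_600 + minutes * 60 + seconds)
}

/// Normalize either Retry-After form to a delay relative to `now_unix`.
/// A date in the past yields a zero delay; an unparseable value yields `None`.
fn parse_retry_after(value: &str, now_unix: u64) -> Option<Duration> {
    let value = value.trim();
    if let Ok(delta_seconds) = value.parse::<u64>() {
        return Some(Duration::from_secs(delta_seconds));
    }
    let retry_at = parse_http_date(value)?;
    Some(Duration::from_secs(retry_at.saturating_sub(now_unix)))
}
```

With both forms reduced to a `Duration`, whichever caller records the deadline stores the same kind of value, so the lease check does not need to know which header form the server sent.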

Boundary: architecture and regression-test feedback only; no claim about using this project, running this branch, validating implementation behavior, implementation correctness, performance measurements, merge readiness, security review, production readiness, partnership, customer interest, official alignment, OpenAI usage, Codex usage, conformance certification, or Neura usage.

@apanasenko-oai

Collaborator Author

[codex] @rpelevin Added focused reverse-path regression coverage in d07f62c: pairing receives a transient 502 with an HTTP-date Retry-After, records the shared deadline, continues with the still-valid token, and then websocket connects without issuing another refresh. The test passes on the current implementation, confirming that this lease is shared across pairing and websocket callers. Full validation: just test -p codex-app-server-transport (141/141), just fix -p codex-app-server-transport, just fmt, and git diff --check.

@apanasenko-oai apanasenko-oai merged commit d047c33 into main Jun 27, 2026
46 checks passed
@apanasenko-oai apanasenko-oai deleted the fix-remote-control-refresh-retry-storms branch June 27, 2026 00:34
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 27, 2026

3 participants