fix(remote-control): avoid server token refresh retry storms#30201
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bac2ddd2c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
One issue to consider: required refreshes still ignore In Could we preserve a refresh retry deadline for required refreshes too, while still blocking websocket/pairing use until refresh succeeds? |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66750d558b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@owenlin0 Addressed in Added focused websocket and pairing coverage, and updated the stale-enrollment integration test to verify fallback deadline preservation. Validation: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 485865b894
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e3fa17653
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2559139 to
e470e94
Compare
|
I would make the acceptance boundary the shared refresh lease, not just the successful reconnect path. The failure mode this closes is subtle: a still-valid token is allowed to carry work while proactive refresh is unavailable, but a required refresh must fail closed without turning every caller into a separate retry loop. That means websocket reconnect and pairing need to observe the same throttle record, including Retry-After, timeout classification, and auth recovery behavior. The regression shape I would want is:
The useful invariant is that refresh pressure is governed per server authority, not per caller or per transport path. Proactive refresh can degrade to the still-valid token, but required refresh can only degrade to a local terminal wait state, never to an upstream retry storm. Boundary: architecture and regression-test feedback only; no claim about using this project, running this branch, validating implementation behavior, implementation correctness, performance measurements, merge readiness, security review, production readiness, partnership, customer interest, official alignment, OpenAI usage, Codex usage, conformance certification, or Neura usage. |
|
[codex] @rpelevin Added focused reverse-path regression coverage in d07f62c: pairing receives a transient 502 with an HTTP-date Retry-After, records the shared deadline, continues with the still-valid token, and then websocket connects without issuing another refresh. The test passes on the current implementation, confirming that this lease is shared across pairing and websocket callers. Full validation: just test -p codex-app-server-transport (141/141), just fix -p codex-app-server-transport, just fmt, and git diff --check. |
Why
Remote-control websocket reconnects and pairing requests proactively refresh their server token. When
/server/refreshreturns a transient error such as502, the still-valid token was discarded as a usable connection path, causing reconnect failures and repeated refresh attempts that could amplify an upstream incident.What Changed
429,5xx, or timeout failures.next_refresh_atthrottle across websocket and pairing callers, honoring bothRetry-Afterformats and otherwise using a jittered 24–36 second delay.404enrollment replacement, and clear token/throttle state for401and403auth recovery.Verification
Added behavioral coverage proving that:
Retry-Aftersuppresses a subsequent refresh across websocket and pairing callers;