fix(moq-native): back off when a session flaps instead of busy-looping #1806

Merged: kixelated merged 1 commit into main from claude/dazzling-lichterman-69ec87 on Jun 19, 2026

@kixelated

Collaborator

Summary

A user hit a busy loop where the reconnect logs scrolled by at full CPU: the client connected (connected version=moq-lite-05-wip), the session dropped in the same millisecond (session closed, reconnecting), and it reconnected instantly, over and over.

The backoff config (initial/multiplier/max/timeout) was already wired up, but the sleep only ran on the Err branch, when client.connect() itself failed. The observed failure is the opposite: connect succeeds, then the established session is severed immediately. That path (Ok(session) → session.closed()) had no sleep, so it spun.

A second, compounding bug: retry_start reset on every drop, so during a flap retry_start.elapsed() was always ~0 and the 5-minute give-up timeout never fired. The loop could never escape on its own.
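
The two bugs above can be reproduced on a virtual clock. The following is a simplified sketch, not the actual moq-native code: `Attempt` and `simulate_old_loop` are hypothetical names, the delay is fixed rather than escalating, and placing the timeout check on the connect-failure branch is an assumption.

```rust
use std::time::Duration;

/// What one loop iteration observed (hypothetical type for this sketch).
#[derive(Clone, Copy)]
enum Attempt {
    /// `connect()` itself failed.
    ConnectFailed,
    /// `connect()` succeeded and the session then lasted this long.
    SessionLasted(Duration),
}

/// Replays the pre-fix loop shape on a virtual clock.
/// Returns (iterations run, total virtual time slept, whether it gave up).
fn simulate_old_loop(
    attempts: &[Attempt],
    delay: Duration,
    timeout: Duration,
) -> (usize, Duration, bool) {
    let mut now = Duration::ZERO; // virtual clock, no real sleeping
    let mut retry_start = now;
    let mut slept = Duration::ZERO;

    for (i, attempt) in attempts.iter().enumerate() {
        match attempt {
            Attempt::ConnectFailed => {
                // Only this branch checks the timeout and sleeps.
                if now - retry_start >= timeout {
                    return (i + 1, slept, true);
                }
                now += delay;
                slept += delay;
            }
            Attempt::SessionLasted(uptime) => {
                now += *uptime;
                // Bug 2: the timeout window restarts on every drop.
                retry_start = now;
                // Bug 1: no sleep here, so the loop retries immediately.
            }
        }
    }
    (attempts.len(), slept, false)
}
```

Feeding this a long run of `Attempt::SessionLasted(Duration::ZERO)` never sleeps and never gives up, which matches the spin described above, while a run of `Attempt::ConnectFailed` does sleep and eventually hits the timeout.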

Changes

In rs/moq-native/src/reconnect.rs, restructured the loop so the backoff sleep runs at the end of every iteration (connect-failure and session-drop alike):

  • Stable connection (uptime ≥ backoff.initial, default 1s) → reset delay to initial and reset the timeout window, then sleep the initial delay. A one-off drop after a healthy session reconnects promptly and isn't rate-escalated.
  • Immediate flap (uptime < initial, e.g. server accepts then resets) → treated as a failed connection: the close reason from session.closed() (the code=0 / code=5 invalid value in the logs) is captured into last_error instead of being discarded, logged as session severed immediately, retrying, and the loop falls through to the shared backoff sleep. Delay escalates 1s → 2s → … → 30s and retry_start keeps accumulating, so a server that perpetually accepts-then-resets eventually fails with reconnect timed out after 300s: <close reason> — a real cause, not a generic timeout.
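
The per-iteration decision described in the two bullets above could be sketched as a pure step function. This is an illustration under stated assumptions, not the code in reconnect.rs: `Backoff` here is a hypothetical mirror of the real config (the integer `multiplier` is an assumption), `Outcome`, `Retry`, `Next` and `step` are invented for the example, the `elapsed` field stands in for `retry_start.elapsed()`, and the position of the timeout check is a guess.

```rust
use std::time::Duration;

/// Hypothetical mirror of the backoff settings named in the PR.
struct Backoff {
    initial: Duration,
    multiplier: u32,
    max: Duration,
    timeout: Duration,
}

/// How one loop iteration ended (hypothetical type).
enum Outcome {
    ConnectFailed(String),
    SessionClosed { uptime: Duration, reason: Option<String> },
}

/// Retry bookkeeping carried across iterations.
struct Retry {
    delay: Duration,
    /// Time accumulated since the last stable connection.
    elapsed: Duration,
    last_error: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Next {
    SleepFor(Duration),
    GiveUp(String),
}

fn step(retry: &mut Retry, backoff: &Backoff, outcome: Outcome) -> Next {
    match outcome {
        // Stable connection: reset the delay and the timeout window.
        Outcome::SessionClosed { uptime, .. } if uptime >= backoff.initial => {
            retry.delay = backoff.initial;
            retry.elapsed = Duration::ZERO;
            retry.last_error = None;
        }
        // Immediate flap: treat as a failure and keep the close reason if any.
        Outcome::SessionClosed { uptime, reason } => {
            retry.elapsed += uptime;
            if reason.is_some() {
                retry.last_error = reason;
            }
        }
        Outcome::ConnectFailed(err) => retry.last_error = Some(err),
    }

    if retry.elapsed >= backoff.timeout {
        let cause = retry
            .last_error
            .clone()
            .unwrap_or_else(|| "unknown".to_string());
        return Next::GiveUp(format!(
            "reconnect timed out after {}s: {}",
            backoff.timeout.as_secs(),
            cause
        ));
    }

    // Shared backoff path: every iteration sleeps, then the delay escalates.
    let sleep = retry.delay;
    retry.elapsed += sleep;
    retry.delay = retry.delay.saturating_mul(backoff.multiplier).min(backoff.max);
    Next::SleepFor(sleep)
}
```

Keeping the decision separate from the actual sleep and connect calls is a choice made for this sketch only; it lets the flap and stable cases be exercised without real timing.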

Also updated the Backoff::timeout doc to note it resets after a stable connection, not after every drop.

Notes for reviewers

  • No public API or wire change; behavioral bug fix in moq-native, so targeting main.
  • A healthy session now waits ~1s before reconnecting rather than reconnecting instantly, which also rate-limits a thundering-herd reconnect after a server restart.
  • Reproduced on dev; the fix on main flows to dev on the next merge.

Test plan

  • cargo clippy -p moq-native --all-targets (via nix develop) — clean
  • cargo fmt -p moq-native --check — clean
  • Existing moq-native reconnect integration test passes
  • A flapping-session integration test would need a server that accepts then instantly resets (timing-flaky); not added

(Written by Claude)

@kixelated kixelated enabled auto-merge (squash) June 19, 2026 19:24
The reconnect loop only slept on the `Err` branch (when `client.connect()`
itself fails). When the connection succeeded but the session dropped
immediately (server accepts then resets), the `Ok(session)` path looped
straight back with no delay, spinning the CPU. `retry_start` also reset on
every drop, so the give-up timeout never fired during a flap and the loop
ran forever.

Apply the backoff sleep at the end of every iteration. A session that
outlives the initial backoff is healthy and resets the window; one that is
severed faster is treated as a failed connection: the close reason is kept
in `last_error` (so the give-up timeout reports a real cause), backoff
escalates, and the timeout keeps accumulating toward giving up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kixelated kixelated force-pushed the claude/dazzling-lichterman-69ec87 branch from 5d3d742 to 6770740 Compare June 19, 2026 19:25
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 230e34d and 6770740.

📒 Files selected for processing (1)
  • rs/moq-native/src/reconnect.rs

Walkthrough

The reconnect backoff documentation is updated so that the timeout counter resets only after a session remains connected for at least backoff.initial, not after every successful connection. The reconnect loop is reworked to record session uptime after each successful connect: if the uptime meets or exceeds backoff.initial, the delay window and retry start are reset and last_error is cleared; otherwise, the close error is preserved when available, no reset occurs, and the shared exponential backoff sleep and escalation path continues as if the connection had failed.
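
As a rough worked example of the escalation path, the sketch below lists the sleeps a perpetually flapping session would go through before the accumulated wait reaches the timeout. `flap_schedule` is a hypothetical helper, not part of moq-native, and it assumes zero connect time, zero session uptime, and an integer multiplier.

```rust
use std::time::Duration;

/// Lists the backoff sleeps taken until their running total reaches `timeout`.
/// Assumes connecting and the flapping session itself take no time.
fn flap_schedule(
    initial: Duration,
    multiplier: u32,
    max: Duration,
    timeout: Duration,
) -> Vec<Duration> {
    let mut sleeps = Vec::new();
    let mut delay = initial;
    let mut elapsed = Duration::ZERO;

    while elapsed < timeout {
        // A zero delay would never make progress toward the timeout.
        if delay.is_zero() {
            break;
        }
        sleeps.push(delay);
        elapsed += delay;
        // Escalate, capped at `max`.
        delay = delay.saturating_mul(multiplier).min(max);
    }
    sleeps
}
```

With a 1s initial delay, a multiplier of 2, a 30s cap and a 300s timeout, this yields 1s, 2s, 4s, 8s, 16s and then 30s repeated, so under these assumptions a server that always accepts then resets is retried on the order of a dozen times before the loop gives up.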

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: preventing busy-looping by implementing backoff when a session flaps (connects then immediately disconnects). |
| Description check | ✅ Passed | The description is well-structured and directly related to the changeset, explaining the root causes, solution, and specific implementation details for fixing the reconnect busy-loop issue. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@kixelated kixelated merged commit 6ce3708 into main Jun 19, 2026
1 check passed
@kixelated kixelated deleted the claude/dazzling-lichterman-69ec87 branch June 19, 2026 19:40
@moq-bot moq-bot Bot mentioned this pull request Jun 19, 2026
@kixelated kixelated mentioned this pull request Jun 19, 2026
4 tasks
@moq-bot moq-bot Bot mentioned this pull request Jun 23, 2026