fix(moq-native): back off when a session flaps instead of busy-looping#1806
Conversation
The reconnect loop only slept on the `Err` branch (when `client.connect()` itself fails). When the connection succeeded but the session dropped immediately (server accepts then resets), the `Ok(session)` path looped straight back with no delay, spinning the CPU. `retry_start` also reset on every drop, so the give-up timeout never fired during a flap and the loop ran forever. Apply the backoff sleep at the end of every iteration. A session that outlives the initial backoff is healthy and resets the window; one that is severed faster is treated as a failed connection: the close reason is kept in `last_error` (so the give-up timeout reports a real cause), backoff escalates, and the timeout keeps accumulating toward giving up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5d3d742 to
6770740
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe reconnect backoff documentation is updated so that the timeout counter resets only after a session remains connected for at least 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
A user hit a busy loop where the reconnect logs scrolled by at full CPU: the client connected (
connected version=moq-lite-05-wip), the session dropped in the same millisecond (session closed, reconnecting), and it reconnected instantly, over and over.The backoff config (
initial/multiplier/max/timeout) was already wired up, but the sleep only ran on theErrbranch whenclient.connect()itself failed. The observed failure is the opposite: connect succeeds, then the established session is severed immediately. That path (Ok(session)→session.closed()) had no sleep, so it spun.A second, compounding bug:
retry_startreset on every drop, so during a flapretry_start.elapsed()was always ~0 and the 5-minute give-uptimeoutnever fired. The loop could never escape on its own.Changes
In
rs/moq-native/src/reconnect.rs, restructured the loop so the backoff sleep runs at the end of every iteration (connect-failure and session-drop alike):backoff.initial, default 1s) → reset delay to initial and reset the timeout window, then sleep the initial delay. A one-off drop after a healthy session reconnects promptly and isn't rate-escalated.session.closed()(thecode=0/code=5 invalid valuein the logs) is captured intolast_errorinstead of being discarded, logged assession severed immediately, retrying, and the loop falls through to the shared backoff sleep. Delay escalates 1s → 2s → … → 30s andretry_startkeeps accumulating, so a server that perpetually accepts-then-resets eventually fails withreconnect timed out after 300s: <close reason>— a real cause, not a generic timeout.Also updated the
Backoff::timeoutdoc to note it resets after a stable connection, not after every drop.Notes for reviewers
moq-native, so targetingmain.dev; the fix onmainflows todevon the next merge.Test plan
cargo clippy -p moq-native --all-targets(vianix develop) — cleancargo fmt -p moq-native --check— cleanmoq-nativereconnect integration test passes(Written by Claude)