feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers)#609
Merged
Merged
Conversation
…rness Three integration tests in desktop/src-tauri/tests/rodio_mixer_diagnostic.rs lock in the diagnosis we landed on for huddle's 3+ speaker failure: 1. single_queue_serializes_sources: a rodio::queue plays sources back-to-back. This is the shape rodio::Player::connect_new produces, and the reason every peer's decoded 20 ms frame serialized when appended to one Player. 2. mixer_sums_overlapping_sources: rodio::mixer sums concurrent sources. 3. mixer_of_queues_mixes_per_peer_streams: a mixer fed by N independent queues (one per peer) sums them — the shape the upcoming huddle fix moves to. No audio device is touched; sources are SamplesBuffer<f32>, drained directly. Deterministic on CI. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Introduces desktop/src-tauri/src/huddle/jitter.rs containing:
* OpusFrameDecoder — implements neteq::codec::AudioDecoder over our existing
opus = "0.3" dependency, so we don't have to enable neteq's 'native' feature
(which would pull in cpal/axum/clap).
* PeerJitterBuffer — wraps neteq::NetEq configured for 48 kHz mono with
min_delay=40ms / max_delay=200ms / max_packets=50. One per remote peer.
Dep added:
* neteq = { version = "0.8", default-features = false }
License is MIT/Apache-2.0. Default features include 'native' which drags
in cpal/axum/clap/tokio; we skip them and only get the NetEq state
machine + the AudioDecoder trait.
Three unit tests cover the API surface.
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
…nt playout
Rewrites the WS receive path to fix the FIFO bug pinned by
tests/rodio_mixer_diagnostic.rs:
* OLD shape: one rodio::Player on the device mixer, every peer's decoded
20 ms frame appended to it. rodio::Player is a single-queue FIFO; 3+
simultaneous speakers serialized — peer A's frame, then peer B's, then
peer C's — with audible voice-flipping every 20 ms and unbounded queue
growth.
* NEW shape: each peer gets its own rodio::Player (queue) added to the same
device mixer. The device mixer sums concurrent peers correctly. A 10 ms
playout tick drains one frame from every active peer's NetEq into its
Player; each per-peer queue stays at ~10 ms depth.
Each peer's NetEq instance (from huddle::jitter):
* uses peer_index as a stable synthetic SSRC,
* has its full state dropped + recreated on 'left' or peer_index reuse
with a different pubkey (flush-on-rejoin),
* receives synthesized seq + 48 kHz RTP-style timestamps for now;
protocol v2 (next commit) replaces these with sender-authored values.
The receive task's body moves to a new huddle::playout module so
relay_api.rs stays under the desktop file-size budget. relay_api now owns
WS handshake + send-side encoding; playout owns the receive-side
jitter/mixer state machine. WsStream + REMOTE_SPEECH_THRESHOLD are exposed
pub(crate) for the split.
TTS interrupt counting moves to packet-arrival time (was decode time);
behaviorally identical (DTX silence frames never produced decoder output
anyway, so they weren't counted before either).
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Adds a fixed 8-byte header in front of every huddle audio Opus payload,
negotiated via 'protocol_version' in the WS auth message. The header
carries sender-authored sequence + 48 kHz RTP-style media timestamp +
audio level + flags, in network byte order:
seq: u16 // wrapping
ts_48k: u32 // sender 48 kHz RTP-style media time, +960 per 20 ms
level_dbov: i8 // RMS dBov, canonical range [-127, 0]
flags: u8 // bit 0 = DTX/comfort; other bits reserved/ignored
## Relay-side (crates/sprout-relay)
* AuthMsg.protocol_version defaults to 1 so legacy clients keep working.
* Rooms are pinned to the first joiner's version; later joiners must
match or receive an 'upgrade_required' error with pinned + requested
versions in the JSON. Pin lives in the existing AdmissionGuard mutex.
* AdmissionError replaces the boolean-ish Option<...> return of
Room::add_peer so callers can distinguish Full / Ended /
VersionMismatch and emit the right error code.
* CURRENT_PROTOCOL_VERSION = 2 rejects future requests up-front.
* Five new unit tests in audio::room::tests cover Max's review checklist:
first-pin, match, mismatch, ended, manager-cleanup-resets-pin, and
pin-persists-across-peer-churn.
## Desktop side
* New huddle::wire module owns the header struct, encode/parse,
audio_level_dbov, and 9 unit tests (round-trip, byte-order,
out-of-range clamping per Max's 'don't drop audio on bad VU',
reserved-flag preservation, silence/full-scale dBov, realistic-speech
dBov range).
* Encoder builds the header from per-encoder seq/ts counters and the
pre-encode RMS dBov. DTX/comfort packets (≤2 Opus bytes) set FLAG_DTX.
* Receiver parses the header out of the wire frame, hands the
sender-authored seq + ts_48k to NetEq's RtpHeader so the jitter buffer
can detect real packet reordering & loss (not just arrival jitter).
* Auth message now advertises protocol_version = 2.
## Wire-format compatibility
The relay forwards binary frames opaquely; it never strips or rewrites
the v2 header. Mixed v1/v2 rooms are not supported — the first joiner
pins the room. Existing v1 clients still work in v1-pinned rooms; new
v2 clients won't downgrade.
Per the team's review: relay never trusts level_dbov for trust
decisions (admission, moderation). It is parsed only for diagnostics.
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Adds a session-dismissable banner inside HuddleBar that warns users to use headphones until echo cancellation lands. Lives in a new tiny HeadphonesNotice component so it can be deleted in one diff when the AEC follow-up PR flips aecMissing to false. Why it's needed for this PR: the FIFO/jitter fix in commit 25b380f and the protocol v2 frame header in commit b6dd829 together make 10-person huddles work *for everyone wearing headphones*. The play path is still native rodio (outside the WebView render graph), so the browser's WebRTC echo canceller has nothing to cancel against. Two users on speakers in the same physical room would hear themselves echo. The banner makes that limitation visible up-front rather than letting users discover it when they're already in a 10-person meeting. Quinn's tripwire from the design thread: AEC v1 is the next PR after this one merges, not 'soon' or 'in the backlog'. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
…TX speaker Addresses three review items on commit 25b380f: 1. Silence-buffer leak (Mari + Quinn): the 10 ms playout tick was appending a SamplesBuffer for every peer every tick, even when NetEq was emitting pure Expand/silence. A peer that disconnected without a 'left' control message — or simply went idle — would have its rodio Player accumulate 100 silence buffers/sec indefinitely. Adds last_packet_at to PeerSlot, set on every successful insert_packet, and gates the append on is_active() (last_packet_at within IDLE_PEER_GRACE = 500ms). NetEq is still drained on idle peers to keep its internal clock advancing; we just don't enqueue the frame for playback. 2. MissedTickBehavior::Skip → Delay (Quinn): if the WS branch of the tokio::select is mid-handle (notably while ws_tx_for_pongs.lock() contends with the encode-side task on a Ping), a Skip-mode tick would drop entirely and all peer Players would underrun. Delay catches up the next tick immediately on return to the select, which is the right default for an audio playout clock. 3. Active-speaker DTX guard (Max): active_indices.insert(peer_idx) now only fires for non-DTX frames. DTX/comfort packets are emitted by an idle peer to keep the codec alive — they don't mean the peer is speaking and shouldn't flash the active-speaker UI. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Addresses Max's protocol-lane review: the relay was pinning the room
version on admission but the recv_loop still treated binary frames as
opaque bytes — no min-length check, no header sanity, no level_dbov
handling. That contradicted the PR claim that the relay 'parses for
metrics only' and meant a v2 client could send any garbage and the
relay would faithfully fan it out to other v2 peers.
Adds crates/sprout-relay/src/audio/wire.rs — a thin parse-only mirror
of the desktop client's huddle::wire. Same canonical layout:
seq: u16 BE
ts_48k: u32 BE
level_dbov: i8 (clamped to [-127, 0] on parse)
flags: u8 (bit 0 = DTX)
Plus four unit tests covering the layout invariant — network byte
order, short-input rejection, level clamp keeping the payload, and
reserved-flag passthrough.
recv_loop now takes the per-connection requested_version (we already
had it for room admission) and, when protocol_version >= 2:
* drops frames that don't carry header + at least one Opus byte,
* drops frames that fail FrameHeader::parse,
* logs the header at tracing::trace! for diagnostics,
* otherwise forwards the original bytes opaquely.
The relay never strips, rewrites, or re-encodes the header. The
level_dbov field stays untrusted — clamped on parse, never consumed
for any admission/moderation decision (the rule is enforced in code,
not just commentary).
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Addresses Quinn's review on commit ce34e5d: 1. **Banner placement misses pre-join.** HeadphonesNotice in HuddleBar only renders when phase == active/connected — i.e., the user is already in the huddle with their mic open by the time they see the warning. The whole point of the warning is to let them grab headphones before they're hot. Adds a pre-join AlertDialog (HeadphonesGate) wired into both the start and join click paths in HuddleIndicator. First click in the browser-tab session shows the dialog; the action only fires on Continue. Cancel/Escape/click-outside all decline silently. The 'confirmed this session' bit is stored in sessionStorage (recipe from the design thread: 'session-dismissable, self-removing once AEC lands'). useHeadphonesGate hook owns the decision/state; HeadphonesGate is a presentational component. Both delete in one diff when the AEC follow-up flips aecMissing to false. 2. **<output> should be <div role="status">.** Biome's useSemanticElements autofix steered me to <output> on the previous commit, but <output> is for form computation results — a status notice is the canonical use case for <div role="status">. Quinn was right. The current biome build accepts the role on a div without the explicit suppression I tried first, so it's a clean one-attribute change. ARIA-correct, no escape hatch needed. Keeps the in-call banner from ce34e5d in addition to the pre-join dialog, per Quinn's option (b): pre-join warns up-front, in-call banner is the persistent reminder for users who joined another way (autojoin, channel-level rejoin, etc). Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Mari's review on d3e2f55 asked for the gate to be 'last_packet_at grace AND !jitter.is_empty()' rather than grace-only. The grace-only check covers realistic cases today (max_delay=200ms is well inside the 500ms grace), but the predicate doesn't match the stated recipe and isn't robust against future tuning of NetEq's max_delay. Combines both halves with OR so: * recent packet → keep appending (covers normal speech gaps + DTX), * buffer not empty → keep appending (a burst-then-disconnect peer still has buffered audio to drain). Promotes PeerJitterBuffer::is_empty out of #[allow(dead_code)] now that it's exercised in the playout path, with a doc comment explaining what drives the call. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Two correctness improvements from Sami + Perci's prior-art review of PR #609. Both <40 LOC, no API changes. ## 1. Bound per-peer rodio Player queue depth The playout pipeline has two clocks: producer is a tokio 10 ms interval that drains NetEq and appends to each peer's Player; consumer is the cpal audio callback that pulls samples at hardware sample rate. NetEq's rate adaptation (accelerate / expand) only sees its own input pattern, not the device's actual consumption rate, and rodio::Player wraps an unbounded MPSC. Scheduler jitter or tiny clock skew between the two clocks would let the producer outpace the consumer; over a long call the queue would grow without bound, adding monotonic latency. Adds PLAYOUT_QUEUE_HIGH_WATER = 4 (4 × 10 ms = 40 ms, well inside NetEq's max_delay_ms = 200 ms) and calls Player::skip_one() before the append when the queue is at or above the threshold. Logs the drop. Worst-case latency is now bounded at 40 ms regardless of clock drift. The 'right' long-term fix is to drive playout from the audio render callback rather than a tokio interval, but that's a much larger change (and effectively part of the AEC v1 follow-up's WebAudio rewrite). This gets us the correctness property today. ## 2. Move soft-cap check under the admission lock, define error precedence Old shape: peers.len() >= MAX_PEERS_PER_ROOM check was outside the guard mutex, then the real gate (alloc) was inside. Two problems: - Multiple joiners could pass the outside check concurrently before any insert landed, so the cap was racey (bounded by the index space but still leaked the invariant). - A joiner who was both over-cap AND requesting the wrong version got VersionMismatch instead of Full, leaking the room's pinned protocol version to a client that couldn't have joined anyway. Both fixed by moving the cap check inside the lock and defining error precedence explicitly: Ended > Full > VersionMismatch. Single in-lock admission path is now the only source of truth. New test admit_full_wins_over_version_mismatch fills the room to capacity then admits a wrong-version peer and asserts the result is Full. Also fixes the pinned_version doc-comment: the pin clears when the manager evicts the Room (cleanup_if_empty), not when peers.is_empty(). The actual behavior was correct (test version_pin_persists_across_peer_churn already covered it); only the comment was misleading. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
This was referenced May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes huddle audio actually mix at 3+ simultaneous speakers, adds a proper jitter buffer, lands the protocol v2 frame header (parsed end-to-end), and warns users to use headphones until echo cancellation lands. Sized for the "huddle working for up to 10 people" ask.
Eight commits, each independently reviewable:
c35acfca2d6319neteqcrate dep +huddle::jitter(NetEq + custom Opus decoder impl)25b380frodio::Player+ per-peer NetEq, 10 ms playout tickb6dd829ce34e5dd3e2f55Delaytick, DTX speaker7b98d1c0fb094f<div role="status">notice (Quinn's review)The bug
desktop/src-tauri/src/huddle/relay_api.rs(old code) kept a singlerodio::Playeron the device mixer andplayer.append'd every peer's decoded 20 ms frame to it.rodio::Playeris a single-queue FIFO — it plays sources back-to-back, not mixed. With 2 peers it accidentally worked because Opus DTX guarantees only one is decoding at a time. With 3+ simultaneous speakers, decoded frames serialized — you heard one voice flipping speakers every 20 ms, and the queue grew without bound.The
desktop/src-tauri/tests/rodio_mixer_diagnostic.rsintegration test pins this diagnosis deterministically: a single queue serializes, a mixer sums, and a mixer fed by N queues (the new shape) sums concurrently. No audio device required.The fix (~1k LOC across desktop + relay)
Per-peer
rodio::Player+ per-peer NetEq with 48 kHz mono / min/max delay 40/200 ms / max packets 50 /peer_indexas the stable synthetic SSRC. State is dropped and recreated onleftor peer-index reuse with a different pubkey (flush-on-rejoin). A 10 ms playout tick (MissedTickBehavior::Delay) drains one frame from each active peer'sNetEqinto its own Player. Idle peers (no packet for 500 ms) are skipped on append so the Player queue can't accumulate silence indefinitely.Protocol v2 frame header — 8 fixed bytes, network byte order:
seq | ts_48k | level_dbov | flags(DTX flag set for ≤2-byte Opus comfort packets, reserved bits ignored on parse). Version negotiation in the WS auth message. Rooms pinned to first joiner; mismatched joiners getupgrade_required. Header parsed by the relay (crates/sprout-relay/src/audio/wire.rs) for sanity + tracing; bytes still forwarded opaquely.level_dbovis untrusted telemetry — clamped on parse, never consumed for trust decisions.Headphones-recommended UX while AEC is missing:
AlertDialog(one-time per browser-tab session, "Continue / Cancel").HuddleBar(session-dismissable).aecMissing = trueconstant that the AEC follow-up flips. The dialog + hook + notice are 3 small files designed for one-diff deletion.Tests (24 new)
tests/rodio_mixer_diagnostic.rs(FIFO vs mixer diagnosis)huddle::jitter(NetEq construction, insert+get_audio, empty-buffer contract)huddle::wire(round-trip, byte-order, short-input, payload-remainder, level clamp, reserved flag preservation, dBov silence/full-scale/realistic-speech)sprout-relay::audio::room(Max's full version-pinning matrix: first-pin, match, mismatch, ended, manager-cleanup-resets-pin, pin-persists-across-peer-churn)sprout-relay::audio::wire(network byte order, short input, level-clamp keeps frame, reserved flags preserved)All existing tests still green. Pre-push lefthook all-green: rust-fmt, clippy, all rust tests, desktop-build, web-build, mobile-test, desktop-tauri-check.
What's NOT in this PR (deferred, in priority order)
webrtc-audio-processingmacOS/Linux +sonoraWindows path) — separate spike after AEC v1 lands.Crate adds
neteq = { version = "0.8", default-features = false }— pure-Rust libWebRTC NetEq port (security-union/videocall-rs). MIT/Apache-2.0. Disabling default features avoids pulling in cpal/axum/clap; we implementAudioDecoderdirectly over our existingopus = "0.3".Reviewers
huddle::wire,audio::wire),audio::roomversion-pinning matrix, opaque relay forwarding invariant.DCO / signing
All commits are signed with Tyler's SSH key and carry
Signed-off-by: tlongwell-block <…>for DCO.🤖 Authored by Dawn — code review by Max, Mari, and Quinn in the design thread.