feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers) by tlongwell-block · Pull Request #609 · block/sprout

tlongwell-block · 2026-05-18T19:21:57Z

Summary

Makes huddle audio actually mix at 3+ simultaneous speakers, adds a proper jitter buffer, lands the protocol v2 frame header (parsed end-to-end), and warns users to use headphones until echo cancellation lands. Sized for the "huddle working for up to 10 people" ask.

Next PR after this merges: feat(huddle): WebAudio AEC v1 — spec is the consolidated one in the design thread (shared AudioContext, per-peer AudioWorkletNode with MessagePort PCM, Channel<InvokeResponseBody>::Raw from Rust at 100 batched sends/sec, 0-gain mic destination leg). When it lands, the aecMissing = true constants in HuddleBar and HuddleIndicator flip to false and the pre-join gate + in-call banner + HeadphonesGate.tsx + HeadphonesNotice.tsx + useHeadphonesGate.ts are mechanically deletable in one diff.

Eight commits, each independently reviewable:

Commit	Scope
`c35acfc`	Deterministic regression test pinning the rodio FIFO vs mixer behavior
`a2d6319`	`neteq` crate dep + `huddle::jitter` (NetEq + custom Opus decoder impl)
`25b380f`	The actual fix: per-peer `rodio::Player` + per-peer NetEq, 10 ms playout tick
`b6dd829`	Protocol v2 — 8-byte sender-authored frame header + relay-side version pinning
`ce34e5d`	"Headphones recommended" banner shown while AEC is missing
`d3e2f55`	Review feedback on playout: idle-peer guard, `Delay` tick, DTX speaker
`7b98d1c`	Relay-side v2 parse helper + tests (Max's review)
`0fb094f`	Pre-join headphones gate + `<div role="status">` notice (Quinn's review)

The bug

desktop/src-tauri/src/huddle/relay_api.rs (old code) kept a single rodio::Player on the device mixer and player.append'd every peer's decoded 20 ms frame to it. rodio::Player is a single-queue FIFO — it plays sources back-to-back, not mixed. With 2 peers it accidentally worked because Opus DTX guarantees only one is decoding at a time. With 3+ simultaneous speakers, decoded frames serialized — you heard one voice flipping speakers every 20 ms, and the queue grew without bound.

The desktop/src-tauri/tests/rodio_mixer_diagnostic.rs integration test pins this diagnosis deterministically: a single queue serializes, a mixer sums, and a mixer fed by N queues (the new shape) sums concurrently. No audio device required.

The fix (~1k LOC across desktop + relay)

Per-peer rodio::Player + per-peer NetEq with 48 kHz mono / min/max delay 40/200 ms / max packets 50 / peer_index as the stable synthetic SSRC. State is dropped and recreated on left or peer-index reuse with a different pubkey (flush-on-rejoin). A 10 ms playout tick (MissedTickBehavior::Delay) drains one frame from each active peer's NetEq into its own Player. Idle peers (no packet for 500 ms) are skipped on append so the Player queue can't accumulate silence indefinitely.
Protocol v2 frame header — 8 fixed bytes, network byte order: seq | ts_48k | level_dbov | flags (DTX flag set for ≤2-byte Opus comfort packets, reserved bits ignored on parse). Version negotiation in the WS auth message. Rooms pinned to first joiner; mismatched joiners get upgrade_required. Header parsed by the relay (crates/sprout-relay/src/audio/wire.rs) for sanity + tracing; bytes still forwarded opaquely. level_dbov is untrusted telemetry — clamped on parse, never consumed for trust decisions.
Headphones-recommended UX while AEC is missing:
- Pre-join AlertDialog (one-time per browser-tab session, "Continue / Cancel").
- Persistent in-call banner inside HuddleBar (session-dismissable).
- Both gated on a single aecMissing = true constant that the AEC follow-up flips. The dialog + hook + notice are 3 small files designed for one-diff deletion.

Tests (24 new)

3 integration tests in tests/rodio_mixer_diagnostic.rs (FIFO vs mixer diagnosis)
3 unit tests in huddle::jitter (NetEq construction, insert+get_audio, empty-buffer contract)
9 unit tests in huddle::wire (round-trip, byte-order, short-input, payload-remainder, level clamp, reserved flag preservation, dBov silence/full-scale/realistic-speech)
5 unit tests in sprout-relay::audio::room (Max's full version-pinning matrix: first-pin, match, mismatch, ended, manager-cleanup-resets-pin, pin-persists-across-peer-churn)
4 unit tests in sprout-relay::audio::wire (network byte order, short input, level-clamp keeps frame, reserved flags preserved)

All existing tests still green. Pre-push lefthook all-green: rust-fmt, clippy, all rust tests, desktop-build, web-build, mobile-test, desktop-tauri-check.

What's NOT in this PR (deferred, in priority order)

WebAudio AEC v1 — the next PR, spec locked in the design thread.
Native AEC (webrtc-audio-processing macOS/Linux + sonora Windows path) — separate spike after AEC v1 lands.
Server-side AudioBridge / MCU — only if/when we need recording, transcription, mobile decode×11 savings, or rooms > 12. Protocol v2 preserves the option with zero protocol break.
Adaptive packet-loss feedback / encoder FEC — explicitly deferred because we have no loss measurement showing 10-person rooms are loss-limited today.

Crate adds

neteq = { version = "0.8", default-features = false } — pure-Rust libWebRTC NetEq port (security-union/videocall-rs). MIT/Apache-2.0. Disabling default features avoids pulling in cpal/axum/clap; we implement AudioDecoder directly over our existing opus = "0.3".

Reviewers

@max — protocol v2 (huddle::wire, audio::wire), audio::room version-pinning matrix, opaque relay forwarding invariant.
@mari — NetEq config + per-peer Player playout shape, idle-peer guard.
@quinn — pre-join headphones gate, in-call banner, AEC v1 follow-up commitment (above).

DCO / signing

All commits are signed with Tyler's SSH key and carry Signed-off-by: tlongwell-block <…> for DCO.

🤖 Authored by Dawn — code review by Max, Mari, and Quinn in the design thread.

…rness Three integration tests in desktop/src-tauri/tests/rodio_mixer_diagnostic.rs lock in the diagnosis we landed on for huddle's 3+ speaker failure: 1. single_queue_serializes_sources: a rodio::queue plays sources back-to-back. This is the shape rodio::Player::connect_new produces, and the reason every peer's decoded 20 ms frame serialized when appended to one Player. 2. mixer_sums_overlapping_sources: rodio::mixer sums concurrent sources. 3. mixer_of_queues_mixes_per_peer_streams: a mixer fed by N independent queues (one per peer) sums them — the shape the upcoming huddle fix moves to. No audio device is touched; sources are SamplesBuffer<f32>, drained directly. Deterministic on CI. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Introduces desktop/src-tauri/src/huddle/jitter.rs containing: * OpusFrameDecoder — implements neteq::codec::AudioDecoder over our existing opus = "0.3" dependency, so we don't have to enable neteq's 'native' feature (which would pull in cpal/axum/clap). * PeerJitterBuffer — wraps neteq::NetEq configured for 48 kHz mono with min_delay=40ms / max_delay=200ms / max_packets=50. One per remote peer. Dep added: * neteq = { version = "0.8", default-features = false } License is MIT/Apache-2.0. Default features include 'native' which drags in cpal/axum/clap/tokio; we skip them and only get the NetEq state machine + the AudioDecoder trait. Three unit tests cover the API surface. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

…nt playout Rewrites the WS receive path to fix the FIFO bug pinned by tests/rodio_mixer_diagnostic.rs: * OLD shape: one rodio::Player on the device mixer, every peer's decoded 20 ms frame appended to it. rodio::Player is a single-queue FIFO; 3+ simultaneous speakers serialized — peer A's frame, then peer B's, then peer C's — with audible voice-flipping every 20 ms and unbounded queue growth. * NEW shape: each peer gets its own rodio::Player (queue) added to the same device mixer. The device mixer sums concurrent peers correctly. A 10 ms playout tick drains one frame from every active peer's NetEq into its Player; each per-peer queue stays at ~10 ms depth. Each peer's NetEq instance (from huddle::jitter): * uses peer_index as a stable synthetic SSRC, * has its full state dropped + recreated on 'left' or peer_index reuse with a different pubkey (flush-on-rejoin), * receives synthesized seq + 48 kHz RTP-style timestamps for now; protocol v2 (next commit) replaces these with sender-authored values. The receive task's body moves to a new huddle::playout module so relay_api.rs stays under the desktop file-size budget. relay_api now owns WS handshake + send-side encoding; playout owns the receive-side jitter/mixer state machine. WsStream + REMOTE_SPEECH_THRESHOLD are exposed pub(crate) for the split. TTS interrupt counting moves to packet-arrival time (was decode time); behaviorally identical (DTX silence frames never produced decoder output anyway, so they weren't counted before either). Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Adds a fixed 8-byte header in front of every huddle audio Opus payload, negotiated via 'protocol_version' in the WS auth message. The header carries sender-authored sequence + 48 kHz RTP-style media timestamp + audio level + flags, in network byte order: seq: u16 // wrapping ts_48k: u32 // sender 48 kHz RTP-style media time, +960 per 20 ms level_dbov: i8 // RMS dBov, canonical range [-127, 0] flags: u8 // bit 0 = DTX/comfort; other bits reserved/ignored ## Relay-side (crates/sprout-relay) * AuthMsg.protocol_version defaults to 1 so legacy clients keep working. * Rooms are pinned to the first joiner's version; later joiners must match or receive an 'upgrade_required' error with pinned + requested versions in the JSON. Pin lives in the existing AdmissionGuard mutex. * AdmissionError replaces the boolean-ish Option<...> return of Room::add_peer so callers can distinguish Full / Ended / VersionMismatch and emit the right error code. * CURRENT_PROTOCOL_VERSION = 2 rejects future requests up-front. * Five new unit tests in audio::room::tests cover Max's review checklist: first-pin, match, mismatch, ended, manager-cleanup-resets-pin, and pin-persists-across-peer-churn. ## Desktop side * New huddle::wire module owns the header struct, encode/parse, audio_level_dbov, and 9 unit tests (round-trip, byte-order, out-of-range clamping per Max's 'don't drop audio on bad VU', reserved-flag preservation, silence/full-scale dBov, realistic-speech dBov range). * Encoder builds the header from per-encoder seq/ts counters and the pre-encode RMS dBov. DTX/comfort packets (≤2 Opus bytes) set FLAG_DTX. * Receiver parses the header out of the wire frame, hands the sender-authored seq + ts_48k to NetEq's RtpHeader so the jitter buffer can detect real packet reordering & loss (not just arrival jitter). * Auth message now advertises protocol_version = 2. ## Wire-format compatibility The relay forwards binary frames opaquely; it never strips or rewrites the v2 header. Mixed v1/v2 rooms are not supported — the first joiner pins the room. Existing v1 clients still work in v1-pinned rooms; new v2 clients won't downgrade. Per the team's review: relay never trusts level_dbov for trust decisions (admission, moderation). It is parsed only for diagnostics. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Adds a session-dismissable banner inside HuddleBar that warns users to use headphones until echo cancellation lands. Lives in a new tiny HeadphonesNotice component so it can be deleted in one diff when the AEC follow-up PR flips aecMissing to false. Why it's needed for this PR: the FIFO/jitter fix in commit 25b380f and the protocol v2 frame header in commit b6dd829 together make 10-person huddles work *for everyone wearing headphones*. The play path is still native rodio (outside the WebView render graph), so the browser's WebRTC echo canceller has nothing to cancel against. Two users on speakers in the same physical room would hear themselves echo. The banner makes that limitation visible up-front rather than letting users discover it when they're already in a 10-person meeting. Quinn's tripwire from the design thread: AEC v1 is the next PR after this one merges, not 'soon' or 'in the backlog'. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

…TX speaker Addresses three review items on commit 25b380f: 1. Silence-buffer leak (Mari + Quinn): the 10 ms playout tick was appending a SamplesBuffer for every peer every tick, even when NetEq was emitting pure Expand/silence. A peer that disconnected without a 'left' control message — or simply went idle — would have its rodio Player accumulate 100 silence buffers/sec indefinitely. Adds last_packet_at to PeerSlot, set on every successful insert_packet, and gates the append on is_active() (last_packet_at within IDLE_PEER_GRACE = 500ms). NetEq is still drained on idle peers to keep its internal clock advancing; we just don't enqueue the frame for playback. 2. MissedTickBehavior::Skip → Delay (Quinn): if the WS branch of the tokio::select is mid-handle (notably while ws_tx_for_pongs.lock() contends with the encode-side task on a Ping), a Skip-mode tick would drop entirely and all peer Players would underrun. Delay catches up the next tick immediately on return to the select, which is the right default for an audio playout clock. 3. Active-speaker DTX guard (Max): active_indices.insert(peer_idx) now only fires for non-DTX frames. DTX/comfort packets are emitted by an idle peer to keep the codec alive — they don't mean the peer is speaking and shouldn't flash the active-speaker UI. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Addresses Max's protocol-lane review: the relay was pinning the room version on admission but the recv_loop still treated binary frames as opaque bytes — no min-length check, no header sanity, no level_dbov handling. That contradicted the PR claim that the relay 'parses for metrics only' and meant a v2 client could send any garbage and the relay would faithfully fan it out to other v2 peers. Adds crates/sprout-relay/src/audio/wire.rs — a thin parse-only mirror of the desktop client's huddle::wire. Same canonical layout: seq: u16 BE ts_48k: u32 BE level_dbov: i8 (clamped to [-127, 0] on parse) flags: u8 (bit 0 = DTX) Plus four unit tests covering the layout invariant — network byte order, short-input rejection, level clamp keeping the payload, and reserved-flag passthrough. recv_loop now takes the per-connection requested_version (we already had it for room admission) and, when protocol_version >= 2: * drops frames that don't carry header + at least one Opus byte, * drops frames that fail FrameHeader::parse, * logs the header at tracing::trace! for diagnostics, * otherwise forwards the original bytes opaquely. The relay never strips, rewrites, or re-encodes the header. The level_dbov field stays untrusted — clamped on parse, never consumed for any admission/moderation decision (the rule is enforced in code, not just commentary). Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Addresses Quinn's review on commit ce34e5d: 1. **Banner placement misses pre-join.** HeadphonesNotice in HuddleBar only renders when phase == active/connected — i.e., the user is already in the huddle with their mic open by the time they see the warning. The whole point of the warning is to let them grab headphones before they're hot. Adds a pre-join AlertDialog (HeadphonesGate) wired into both the start and join click paths in HuddleIndicator. First click in the browser-tab session shows the dialog; the action only fires on Continue. Cancel/Escape/click-outside all decline silently. The 'confirmed this session' bit is stored in sessionStorage (recipe from the design thread: 'session-dismissable, self-removing once AEC lands'). useHeadphonesGate hook owns the decision/state; HeadphonesGate is a presentational component. Both delete in one diff when the AEC follow-up flips aecMissing to false. 2. **<output> should be <div role="status">.** Biome's useSemanticElements autofix steered me to <output> on the previous commit, but <output> is for form computation results — a status notice is the canonical use case for <div role="status">. Quinn was right. The current biome build accepts the role on a div without the explicit suppression I tried first, so it's a clean one-attribute change. ARIA-correct, no escape hatch needed. Keeps the in-call banner from ce34e5d in addition to the pre-join dialog, per Quinn's option (b): pre-join warns up-front, in-call banner is the persistent reminder for users who joined another way (autojoin, channel-level rejoin, etc). Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Mari's review on d3e2f55 asked for the gate to be 'last_packet_at grace AND !jitter.is_empty()' rather than grace-only. The grace-only check covers realistic cases today (max_delay=200ms is well inside the 500ms grace), but the predicate doesn't match the stated recipe and isn't robust against future tuning of NetEq's max_delay. Combines both halves with OR so: * recent packet → keep appending (covers normal speech gaps + DTX), * buffer not empty → keep appending (a burst-then-disconnect peer still has buffered audio to drain). Promotes PeerJitterBuffer::is_empty out of #[allow(dead_code)] now that it's exercised in the playout path, with a doc comment explaining what drives the call. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

Two correctness improvements from Sami + Perci's prior-art review of PR #609. Both <40 LOC, no API changes. ## 1. Bound per-peer rodio Player queue depth The playout pipeline has two clocks: producer is a tokio 10 ms interval that drains NetEq and appends to each peer's Player; consumer is the cpal audio callback that pulls samples at hardware sample rate. NetEq's rate adaptation (accelerate / expand) only sees its own input pattern, not the device's actual consumption rate, and rodio::Player wraps an unbounded MPSC. Scheduler jitter or tiny clock skew between the two clocks would let the producer outpace the consumer; over a long call the queue would grow without bound, adding monotonic latency. Adds PLAYOUT_QUEUE_HIGH_WATER = 4 (4 × 10 ms = 40 ms, well inside NetEq's max_delay_ms = 200 ms) and calls Player::skip_one() before the append when the queue is at or above the threshold. Logs the drop. Worst-case latency is now bounded at 40 ms regardless of clock drift. The 'right' long-term fix is to drive playout from the audio render callback rather than a tokio interval, but that's a much larger change (and effectively part of the AEC v1 follow-up's WebAudio rewrite). This gets us the correctness property today. ## 2. Move soft-cap check under the admission lock, define error precedence Old shape: peers.len() >= MAX_PEERS_PER_ROOM check was outside the guard mutex, then the real gate (alloc) was inside. Two problems: - Multiple joiners could pass the outside check concurrently before any insert landed, so the cap was racey (bounded by the index space but still leaked the invariant). - A joiner who was both over-cap AND requesting the wrong version got VersionMismatch instead of Full, leaking the room's pinned protocol version to a client that couldn't have joined anyway. Both fixed by moving the cap check inside the lock and defining error precedence explicitly: Ended > Full > VersionMismatch. Single in-lock admission path is now the only source of truth. New test admit_full_wins_over_version_mismatch fills the room to capacity then admits a wrong-version peer and asserts the result is Full. Also fixes the pinned_version doc-comment: the pin clears when the manager evicts the Room (cleanup_if_empty), not when peers.is_empty(). The actual behavior was correct (test version_pin_persists_across_peer_churn already covered it); only the comment was misleading. Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>

tlongwell-block added 5 commits May 18, 2026 14:50

tlongwell-block requested a review from wesbillman as a code owner May 18, 2026 19:21

tlongwell-block added 5 commits May 18, 2026 15:39

tlongwell-block merged commit 6e1418e into main May 18, 2026
14 of 15 checks passed

tlongwell-block deleted the dawn/huddle-up-to-10 branch May 18, 2026 20:34

This was referenced May 28, 2026

chore(release): release version 0.3.1 #769

Merged

chore(release): release version 0.3.2 #774

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers)#609

feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers)#609
tlongwell-block merged 10 commits into
mainfrom
dawn/huddle-up-to-10

tlongwell-block commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tlongwell-block commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug

The fix (~1k LOC across desktop + relay)

Tests (24 new)

What's NOT in this PR (deferred, in priority order)

Crate adds

Reviewers

DCO / signing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tlongwell-block commented May 18, 2026 •

edited

Loading