Skip to content

feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers)#609

Merged
tlongwell-block merged 10 commits into
mainfrom
dawn/huddle-up-to-10
May 18, 2026
Merged

feat(huddle): fix concurrent-speaker mixing + jitter buffer + protocol v2 (up to 10 peers)#609
tlongwell-block merged 10 commits into
mainfrom
dawn/huddle-up-to-10

Conversation

@tlongwell-block
Copy link
Copy Markdown
Collaborator

@tlongwell-block tlongwell-block commented May 18, 2026

Summary

Makes huddle audio actually mix at 3+ simultaneous speakers, adds a proper jitter buffer, lands the protocol v2 frame header (parsed end-to-end), and warns users to use headphones until echo cancellation lands. Sized for the "huddle working for up to 10 people" ask.

Next PR after this merges: feat(huddle): WebAudio AEC v1 — spec is the consolidated one in the design thread (shared AudioContext, per-peer AudioWorkletNode with MessagePort PCM, Channel<InvokeResponseBody>::Raw from Rust at 100 batched sends/sec, 0-gain mic destination leg). When it lands, the aecMissing = true constants in HuddleBar and HuddleIndicator flip to false and the pre-join gate + in-call banner + HeadphonesGate.tsx + HeadphonesNotice.tsx + useHeadphonesGate.ts are mechanically deletable in one diff.

Eight commits, each independently reviewable:

Commit Scope
c35acfc Deterministic regression test pinning the rodio FIFO vs mixer behavior
a2d6319 neteq crate dep + huddle::jitter (NetEq + custom Opus decoder impl)
25b380f The actual fix: per-peer rodio::Player + per-peer NetEq, 10 ms playout tick
b6dd829 Protocol v2 — 8-byte sender-authored frame header + relay-side version pinning
ce34e5d "Headphones recommended" banner shown while AEC is missing
d3e2f55 Review feedback on playout: idle-peer guard, Delay tick, DTX speaker
7b98d1c Relay-side v2 parse helper + tests (Max's review)
0fb094f Pre-join headphones gate + <div role="status"> notice (Quinn's review)

The bug

desktop/src-tauri/src/huddle/relay_api.rs (old code) kept a single rodio::Player on the device mixer and player.append'd every peer's decoded 20 ms frame to it. rodio::Player is a single-queue FIFO — it plays sources back-to-back, not mixed. With 2 peers it accidentally worked because Opus DTX guarantees only one is decoding at a time. With 3+ simultaneous speakers, decoded frames serialized — you heard one voice flipping speakers every 20 ms, and the queue grew without bound.

The desktop/src-tauri/tests/rodio_mixer_diagnostic.rs integration test pins this diagnosis deterministically: a single queue serializes, a mixer sums, and a mixer fed by N queues (the new shape) sums concurrently. No audio device required.

The fix (~1k LOC across desktop + relay)

  1. Per-peer rodio::Player + per-peer NetEq with 48 kHz mono / min/max delay 40/200 ms / max packets 50 / peer_index as the stable synthetic SSRC. State is dropped and recreated on left or peer-index reuse with a different pubkey (flush-on-rejoin). A 10 ms playout tick (MissedTickBehavior::Delay) drains one frame from each active peer's NetEq into its own Player. Idle peers (no packet for 500 ms) are skipped on append so the Player queue can't accumulate silence indefinitely.

  2. Protocol v2 frame header — 8 fixed bytes, network byte order: seq | ts_48k | level_dbov | flags (DTX flag set for ≤2-byte Opus comfort packets, reserved bits ignored on parse). Version negotiation in the WS auth message. Rooms pinned to first joiner; mismatched joiners get upgrade_required. Header parsed by the relay (crates/sprout-relay/src/audio/wire.rs) for sanity + tracing; bytes still forwarded opaquely. level_dbov is untrusted telemetry — clamped on parse, never consumed for trust decisions.

  3. Headphones-recommended UX while AEC is missing:

    • Pre-join AlertDialog (one-time per browser-tab session, "Continue / Cancel").
    • Persistent in-call banner inside HuddleBar (session-dismissable).
    • Both gated on a single aecMissing = true constant that the AEC follow-up flips. The dialog + hook + notice are 3 small files designed for one-diff deletion.

Tests (24 new)

  • 3 integration tests in tests/rodio_mixer_diagnostic.rs (FIFO vs mixer diagnosis)
  • 3 unit tests in huddle::jitter (NetEq construction, insert+get_audio, empty-buffer contract)
  • 9 unit tests in huddle::wire (round-trip, byte-order, short-input, payload-remainder, level clamp, reserved flag preservation, dBov silence/full-scale/realistic-speech)
  • 5 unit tests in sprout-relay::audio::room (Max's full version-pinning matrix: first-pin, match, mismatch, ended, manager-cleanup-resets-pin, pin-persists-across-peer-churn)
  • 4 unit tests in sprout-relay::audio::wire (network byte order, short input, level-clamp keeps frame, reserved flags preserved)

All existing tests still green. Pre-push lefthook all-green: rust-fmt, clippy, all rust tests, desktop-build, web-build, mobile-test, desktop-tauri-check.

What's NOT in this PR (deferred, in priority order)

  • WebAudio AEC v1 — the next PR, spec locked in the design thread.
  • Native AEC (webrtc-audio-processing macOS/Linux + sonora Windows path) — separate spike after AEC v1 lands.
  • Server-side AudioBridge / MCU — only if/when we need recording, transcription, mobile decode×11 savings, or rooms > 12. Protocol v2 preserves the option with zero protocol break.
  • Adaptive packet-loss feedback / encoder FEC — explicitly deferred because we have no loss measurement showing 10-person rooms are loss-limited today.

Crate adds

  • neteq = { version = "0.8", default-features = false } — pure-Rust libWebRTC NetEq port (security-union/videocall-rs). MIT/Apache-2.0. Disabling default features avoids pulling in cpal/axum/clap; we implement AudioDecoder directly over our existing opus = "0.3".

Reviewers

  • @max — protocol v2 (huddle::wire, audio::wire), audio::room version-pinning matrix, opaque relay forwarding invariant.
  • @mari — NetEq config + per-peer Player playout shape, idle-peer guard.
  • @quinn — pre-join headphones gate, in-call banner, AEC v1 follow-up commitment (above).

DCO / signing

All commits are signed with Tyler's SSH key and carry Signed-off-by: tlongwell-block <…> for DCO.

🤖 Authored by Dawn — code review by Max, Mari, and Quinn in the design thread.

…rness

Three integration tests in desktop/src-tauri/tests/rodio_mixer_diagnostic.rs lock
in the diagnosis we landed on for huddle's 3+ speaker failure:

1. single_queue_serializes_sources: a rodio::queue plays sources back-to-back.
   This is the shape rodio::Player::connect_new produces, and the reason every
   peer's decoded 20 ms frame serialized when appended to one Player.
2. mixer_sums_overlapping_sources: rodio::mixer sums concurrent sources.
3. mixer_of_queues_mixes_per_peer_streams: a mixer fed by N independent queues
   (one per peer) sums them — the shape the upcoming huddle fix moves to.

No audio device is touched; sources are SamplesBuffer<f32>, drained directly.
Deterministic on CI.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Introduces desktop/src-tauri/src/huddle/jitter.rs containing:

* OpusFrameDecoder — implements neteq::codec::AudioDecoder over our existing
  opus = "0.3" dependency, so we don't have to enable neteq's 'native' feature
  (which would pull in cpal/axum/clap).
* PeerJitterBuffer — wraps neteq::NetEq configured for 48 kHz mono with
  min_delay=40ms / max_delay=200ms / max_packets=50. One per remote peer.

Dep added:
* neteq = { version = "0.8", default-features = false }
  License is MIT/Apache-2.0. Default features include 'native' which drags
  in cpal/axum/clap/tokio; we skip them and only get the NetEq state
  machine + the AudioDecoder trait.

Three unit tests cover the API surface.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
…nt playout

Rewrites the WS receive path to fix the FIFO bug pinned by
tests/rodio_mixer_diagnostic.rs:

* OLD shape: one rodio::Player on the device mixer, every peer's decoded
  20 ms frame appended to it. rodio::Player is a single-queue FIFO; 3+
  simultaneous speakers serialized — peer A's frame, then peer B's, then
  peer C's — with audible voice-flipping every 20 ms and unbounded queue
  growth.
* NEW shape: each peer gets its own rodio::Player (queue) added to the same
  device mixer. The device mixer sums concurrent peers correctly. A 10 ms
  playout tick drains one frame from every active peer's NetEq into its
  Player; each per-peer queue stays at ~10 ms depth.

Each peer's NetEq instance (from huddle::jitter):
  * uses peer_index as a stable synthetic SSRC,
  * has its full state dropped + recreated on 'left' or peer_index reuse
    with a different pubkey (flush-on-rejoin),
  * receives synthesized seq + 48 kHz RTP-style timestamps for now;
    protocol v2 (next commit) replaces these with sender-authored values.

The receive task's body moves to a new huddle::playout module so
relay_api.rs stays under the desktop file-size budget. relay_api now owns
WS handshake + send-side encoding; playout owns the receive-side
jitter/mixer state machine. WsStream + REMOTE_SPEECH_THRESHOLD are exposed
pub(crate) for the split.

TTS interrupt counting moves to packet-arrival time (was decode time);
behaviorally identical (DTX silence frames never produced decoder output
anyway, so they weren't counted before either).

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Adds a fixed 8-byte header in front of every huddle audio Opus payload,
negotiated via 'protocol_version' in the WS auth message. The header
carries sender-authored sequence + 48 kHz RTP-style media timestamp +
audio level + flags, in network byte order:

    seq:        u16  // wrapping
    ts_48k:     u32  // sender 48 kHz RTP-style media time, +960 per 20 ms
    level_dbov: i8   // RMS dBov, canonical range [-127, 0]
    flags:      u8   // bit 0 = DTX/comfort; other bits reserved/ignored

## Relay-side (crates/sprout-relay)

* AuthMsg.protocol_version defaults to 1 so legacy clients keep working.
* Rooms are pinned to the first joiner's version; later joiners must
  match or receive an 'upgrade_required' error with pinned + requested
  versions in the JSON. Pin lives in the existing AdmissionGuard mutex.
* AdmissionError replaces the boolean-ish Option<...> return of
  Room::add_peer so callers can distinguish Full / Ended /
  VersionMismatch and emit the right error code.
* CURRENT_PROTOCOL_VERSION = 2 rejects future requests up-front.
* Five new unit tests in audio::room::tests cover Max's review checklist:
  first-pin, match, mismatch, ended, manager-cleanup-resets-pin, and
  pin-persists-across-peer-churn.

## Desktop side

* New huddle::wire module owns the header struct, encode/parse,
  audio_level_dbov, and 9 unit tests (round-trip, byte-order,
  out-of-range clamping per Max's 'don't drop audio on bad VU',
  reserved-flag preservation, silence/full-scale dBov, realistic-speech
  dBov range).
* Encoder builds the header from per-encoder seq/ts counters and the
  pre-encode RMS dBov. DTX/comfort packets (≤2 Opus bytes) set FLAG_DTX.
* Receiver parses the header out of the wire frame, hands the
  sender-authored seq + ts_48k to NetEq's RtpHeader so the jitter buffer
  can detect real packet reordering & loss (not just arrival jitter).
* Auth message now advertises protocol_version = 2.

## Wire-format compatibility

The relay forwards binary frames opaquely; it never strips or rewrites
the v2 header. Mixed v1/v2 rooms are not supported — the first joiner
pins the room. Existing v1 clients still work in v1-pinned rooms; new
v2 clients won't downgrade.

Per the team's review: relay never trusts level_dbov for trust
decisions (admission, moderation). It is parsed only for diagnostics.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Adds a session-dismissable banner inside HuddleBar that warns users to use
headphones until echo cancellation lands. Lives in a new tiny
HeadphonesNotice component so it can be deleted in one diff when the AEC
follow-up PR flips aecMissing to false.

Why it's needed for this PR: the FIFO/jitter fix in commit 25b380f and
the protocol v2 frame header in commit b6dd829 together make 10-person
huddles work *for everyone wearing headphones*. The play path is still
native rodio (outside the WebView render graph), so the browser's WebRTC
echo canceller has nothing to cancel against. Two users on speakers in
the same physical room would hear themselves echo. The banner makes
that limitation visible up-front rather than letting users discover it
when they're already in a 10-person meeting.

Quinn's tripwire from the design thread: AEC v1 is the next PR after
this one merges, not 'soon' or 'in the backlog'.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
…TX speaker

Addresses three review items on commit 25b380f:

1. Silence-buffer leak (Mari + Quinn): the 10 ms playout tick was
   appending a SamplesBuffer for every peer every tick, even when NetEq
   was emitting pure Expand/silence. A peer that disconnected without a
   'left' control message — or simply went idle — would have its rodio
   Player accumulate 100 silence buffers/sec indefinitely. Adds
   last_packet_at to PeerSlot, set on every successful insert_packet,
   and gates the append on is_active() (last_packet_at within
   IDLE_PEER_GRACE = 500ms). NetEq is still drained on idle peers to
   keep its internal clock advancing; we just don't enqueue the frame
   for playback.

2. MissedTickBehavior::Skip → Delay (Quinn): if the WS branch of the
   tokio::select is mid-handle (notably while ws_tx_for_pongs.lock()
   contends with the encode-side task on a Ping), a Skip-mode tick would
   drop entirely and all peer Players would underrun. Delay catches up
   the next tick immediately on return to the select, which is the
   right default for an audio playout clock.

3. Active-speaker DTX guard (Max): active_indices.insert(peer_idx) now
   only fires for non-DTX frames. DTX/comfort packets are emitted by an
   idle peer to keep the codec alive — they don't mean the peer is
   speaking and shouldn't flash the active-speaker UI.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Addresses Max's protocol-lane review: the relay was pinning the room
version on admission but the recv_loop still treated binary frames as
opaque bytes — no min-length check, no header sanity, no level_dbov
handling. That contradicted the PR claim that the relay 'parses for
metrics only' and meant a v2 client could send any garbage and the
relay would faithfully fan it out to other v2 peers.

Adds crates/sprout-relay/src/audio/wire.rs — a thin parse-only mirror
of the desktop client's huddle::wire. Same canonical layout:

    seq:        u16 BE
    ts_48k:     u32 BE
    level_dbov: i8  (clamped to [-127, 0] on parse)
    flags:      u8  (bit 0 = DTX)

Plus four unit tests covering the layout invariant — network byte
order, short-input rejection, level clamp keeping the payload, and
reserved-flag passthrough.

recv_loop now takes the per-connection requested_version (we already
had it for room admission) and, when protocol_version >= 2:

* drops frames that don't carry header + at least one Opus byte,
* drops frames that fail FrameHeader::parse,
* logs the header at tracing::trace! for diagnostics,
* otherwise forwards the original bytes opaquely.

The relay never strips, rewrites, or re-encodes the header. The
level_dbov field stays untrusted — clamped on parse, never consumed
for any admission/moderation decision (the rule is enforced in code,
not just commentary).

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Addresses Quinn's review on commit ce34e5d:

1. **Banner placement misses pre-join.** HeadphonesNotice in HuddleBar
   only renders when phase == active/connected — i.e., the user is
   already in the huddle with their mic open by the time they see the
   warning. The whole point of the warning is to let them grab
   headphones before they're hot.

   Adds a pre-join AlertDialog (HeadphonesGate) wired into both the
   start and join click paths in HuddleIndicator. First click in the
   browser-tab session shows the dialog; the action only fires on
   Continue. Cancel/Escape/click-outside all decline silently. The
   'confirmed this session' bit is stored in sessionStorage (recipe
   from the design thread: 'session-dismissable, self-removing once
   AEC lands').

   useHeadphonesGate hook owns the decision/state; HeadphonesGate
   is a presentational component. Both delete in one diff when the
   AEC follow-up flips aecMissing to false.

2. **<output> should be <div role="status">.** Biome's
   useSemanticElements autofix steered me to <output> on the previous
   commit, but <output> is for form computation results — a status
   notice is the canonical use case for <div role="status">. Quinn
   was right. The current biome build accepts the role on a div
   without the explicit suppression I tried first, so it's a clean
   one-attribute change. ARIA-correct, no escape hatch needed.

Keeps the in-call banner from ce34e5d in addition to the pre-join
dialog, per Quinn's option (b): pre-join warns up-front, in-call
banner is the persistent reminder for users who joined another way
(autojoin, channel-level rejoin, etc).

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Mari's review on d3e2f55 asked for the gate to be 'last_packet_at grace
AND !jitter.is_empty()' rather than grace-only. The grace-only check
covers realistic cases today (max_delay=200ms is well inside the 500ms
grace), but the predicate doesn't match the stated recipe and isn't
robust against future tuning of NetEq's max_delay.

Combines both halves with OR so:
  * recent packet → keep appending (covers normal speech gaps + DTX),
  * buffer not empty → keep appending (a burst-then-disconnect peer
    still has buffered audio to drain).

Promotes PeerJitterBuffer::is_empty out of #[allow(dead_code)] now that
it's exercised in the playout path, with a doc comment explaining what
drives the call.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
Two correctness improvements from Sami + Perci's prior-art review of
PR #609. Both <40 LOC, no API changes.

## 1. Bound per-peer rodio Player queue depth

The playout pipeline has two clocks: producer is a tokio 10 ms interval
that drains NetEq and appends to each peer's Player; consumer is the
cpal audio callback that pulls samples at hardware sample rate. NetEq's
rate adaptation (accelerate / expand) only sees its own input pattern,
not the device's actual consumption rate, and rodio::Player wraps an
unbounded MPSC. Scheduler jitter or tiny clock skew between the two
clocks would let the producer outpace the consumer; over a long call
the queue would grow without bound, adding monotonic latency.

Adds PLAYOUT_QUEUE_HIGH_WATER = 4 (4 × 10 ms = 40 ms, well inside
NetEq's max_delay_ms = 200 ms) and calls Player::skip_one() before the
append when the queue is at or above the threshold. Logs the drop.
Worst-case latency is now bounded at 40 ms regardless of clock drift.

The 'right' long-term fix is to drive playout from the audio render
callback rather than a tokio interval, but that's a much larger change
(and effectively part of the AEC v1 follow-up's WebAudio rewrite). This
gets us the correctness property today.

## 2. Move soft-cap check under the admission lock, define error precedence

Old shape: peers.len() >= MAX_PEERS_PER_ROOM check was outside the
guard mutex, then the real gate (alloc) was inside. Two problems:
- Multiple joiners could pass the outside check concurrently before
  any insert landed, so the cap was racey (bounded by the index space
  but still leaked the invariant).
- A joiner who was both over-cap AND requesting the wrong version got
  VersionMismatch instead of Full, leaking the room's pinned protocol
  version to a client that couldn't have joined anyway.

Both fixed by moving the cap check inside the lock and defining error
precedence explicitly: Ended > Full > VersionMismatch. Single in-lock
admission path is now the only source of truth. New test
admit_full_wins_over_version_mismatch fills the room to capacity then
admits a wrong-version peer and asserts the result is Full.

Also fixes the pinned_version doc-comment: the pin clears when the
manager evicts the Room (cleanup_if_empty), not when peers.is_empty().
The actual behavior was correct (test version_pin_persists_across_peer_churn
already covered it); only the comment was misleading.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
@tlongwell-block tlongwell-block merged commit 6e1418e into main May 18, 2026
14 of 15 checks passed
@tlongwell-block tlongwell-block deleted the dawn/huddle-up-to-10 branch May 18, 2026 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant