refactor(huddle): switch local TTS to Pocket TTS (Kyutai/sherpa-onnx) by tlongwell-block · Pull Request #607 · block/sprout

tlongwell-block · 2026-05-18T13:33:47Z

What

Swaps the local huddle TTS from Kokoro (loaded via ort) to Pocket TTS (Kyutai Labs, Apache-2.0) running through the same sherpa-onnx static library that already powers STT. This is a Kokoro→Pocket model swap, but the user-visible win is the removal of the ort direct dependency tree and its runtime libonnxruntime.dylib, the dylib that was the source of the Gatekeeper-quarantine "damaged app" failures on older Macs.

Why this is the headline change

Before:

ort (Rust crate) dlopen'd a prebuilt onnxruntime dylib at runtime.
sherpa-onnx-sys linked a second onnxruntime statically into the same binary.
Two onnxruntime copies in-process, no shared env/allocator/thread-pool, ~30 MB of duplicate runtime.
libonnxruntime.dylib shipped in Sprout.app/Contents/Frameworks/, ad-hoc-signed by the Rust crate's build script, periodically quarantined by Gatekeeper on first launch.

After:

One statically-linked onnxruntime (the one sherpa was already using).
Zero *.dylib in the macOS bundle. Verified with find … '*onnx*' -o '*.dylib' and otool -L Contents/MacOS/Sprout on a full unsigned tauri build.
cargo tree -i ort / -i ort-sys / -i ndarray → "not found."

Tradeoff: CoreML

We lose access to CoreML acceleration. The upstream k2-fsa static archive doesn't ship a CoreML-enabled onnxruntime, and building one ourselves would re-introduce the same dylib distribution + signing pain we're deleting. Bench on M1 8GB:

Metric	Pocket TTS (CPU)
Engine load (one-time)	~289 ms
Warm synthesis (75-char sentence)	~680 ms
Throughput	~5× realtime

(No measured Kokoro baseline to compare against — the previous "2.6× realtime / 200ms TTFA" numbers in the codebase turned out to be an unverified doc-comment about batching strategy, not a benchmark. If anyone needs an apples-to-apples comparison, we can re-run Kokoro on the same rig as a follow-up.)

What's in the diff

huddle/pocket.rs (new, 238 lines) — sherpa-onnx Pocket TTS wrapper, mirrors huddle/stt.rs config pattern.
huddle/kokoro.rs deleted (973 lines).
huddle/tts.rs — engine import swap, voice path becomes .wav, new normalize_for_playback(samples) function (per-sentence peak normalization to −6 dBFS with MAX_GAIN=8.0 to cap over-amplification of near-silent buffers; +5 unit tests).
huddle/models.rs — Kokoro file list → Pocket bundle (5 ONNX + 2 JSON + voice WAV + LICENSE + MODEL_LICENSE.txt). Readiness fails-closed without the attribution sidecar (+1 test).
Cargo.toml / Cargo.lock — drops ort + ndarray direct deps; lockfile shrinks 334 lines.
API rename kokoro → tts across mod.rs, pipeline.rs, lib.rs, and the frontend HuddleBar.tsx. Status surface is now engine-agnostic.
desktop/scripts/check-file-sizes.mjs — net −1 override entry: removes the now-deleted kokoro.rs budget (980); bumps models.rs (900→930) and tts.rs (1030→1130) to fit the normalize fn + new tests.

Net diff: 12 files, +700 / −1436 lines (most of it the kokoro.rs deletion).

Voice change

Bundled reference is now Kyutai's American-male sample (the previous voice was Kokoro's af_heart, American-female). Per-sentence peak normalization brings playback level to −6 dBFS regardless of the bundled-voice peak, with MAX_GAIN=8.0 keeping near-silent buffers from being amplified to noise.
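The gain-capped peak normalization described above can be sketched as follows. This is a minimal illustration, not the code in `tts.rs`: the function name `normalize_peak_sketch`, its parameters, and the `dbfs_to_linear` helper are hypothetical, and the choice to also attenuate buffers that are louder than the target is an assumption.

```rust
/// Convert a dBFS level to linear amplitude (e.g. -6 dBFS is about 0.501).
fn dbfs_to_linear(db: f32) -> f32 {
    10.0_f32.powf(db / 20.0)
}

/// Hypothetical sketch of per-sentence peak normalization.
/// Scales `samples` in place so the peak reaches `target_peak`,
/// but never applies more than `max_gain`. Returns the gain used.
fn normalize_peak_sketch(samples: &mut [f32], target_peak: f32, max_gain: f32) -> f32 {
    // Largest absolute sample value in this sentence.
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak <= 0.0 {
        return 1.0; // empty or all-zero buffer: nothing to scale
    }
    // Gain that would hit the target exactly, capped so that a
    // near-silent buffer is not amplified into audible noise.
    let gain = (target_peak / peak).min(max_gain);
    for s in samples.iter_mut() {
        *s *= gain;
    }
    gain
}
```

With a target of −6 dBFS and `max_gain = 8.0`, any buffer whose peak is below roughly 0.501 / 8 ≈ 0.063 hits the gain cap and ends up quieter than the target rather than being boosted further.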

Verification

$ cargo fmt --check                                        ✅
$ cargo test --lib                                          ✅ 305/305 (was 299; +5 normalize, +1 sidecar)
$ cargo test --lib huddle::tts                              ✅ 24/24
$ cargo build --release --lib                               ✅ 23s clean
$ just desktop-tauri-check                                  ✅
$ pnpm build                                                ✅ (frontend)
$ just desktop-release-build aarch64-apple-darwin           ✅ (signed-ready Sprout.app + DMG)
$ cargo tree -i ort         / -i ort-sys / -i ndarray       ✅ all "not found"
$ find Sprout.app -iname '*onnx*' -o -iname '*ort*' -o -name '*.dylib'   ✅ empty
$ otool -L Sprout.app/Contents/MacOS/Sprout | rg -i 'onnx|ort'  ✅ empty

How to validate locally

Build + launch the desktop app.
Join a huddle. Watch stderr for TTS warmup completed in {N}ms — expect ≤1500 ms on Apple Silicon with 8 GB+. >1500 ms suggests memory pressure / paging from concurrent STT+TTS init.
Confirm no OfflineRecognizer::create returned None in stderr (STT load path).
Activity Monitor: sprout RSS should settle <800 MB after huddle join. If >800 MB, the spawn order can be serialized (load TTS engine before spawning STT worker) — a one-line follow-up, not a re-architect.

Follow-ups (deliberately out of scope)

A measured Kokoro-vs-Pocket bench on the same rig, if anyone wants the comparison defended.
An Intel-Mac bench to confirm no regression on x86_64 (the x64 sherpa static archive is supported by sherpa-onnx-sys 1.12.38; should just work).
Investigate sherpa-onnx's per-call resampler creation (Creating a resampler: in=16000 out=24000 fires every synth call) — upstream caching opportunity.

Co-investigation

This change was co-investigated with the Goosetown bot crew (Max, Mari, Perci, Quinn, Sami) in the original research/coordination thread. Reviews covered: bundle/signing verification end-to-end (Mari, Sami, Perci), TTS pipeline structural review (Perci, Max), per-sentence loudness normalization design (Quinn, Max), sidecar readiness test (Mari, Max), and PR-claim accuracy (Quinn).

Replace the Kokoro-via-`ort` TTS path with Pocket TTS (Kyutai Labs, Apache-2.0) running through the same `sherpa-onnx` static library already used by STT. This removes the entire `ort`/`ort-sys`/`ndarray` direct dependency tree (and the runtime `libonnxruntime.dylib` it shipped) in exchange for one statically-linked onnxruntime shared with STT.

Headline wins
-------------
* No more `libonnxruntime.dylib` in the macOS bundle, which eliminates the ad-hoc-signed dylib + Gatekeeper-quarantine class of "damaged app" failures on older Macs. Verified by `find … '*onnx*'`, `find … '*.dylib'`, and `otool -L Contents/MacOS/Sprout` on a full unsigned `tauri build` artifact: zero hits, no dynamic ORT link.
* `cargo tree -i ort` / `-i ort-sys` / `-i ndarray` all return "package ID specification did not match any packages." Cargo.lock shrinks by 334 lines (those crates + their now-orphaned support deps).
* Net -735 LOC across the codebase (973-line `huddle/kokoro.rs` deleted, 238 added in `huddle/pocket.rs`).

Tradeoff: CoreML
----------------
Pocket runs CPU-only through sherpa's statically-linked onnxruntime: the upstream k2-fsa static archive does not bundle the CoreML execution provider, and building one ourselves would re-introduce the same dylib distribution problem we are deleting. Bench on M1 8GB shows ~5× realtime warm synthesis (~680ms for a 75-char sentence) and ~289ms one-time engine load, which is acceptable for huddle TTS without CoreML.

Voice
-----
Bundled reference is the Kyutai-provided American-male sample. Previous `af_heart` (Kokoro, American-female) is gone. Per-sentence peak-normalized to −6 dBFS in `tts.rs::normalize_for_playback` with `MAX_GAIN=8.0` to cap over-amplification of near-silent buffers.

Model bundle
------------
~289 MB pre-download at first huddle (was ~187 MB for Kokoro): five ONNX models (text+duration+prosody+depth+acoustic), two JSON tables (vocab + token_scores), one voice WAV, the upstream LICENSE, and a Sprout `MODEL_LICENSE.txt` attribution sidecar. All files SHA-256-pinned in `huddle/models.rs`; readiness now fails-closed without the sidecar (`tts_readiness_requires_license_sidecar` test).

API rename
----------
`VoiceModelStatus.kokoro` → `.tts`; `ModelManager.kokoro_*` → `ModelManager.tts_*`; `start_kokoro_download` → `start_tts_download`; `is_kokoro_ready` → `is_tts_ready`; `modelStatus.kokoro` → `modelStatus.tts` in the frontend. Status surface is now engine-agnostic, so a future TTS swap won't need another rename cycle.

Verification
------------
- `cargo fmt --check` clean
- `cargo test --lib` 305/305 (was 299; +5 normalization, +1 sidecar readiness)
- `cargo test --lib huddle::tts` 24/24
- `cargo build --release --lib` clean in 23s
- `just desktop-tauri-check` clean
- `pnpm build` clean (frontend)
- Full unsigned `tauri build` (aarch64-apple-darwin) produces signed-ready Sprout.app + DMG; bundle inspection above confirms no ORT artifacts ship.

How to validate locally
-----------------------
1. Build + launch the desktop app.
2. Join a huddle. Watch stderr for `TTS warmup completed in {N}ms`; expect ≤1500ms on Apple Silicon with 8GB+. >1500ms suggests memory pressure / paging from concurrent STT+TTS init.
3. Confirm no `OfflineRecognizer::create returned None` in stderr (STT load path).
4. Activity Monitor: sprout RSS should settle <800 MB after huddle join. If >800 MB, the spawn order can be serialized (load TTS engine before spawning STT worker) as a one-line follow-up.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

Replace the anonymous reference WAV bundled with KevinAHM's Pocket TTS ONNX export with the 'Mary (f, conversation)' preset from the Kyutai TTS demo (https://kyutai.org/tts), which maps to vctk/p333_023_enhanced.wav in kyutai/tts-voices.

Source pin:
- repo: huggingface.co/kyutai/tts-voices
- commit: 323332d33f997de8394f24a193e1a76df720e01a
- path: vctk/p333_023_enhanced.wav
- format: 16-bit mono PCM, 32 kHz, 639,084 bytes
- sha256: a35b0468382218e9f37a9a7494d1e4b74deaf18d7ced22265b4e325bb55c183f
- license: CC-BY-4.0 (VCTK base + ai-coustics enhancement)

The on-disk filename remains 'reference_sample.wav' so engine and bench code stay voice-agnostic. sherpa-onnx resamples internally via reference_sample_rate, so the 16 kHz → 32 kHz source change is transparent to the synthesis pipeline (only the load_voice_style doc comment in pocket.rs needed updating).

Changes:
- models.rs: new POCKET_REFERENCE_WAV_URL pin, hash swap, TTS_MODEL_VERSION 1→2 (so existing dev installs re-download cleanly without hitting the hash-fail-then-refetch transient), expanded TTS_LICENSE_TEXT block with VCTK + ai-coustics attribution per CC-BY-4.0 §3(a)(1).
- pocket.rs: module-doc attribution entry + load_voice_style doc comment reflect the new 32 kHz source and Mary's provenance.
- check-file-sizes.mjs: models.rs override 930 → 950 (attribution block added ~22 lines).

Verified:
- cargo test --lib (huddle::): 48/48 pass, including tts_readiness_requires_license_sidecar.
- cargo test --lib (full): 305/305 pass.
- cargo check --lib: clean.
- check-file-sizes.mjs: clean.
- Live URL fetch: SHA-256 of downloaded file matches TTS_FILE_HASHES entry.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

Pocket TTS' autoregressive LM has a stochastic sampler (temp=0.7, random seed) and a hard 500-frame ceiling (~40s of audio at 12.5Hz Mimi frame rate). On very short, unpunctuated, or lowercase inputs, the EOS logit (threshold > -4) sometimes never fires within those 500 frames, so the model produces tens of seconds of nonsensical "breathing" output.

Tyler hit this on the first 'yep' utterance after the Mary swap landed: ~30s of monster noise, then subsequent utterances were fine. Non-deterministic by design: the next sampled trajectory escaped the trap.

Sherpa-onnx's C++ Pocket TTS impl does not run the prompt preparation that upstream kyutai-labs/pocket-tts applies in Python. This commit mirrors that recipe locally:

prepare_pocket_prompt():
  1. Collapse interior whitespace.
  2. Capitalize the first letter.
  3. Append '.' if no terminal punctuation.
  4. If <=4 words: prepend 8 spaces; bump frames_after_eos 1 -> 3.
  5. Compute adaptive max_frames from word count (saturates at 500).

synth_chunk():
  - Routes input through prepare_pocket_prompt before calling generate_with_config.
  - Plumbs frames_after_eos + max_frames via GenerationConfig.extra (HashMap<String, serde_json::Value>), exposed by the sherpa-onnx 1.12 Rust binding as a per-call escape hatch.

Source: pocket_tts.models.tts_model.prepare_text_prompt and _estimate_max_gen_len in https://github.com/kyutai-labs/pocket-tts.

12 new unit tests cover: empty input, the literal 'yep' case (must become '        Yep.', i.e. 8 leading spaces), threshold inclusivity at 4 vs 5 words, preserved existing punctuation, whitespace collapsing, no-double-capitalize, non-ASCII first letter (Cyrillic д -> Д), max_frames tightness/clamping/monotonicity.

Test results:
  cargo test --lib huddle::pocket -> 12/12 pass
  cargo test --lib -> 317/317 pass (was 305 + 12 new)
  cargo fmt --check -> clean
  cargo clippy huddle::pocket -> clean (preexisting doc-list overindent warning at line 41 not introduced here)
  node check-file-sizes.mjs -> clean (pocket.rs override added)

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
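Steps 1 to 4 of the recipe above can be sketched in a few lines. This is an illustration under stated assumptions, not the `prepare_pocket_prompt` in `pocket.rs`: the `PreparedSketch` type and the function name are hypothetical, returning `None` for empty input is a guess at the behaviour, and the set of characters treated as terminal punctuation is assumed. The adaptive `max_frames` estimate (step 5) is left out because the commit message does not give its formula.

```rust
/// Hypothetical result type for this sketch only.
#[derive(Debug, PartialEq)]
struct PreparedSketch {
    text: String,
    is_short: bool,
}

/// Sketch of the prompt clean-up recipe (steps 1-4).
fn prepare_prompt_sketch(input: &str) -> Option<PreparedSketch> {
    // 1. Collapse interior whitespace (this also trims both ends).
    let words: Vec<&str> = input.split_whitespace().collect();
    if words.is_empty() {
        return None; // assumption: empty input yields nothing to synthesise
    }
    let collapsed = words.join(" ");

    // 2. Capitalize the first letter (Unicode-aware, so Cyrillic works).
    let mut chars = collapsed.chars();
    let first = chars.next()?;
    let mut text: String = first.to_uppercase().collect();
    text.push_str(chars.as_str());

    // 3. Append '.' when there is no terminal punctuation.
    //    Assumption: only '.', '!' and '?' count as terminal here.
    if !text.ends_with(|c: char| matches!(c, '.' | '!' | '?')) {
        text.push('.');
    }

    // 4. Short prompts (<= 4 words) get 8 leading spaces.
    let is_short = words.len() <= 4;
    if is_short {
        text.insert_str(0, &" ".repeat(8));
    }
    Some(PreparedSketch { text, is_short })
}
```

Counting words before padding keeps the 4-word threshold independent of the inserted spaces.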

…mpts

The earlier prep-prompt fix (commit 773a2a1) bounded the runaway 'monster breathing' bug on short inputs, but did so by forcing `frames_after_eos = 1` on every prompt with ≥5 words. That's *lower* than the sherpa-onnx upstream default of 3, and it clipped the leading audio of multi-clause sentences:

  Input: 'Yep, I can hear you. What can I help with?'
  Output: a quick burst of static where 'Yep, I can hear you.' should be.

The sentence splitter emits 'Yep, I can hear you.' (5 words, comma not a boundary) as the first chunk, which under 773a2a1 hit the long-input branch and got frames_after_eos=1: too few trailing LM frames after EOS fires for the codec to settle, hence the static-burst.

Re-reading offline-tts-pocket-impl.h confirmed two things:

1. The upstream default for frames_after_eos is **3**, not 1. My 773a2a1 docs had this backwards.
2. The 500-frame max_frames default (~40s) is fine for any reasonable prompt. We only need a tighter cap on the short-input path where the EOS-never-fires runaway bug originally manifested.

This commit:
- Removes the per-call frames_after_eos override entirely. sherpa-onnx keeps its default of 3 for every prompt, including short ones (which upstream pocket_tts.py also bumps to 3, so the default is right).
- Keeps the max_frames override, but only on short (≤4-word) padded inputs, set to a generous 100 frames (~8s), enough slack to never truncate a legitimate short reply.
- Removes estimate_max_frames() (no callers; the new short cap is a fixed constant, not adaptive).
- Refactors the extra-HashMap builder into build_generation_extra() so we can structurally regression-test it.
- Adds property test build_extra_never_lowers_frames_after_eos_for_any_word_count that sweeps a range of prompt lengths and fails CI if anyone ever reintroduces a frames_after_eos override below the upstream default.
- Adds build_extra_long_prompt_is_none which pins down the specific 'Yep, I can hear you.' regression from this report.

Tests: cargo test --lib huddle::pocket → 12/12 pass.

Refs commit 773a2a1 (initial prep-prompt fix). Bumps pocket.rs line-size override 560 → 620 (refactored helper + new tests).

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
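The override policy this commit describes (no `frames_after_eos` key ever, a fixed `max_frames` cap only for short prompts, nothing at all for long ones) might look roughly like the sketch below. It is not the real `build_generation_extra()`: the function name, the `word_count` parameter, the map key strings, and the use of `u32` values in place of `serde_json::Value` are all assumptions made to keep the example free of external crates.

```rust
use std::collections::HashMap;

/// Fixed cap for short prompts: 100 frames is about 8 s at 12.5 Hz.
const SHORT_MAX_FRAMES: u32 = 100;
/// Prompts with at most this many words count as short.
const SHORT_WORD_LIMIT: usize = 4;

/// Hypothetical sketch of the per-call override builder.
/// Long prompts get `None`, so the engine defaults apply untouched.
fn build_extra_sketch(word_count: usize) -> Option<HashMap<String, u32>> {
    if word_count > SHORT_WORD_LIMIT {
        return None;
    }
    let mut extra = HashMap::new();
    // Only max_frames is overridden; frames_after_eos is deliberately
    // never inserted, leaving the upstream default of 3 in effect.
    extra.insert("max_frames".to_string(), SHORT_MAX_FRAMES);
    Some(extra)
}
```

Keeping the decision in a pure function is what makes the sweep-style regression test described above cheap to write.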

Symptom (reported 2026-05-18): "the first little sound or two in a sentence is kind of getting skipped over." Affects every sentence, not just the first one, and is independent of sentence length.

Root cause: `apply_fades` applied an 8 ms linear fade-in (192 samples at 24 kHz) to the start of every synthesised sentence. Probing Pocket TTS output across four prompts (Y/H/W/T onsets, see `examples/pocket_onset_probe.rs`) shows real audio energy inside the first millisecond:

  prompt                    | samples[0] | peak@1ms  rms@1ms
  ──────────────────────────────────────────────────────────
  Yep, I can hear you.      | 0.00193    | 0.0331    0.0235
  Hello there friend.       | 0.00185    | 0.0288    0.0191
  What can I help with?     | 0.00180    | 0.0358    0.0242
  Try this experiment now.  | 0.00189    | 0.0197    0.0139

For 'Yep' and 'What' the first-1ms RMS is *equal to or greater than* the first-5ms RMS: the consonant attack peaks inside the very window the fade was nuking. A 0→1 linear ramp attenuated those onset samples by ≥6 dB over the first 4 ms, which is exactly what Tyler heard as "swallowed sounds".

`samples[0]` ≈ 0.0019 (≈ −54 dBFS) is far below any audible DC-jump-click threshold, so removing the fade-in does not introduce clicks. Fade-out is retained because end-of-sentence cuts *do* create audible clicks when a non-zero waveform terminates abruptly.

Changes (3 files, +101/−24):
- `huddle/tts.rs`:
  - Rename `apply_fades` → `apply_fade_out`. Body removes the leading fade loop and operates on `&mut [f32]` instead of `&mut Vec<f32>`.
  - New const `FIRST_APPEND_LEAD_IN_SAMPLES = 480` (20 ms) and a single-shot `player.append(zeros)` at the `first_append` site, so the OS audio device / rodio mixer gets a quiet ramp-up window *without* scaling any real synthesis samples. Applied once per utterance: sentence boundaries continue to use `INTER_SENTENCE_SILENCE` (100 ms) and don't stack on this cushion.
  - New regression test `apply_fade_out_does_not_touch_leading_samples` locks in that `samples[0..FADE_OUT_SAMPLES]` are byte-equal to input. Will fail loudly if anyone ever reintroduces a leading fade.
  - `first_append_lead_in_is_sane` pins the 20 ms × 24 kHz = 480 constant and documents why that range is reasonable.
  - Existing `apply_fades_*` tests renamed and updated; +2 net tests (24 → 26 in tts.rs; 317 → 319 lib-wide).
- `examples/pocket_onset_probe.rs` (new, 137 lines): synthesises the four probe prompts, dumps per-prompt onset stats (samples[0], peak/RMS @ 1ms/5ms/20ms), and writes raw WAVs to /tmp for offline inspection. Documents the measurement that justifies removing the fade-in; runs against the same `/tmp/pocket-tts-bench` model directory `pocket_bench` uses.
- `desktop/scripts/check-file-sizes.mjs`: bump `tts.rs` override 1130 → 1210 with updated description.

Verification before push:
- `cargo test --lib` (full) → 319/319 pass.
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.
- Manual A/B not yet done from the worktree; Tyler will hear the result on `cargo run` after pull.

Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts (initial diagnosis, Max's review of approach, probe data).

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
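A fade-out-only helper of the kind this commit describes could be sketched as below. It is an illustration rather than the `apply_fade_out` in `tts.rs`: the names are hypothetical, and the 8 ms fade-out length is an assumption borrowed from the removed fade-in (the commit message does not state the value of `FADE_OUT_SAMPLES`). The `ms_to_samples` helper also reproduces the 8 ms → 192 and 20 ms → 480 sample arithmetic at 24 kHz.

```rust
/// Output sample rate quoted in the commit message.
const OUTPUT_RATE_HZ: usize = 24_000;
/// Assumed fade-out length for this sketch.
const FADE_OUT_MS: usize = 8;

/// Convert milliseconds to a sample count at the output rate.
fn ms_to_samples(ms: usize) -> usize {
    OUTPUT_RATE_HZ * ms / 1000
}

/// Linear fade-out over the tail of the buffer.
/// Leading samples are never touched, so consonant onsets survive.
fn apply_fade_out_sketch(samples: &mut [f32]) {
    // Clamp the fade length so short buffers do not underflow.
    let fade_len = ms_to_samples(FADE_OUT_MS).min(samples.len());
    let start = samples.len() - fade_len;
    for (i, s) in samples[start..].iter_mut().enumerate() {
        // Gain steps down linearly and reaches exactly 0 on the last sample.
        let gain = (fade_len - 1 - i) as f32 / fade_len as f32;
        *s *= gain;
    }
}
```

Taking `&mut [f32]` rather than `&mut Vec<f32>` matches the signature change noted above and makes it obvious the helper cannot change the buffer length.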

Follow-up to f570ec0 addressing Max's PR review feedback: the `FIRST_APPEND_LEAD_IN_SAMPLES` cushion is correctly gated by `if first_append` in the worker loop, but nothing in the test suite catches the only-bad-version of that pad. A future refactor moving the `if first_append` check inside the per-sentence loop would silently stack 20 ms on top of `INTER_SENTENCE_SILENCE` at every sentence boundary and audibly slow multi-sentence utterances.

Refactor the append decision into a pure helper and pin the invariant:

- New `build_sentence_append_plan(first_append, boosted, silence_len) -> Vec<Vec<f32>>`: returns [lead_in, audio, inter_silence] on the first call (flipping `first_append` to false), or [audio, inter_silence] on every subsequent call. The worker loop now calls this and iterates the returned buffers, instead of conditionally appending inline.
- Three new tests:
  - `lead_in_pad_fires_exactly_once_per_utterance`: pumps 5 sentences through the plan builder, counts the lead-in buffers, asserts exactly 1. The regression test Max specifically asked for.
  - `build_sentence_append_plan_flips_first_append`: pins the flag-mutation contract.
  - `first_sentence_leading_silence_is_exactly_lead_in`: asserts the lead-in is the only leading-silence buffer (no inter-sentence silence is emitted before the first audio buffer).

The worker-loop call site is now ~7 lines shorter and harder to break: `was_first` snapshots the flag for the `tts_active.store` gate, the plan builder owns the rest.

Verification:
- `cargo test --lib` → 322/322 pass (was 319 → +3 new).
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.

Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts, specifically Max's messages [11]/[12]/[13] requesting the once-per-utterance test before merge.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
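The once-per-utterance contract of the plan builder can be illustrated with a small pure function. This is a sketch that follows the signature quoted in the commit message; the `_sketch` name, the use of `&mut bool` for the flag, and representing silence as zero-filled vectors are assumptions about details the message does not spell out.

```rust
/// 20 ms at 24 kHz, as stated in the commit message.
const FIRST_APPEND_LEAD_IN_SAMPLES: usize = 480;

/// Hypothetical sketch of the append-plan builder.
/// First call:  [lead_in, audio, inter_silence] and the flag flips to false.
/// Later calls: [audio, inter_silence].
fn build_sentence_append_plan_sketch(
    first_append: &mut bool,
    boosted: Vec<f32>,
    silence_len: usize,
) -> Vec<Vec<f32>> {
    let mut plan = Vec::with_capacity(3);
    if *first_append {
        // Quiet warm-up window for the audio device; no real samples are scaled.
        plan.push(vec![0.0; FIRST_APPEND_LEAD_IN_SAMPLES]);
        *first_append = false;
    }
    plan.push(boosted);
    // Inter-sentence silence always follows the audio, never precedes it.
    plan.push(vec![0.0; silence_len]);
    plan
}
```

Because the function owns the flag flip, a caller that loops over sentences cannot accidentally emit the lead-in more than once per utterance.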

Pocket TTS' FlowLM has an autoregressive cold-start: the first 2-3 generation steps run without audio context in the KV cache, occasionally smearing or dropping the first phoneme of short utterances. Tyler reproduced this on 'I'm happy.' rendering as 'm happy', and on other 'I'm X' constructions across random seeds. The bug is documented upstream as kyutai-labs/pocket-tts #91 (8 comments, 2 collaborators acknowledged) and #70, with collateral discussion at sherpa-onnx #3180.

Earlier commits in this branch reduced but did not eliminate the failure: 773a2a1 added 8-space padding; 1dbfa2c restored sherpa's `frames_after_eos` default of 3 (fixing a separate static-burst regression); f570ec0 dropped the leading fade-in. Empirical study at production settings (silence_scale=0, frames_after_eos=3 default) confirmed that temperature, silence_scale, seed, and pad tweaks are all insufficient: the model's stochastic sampling lands on a bad trajectory often enough to be perceptible on short prompts.

This commit applies the upstream-documented sacrificial-word workaround (ikidd in kyutai-labs/pocket-tts #70) with two refinements:

1. Sacrificial prefix '. . ' (two periods + space) instead of a word. The pair was empirically the only variant in our probe that produced a usable post-sacrificial silence gap on every random seed in the 8-seed × 8-variant matrix (`sacrificial_probe`, iterated locally during investigation); a single period failed on seed=99999. Periods render as low-amplitude breath rather than spoken audio.
2. Post-synth trim: scan from t=30ms looking for the first run of samples below 0.02 lasting >= 50 ms; that's the sacrificial→main boundary. `Vec::drain` everything before the gap-end. If no gap is found or the boundary lies beyond 1.2 s (production max-drop bound), bail out and emit the raw buffer rather than corrupt the audio.

We don't insert a zero lead-in here because tts.rs's existing FIRST_APPEND_LEAD_IN_SAMPLES already provides the OS-device warm-up cushion on the first append of an utterance, and subsequent sentences are buffered by INTER_SENTENCE_SILENCE.

Both the prefix and the trim are gated on PreparedPrompt::is_short (<= 4 words after preprocessing, matches upstream's pad_with_spaces_for_short_inputs predicate). Long prompts pass through unchanged: the first phoneme of a long utterance has enough downstream context to avoid the smear, and a natural early pause like the comma in 'Hello, how can I help you?' would otherwise be misdetected by the trimmer as the sacrificial gap (Max caught this in review, thanks).

Also: bump TARGET_PEAK in tts.rs from -6 dBFS (0.501) to -3 dBFS (0.708) per Tyler. This is a ceiling on per-sentence loudness normalization, not a floor: quieter Pocket utterances under MAX_GAIN=8 will still land below the ceiling (bench-typical peak 0.076 lands at 0.608, ~-4.3 dBFS). Comment updated to reflect that nuance.

Probe data (see examples/prod_probe.rs; production GenerationConfig: silence_scale=0.0, frames_after_eos default 3, max_frames=100 short). Tested 5 prompts × 5 seeds with the new code path:

- Short prompts ('I'm happy', 'I'm sorry', 'I'm ready', 'Yep', 'I see you') with sacrificial prefix: 25/25 produced a >=50ms silence gap in the 30-340ms range. Trim drops 47-339ms; final audio 270-748ms.
- Long prompts without sacrificial (regression check): 'Hello, how can I help you today?' and 'Yes, that works. Let me try again.' generate normally; comma pauses preserved.

Tyler ear-confirmed the trimmed short-prompt output:

> these are much better! I like this!

Max reviewed twice: first flagging a silence_scale mismatch between probe (silence_scale=1.0) and production (0.0), then flagging the destructive-edge hazard if trim ran on un-sacrificed long prompts. Both are addressed: prod_probe mirrors production GenerationConfig exactly (silence_scale=0.0, no frames_after_eos override per 1dbfa2c), and the trim is gated on is_short with a 1.2s max-drop bound as belt-and-suspenders against the destructive edge case.

Tests added (in pocket.rs):
- prepare_prompt_inserts_sacrificial_prefix_only_for_short: pins the exact ordering (pad + '. . ' + cleaned).
- prepare_prompt_threshold_is_inclusive_at_four_words extended to assert is_short and SACRIFICIAL_PREFIX absence on long input.
- trim_strips_sacrificial_and_keeps_only_speech: feed a synthetic sacrificial+gap+speech buffer; assert leading sample is speech.
- trim_is_noop_when_no_long_silence_gap_exists
- trim_is_noop_when_gap_is_shorter_than_threshold
- trim_is_noop_when_gap_is_beyond_max_drop_bound: guards the destructive-edge case Max flagged.
- trim_is_noop_on_buffer_smaller_than_scan_start: no panic.
- trim_constants_use_sane_units: pins millisecond meanings.

Tests added (in tts.rs):
- normalize_for_playback_clamps_at_max_gain_below_target: new behaviour under the -3 dBFS ceiling for bench-typical peaks.
- normalize_for_playback_hits_target_on_quiet_buffer updated for new MAX_GAIN saturation point (0.0885) on the input side.

All 330 cargo test --lib pass. cargo fmt --check and desktop/scripts/check-file-sizes.mjs are green. pocket.rs cap 620 → 900, tts.rs cap 1335 → 1380.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
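The post-synth trim described in this commit (scan from 30 ms, find the first run below 0.02 lasting at least 50 ms, drain up to the gap end, bail out past 1.2 s) can be sketched as follows. This is not the `trim_leading_cold_start` in `pocket.rs`: the constant and function names are hypothetical, the 24 kHz rate is taken from the earlier commits, and treating "silence that never ends before the buffer does" as a no-op is an assumption.

```rust
const SAMPLE_RATE_HZ: usize = 24_000;
const SCAN_START_MS: usize = 30;
const MIN_GAP_MS: usize = 50;
const MAX_DROP_MS: usize = 1_200;
const SILENCE_THRESHOLD: f32 = 0.02;

/// Milliseconds to samples at the assumed output rate.
fn ms(n: usize) -> usize {
    SAMPLE_RATE_HZ * n / 1000
}

/// Sketch of the sacrificial-prefix trim.
/// Returns the number of leading samples dropped (0 means no-op).
fn trim_leading_cold_start_sketch(samples: &mut Vec<f32>) -> usize {
    let min_gap = ms(MIN_GAP_MS);
    let max_drop = ms(MAX_DROP_MS);
    let mut run_start: Option<usize> = None;
    // An empty range when the buffer is shorter than the scan start.
    for i in ms(SCAN_START_MS)..samples.len() {
        if samples[i].abs() < SILENCE_THRESHOLD {
            if run_start.is_none() {
                run_start = Some(i); // a quiet run begins here
            }
        } else if let Some(start) = run_start.take() {
            // The quiet run just ended at i; is it long enough to be the gap?
            if i - start >= min_gap {
                if i > max_drop {
                    return 0; // boundary too late: keep the raw buffer
                }
                samples.drain(..i); // drop the prefix and the gap
                return i;
            }
        }
    }
    0 // no qualifying gap found
}
```

Returning the drop count makes the "trim length equals raw length minus gap end" relationship easy to check in a probe or a test.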

Max noted that prod_probe's header advertised _sac variants as 'the new path' but the WAVs written were actually the raw engine output, not the post-trim audio that synth_chunk returns to tts.rs. Anyone listening to those files would have heard the sacrificial breath at the start, which is misleading for ear-testing the fix.

Mirror trim_leading_cold_start (and its constants) inline in the probe. For short prompts with a sacrificial prefix, write both files:

  /tmp/prod_<label>_s<seed>_raw.wav: raw engine output
  /tmp/prod_<label>_s<seed>_trimmed.wav: what production actually plays

Long prompts (no sacrificial, no trim) only get the _raw variant since that's what synth_chunk returns for them in production. Header rewritten to match.

Sample data after the change:

  imhappy_sac_s99999: raw 472ms (gap 50..144ms) → trim 328ms
  yep_sac_s42: raw 270ms (gap 30..141ms) → trim 129ms
  imhappy_sac_s314159: raw 730ms (gap 43..339ms) → trim 391ms

(Trim length == raw_len - gap_end_ms, matching expectations.)

The inline trim is a deliberate copy of huddle::pocket: the example sits in desktop/src-tauri/examples which can't reach into the private huddle module. Comment at the top of the constants block flags the 'keep in sync' contract.

All 330 cargo test --lib still pass; file-sizes still green.

Non-blocking cleanup from Max's review of 61d064d.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>


tlongwell-block marked this pull request as ready for review May 18, 2026 14:07

tlongwell-block requested a review from wesbillman as a code owner May 18, 2026 14:07

tlongwell-block and others added 7 commits May 18, 2026 10:25

huddle(tts): cushion every sentence onset

f077571

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block force-pushed the switch-tts-to-pocket branch from 341c47b to f077571 Compare May 18, 2026 17:53

huddle(tts): keep sentence cushion in one source

f00421c

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block merged commit c0eb5af into main May 18, 2026
14 of 15 checks passed

tlongwell-block deleted the switch-tts-to-pocket branch May 18, 2026 18:28

This was referenced May 28, 2026

chore(release): release version 0.3.1 #769

Merged

chore(release): release version 0.3.2 #774

Merged

tlongwell-block merged 10 commits into main from switch-tts-to-pocket
