refactor(huddle): switch local TTS to Pocket TTS (Kyutai/sherpa-onnx) #607

Merged: tlongwell-block merged 10 commits into main from switch-tts-to-pocket on May 18, 2026.

@tlongwell-block
Collaborator

What

Swaps the local huddle TTS from Kokoro (loaded via ort) to Pocket TTS (Kyutai Labs, Apache-2.0) running through the same sherpa-onnx static library that already powers STT. This is a Kokoro→Pocket model swap, but the user-visible win is the removal of the ort direct dependency tree and its runtime libonnxruntime.dylib — the dylib that was the source of the gatekeeper-quarantine "damaged app" failures on older Macs.

Why this is the headline change

Before:

  • ort (Rust crate) dlopen'd a prebuilt onnxruntime dylib at runtime.
  • sherpa-onnx-sys linked a second onnxruntime statically into the same binary.
  • Two onnxruntime copies in-process, no shared env/allocator/thread-pool, ~30 MB of duplicate runtime.
  • libonnxruntime.dylib shipped in Sprout.app/Contents/Frameworks/, ad-hoc-signed by the Rust crate's build script, periodically quarantined by Gatekeeper on first launch.

After:

  • One statically-linked onnxruntime (the one sherpa was already using).
  • Zero *.dylib in the macOS bundle. Verified with find … '*onnx*' -o '*.dylib' and otool -L Contents/MacOS/Sprout on a full unsigned tauri build.
  • cargo tree -i ort / -i ort-sys / -i ndarray → "not found."

Tradeoff: CoreML

We lose access to CoreML acceleration. The upstream k2-fsa static archive doesn't ship a CoreML-enabled onnxruntime, and building one ourselves would re-introduce the same dylib distribution + signing pain we're deleting. Bench on M1 8GB:

  Metric                            | Pocket TTS (CPU)
  ─────────────────────────────────────────────────────
  Engine load (one-time)            | ~289 ms
  Warm synthesis (75-char sentence) | ~680 ms
  Throughput                        | ~5× realtime

(No measured Kokoro baseline to compare against — the previous "2.6× realtime / 200ms TTFA" numbers in the codebase turned out to be an unverified doc-comment about batching strategy, not a benchmark. If anyone needs an apples-to-apples comparison, we can re-run Kokoro on the same rig as a follow-up.)

What's in the diff

  • huddle/pocket.rs (new, 238 lines) — sherpa-onnx Pocket TTS wrapper, mirrors huddle/stt.rs config pattern.
  • huddle/kokoro.rs deleted (973 lines).
  • huddle/tts.rs — engine import swap, voice path becomes .wav, new normalize_for_playback(samples) function (per-sentence peak normalization to −6 dBFS with MAX_GAIN=8.0 to cap over-amplification of near-silent buffers; +5 unit tests).
  • huddle/models.rs — Kokoro file list → Pocket bundle (5 ONNX + 2 JSON + voice WAV + LICENSE + MODEL_LICENSE.txt). Readiness fails-closed without the attribution sidecar (+1 test).
  • Cargo.toml / Cargo.lock — drops ort + ndarray direct deps; lockfile shrinks 334 lines.
  • API rename kokoro → tts across mod.rs, pipeline.rs, lib.rs, and the frontend HuddleBar.tsx. Status surface is now engine-agnostic.
  • desktop/scripts/check-file-sizes.mjs — net −1 override entry: removes the now-deleted kokoro.rs budget (980); bumps models.rs (900→930) and tts.rs (1030→1130) to fit the normalize fn + new tests.

Net diff: 12 files, +700 / −1436 lines (most of it the kokoro.rs deletion).

Voice change

Bundled reference is now Kyutai's American-male sample (the previous voice was Kokoro's af_heart, American-female). Per-sentence peak normalization brings playback level to −6 dBFS regardless of the bundled-voice peak, with MAX_GAIN=8.0 keeping near-silent buffers from being amplified to noise.
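The normalization described above can be sketched as follows. This is a minimal illustration, assuming the gain is simply min(target / peak, MAX_GAIN); the name `normalize_peak` and its return value are hypothetical, and the real `normalize_for_playback` in tts.rs may differ.

```rust
/// Linear amplitude for -6 dBFS: 10^(-6/20) ≈ 0.501.
const TARGET_PEAK: f32 = 0.501;
/// Gain ceiling so near-silent buffers are not amplified into noise.
const MAX_GAIN: f32 = 8.0;

/// Scales `samples` in place so the peak approaches `TARGET_PEAK`,
/// never applying more than `MAX_GAIN`. Returns the gain that was used.
/// (Illustrative sketch only; not the code from tts.rs.)
fn normalize_peak(samples: &mut [f32]) -> f32 {
    // Largest absolute sample value in the buffer.
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak == 0.0 {
        return 1.0; // all-zero buffer: nothing to scale
    }
    let gain = (TARGET_PEAK / peak).min(MAX_GAIN);
    for s in samples.iter_mut() {
        *s *= gain;
    }
    gain
}
```

With this shape, a buffer peaking at 0.25 is scaled by about 2.0, while a buffer peaking at 0.002 is scaled by the capped 8.0 and ends up well below the target.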

Verification

$ cargo fmt --check                                        ✅
$ cargo test --lib                                          ✅ 305/305 (was 299; +5 normalize, +1 sidecar)
$ cargo test --lib huddle::tts                              ✅ 24/24
$ cargo build --release --lib                               ✅ 23s clean
$ just desktop-tauri-check                                  ✅
$ pnpm build                                                ✅ (frontend)
$ just desktop-release-build aarch64-apple-darwin           ✅ (signed-ready Sprout.app + DMG)
$ cargo tree -i ort         / -i ort-sys / -i ndarray       ✅ all "not found"
$ find Sprout.app -iname '*onnx*' -o -iname '*ort*' -o -name '*.dylib'   ✅ empty
$ otool -L Sprout.app/Contents/MacOS/Sprout | rg -i 'onnx|ort'  ✅ empty

How to validate locally

  1. Build + launch the desktop app.
  2. Join a huddle. Watch stderr for TTS warmup completed in {N}ms — expect ≤1500 ms on Apple Silicon with 8 GB+. >1500 ms suggests memory pressure / paging from concurrent STT+TTS init.
  3. Confirm no OfflineRecognizer::create returned None in stderr (STT load path).
  4. Activity Monitor: sprout RSS should settle <800 MB after huddle join. If >800 MB, the spawn order can be serialized (load TTS engine before spawning STT worker) — a one-line follow-up, not a re-architect.

Follow-ups (deliberately out of scope)

  • A measured Kokoro-vs-Pocket bench on the same rig, if anyone wants the comparison defended.
  • An Intel-Mac bench to confirm no regression on x86_64 (the x64 sherpa static archive is supported by sherpa-onnx-sys 1.12.38; should just work).
  • Investigate sherpa-onnx's per-call resampler creation (Creating a resampler: in=16000 out=24000 fires every synth call) — upstream caching opportunity.

Co-investigation

This change was co-investigated with the Goosetown bot crew (Max, Mari, Perci, Quinn, Sami) in the original research/coordination thread. Reviews covered: bundle/signing verification end-to-end (Mari, Sami, Perci), TTS pipeline structural review (Perci, Max), per-sentence loudness normalization design (Quinn, Max), sidecar readiness test (Mari, Max), and PR-claim accuracy (Quinn).

Replace the Kokoro-via-`ort` TTS path with Pocket TTS (Kyutai Labs,
Apache-2.0) running through the same `sherpa-onnx` static library already
used by STT. This removes the entire `ort`/`ort-sys`/`ndarray` direct
dependency tree (and the runtime `libonnxruntime.dylib` it shipped) in
exchange for one statically-linked onnxruntime shared with STT.

Headline wins
-------------
* No more `libonnxruntime.dylib` in the macOS bundle — eliminates the
  ad-hoc-signed dylib + gatekeeper-quarantine class of "damaged app" failures
  on older Macs. Verified by `find … '*onnx*'`, `find … '*.dylib'`, and
  `otool -L Contents/MacOS/Sprout` on a full unsigned `tauri build` artifact:
  zero hits, no dynamic ORT link.
* `cargo tree -i ort` / `-i ort-sys` / `-i ndarray` all return "package ID
  specification did not match any packages." Cargo.lock shrinks by 334 lines
  (those crates + their now-orphaned support deps).
* Net -735 LOC across the codebase (973-line `huddle/kokoro.rs` deleted, 238
  added in `huddle/pocket.rs`).

Tradeoff: CoreML
----------------
Pocket runs CPU-only through sherpa's statically-linked onnxruntime — the
upstream k2-fsa static archive does not bundle the CoreML execution
provider, and building one ourselves would re-introduce the same dylib
distribution problem we are deleting. Bench on M1 8GB shows ~5× realtime
warm synthesis (~680ms for a 75-char sentence) and ~289ms one-time engine
load, which is acceptable for huddle TTS without CoreML.

Voice
-----
Bundled reference is the Kyutai-provided American-male sample. Previous
`af_heart` (Kokoro, American-female) is gone. Per-sentence peak-normalized
to −6 dBFS in `tts.rs::normalize_for_playback` with `MAX_GAIN=8.0` to cap
over-amplification of near-silent buffers.

Model bundle
------------
~289 MB pre-download at first huddle (was ~187 MB for Kokoro): five ONNX
models (text+duration+prosody+depth+acoustic), two JSON tables
(vocab + token_scores), one voice WAV, the upstream LICENSE, and a Sprout
`MODEL_LICENSE.txt` attribution sidecar. All files SHA-256-pinned in
`huddle/models.rs`; readiness now fails-closed without the sidecar
(`tts_readiness_requires_license_sidecar` test).
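A minimal sketch of what "fails-closed without the sidecar" could look like. The file list is a hypothetical stand-in (only `reference_sample.wav`, `LICENSE`, and `MODEL_LICENSE.txt` are names taken from this PR), SHA-256 verification is left out, and the `exists` closure is an assumption made so the check is testable without touching disk.

```rust
/// Hypothetical file list for illustration; the real bundle and its
/// SHA-256 pins live in huddle/models.rs.
const REQUIRED_TTS_FILES: &[&str] = &[
    "model.onnx", // stand-in for the five ONNX files
    "vocab.json", // stand-in for the JSON tables
    "reference_sample.wav",
    "LICENSE",
    "MODEL_LICENSE.txt", // attribution sidecar
];

/// Fail-closed readiness: every required file, sidecar included,
/// must be present. A single missing file makes the engine "not ready".
fn is_tts_ready(exists: impl Fn(&str) -> bool) -> bool {
    REQUIRED_TTS_FILES.iter().all(|&name| exists(name))
}
```

Treating the attribution file as a required bundle member, rather than an optional extra, means a partial download can never be reported as ready.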

API rename
----------
`VoiceModelStatus.kokoro` → `.tts`; `ModelManager.kokoro_*` →
`ModelManager.tts_*`; `start_kokoro_download` → `start_tts_download`;
`is_kokoro_ready` → `is_tts_ready`; `modelStatus.kokoro` → `modelStatus.tts`
in the frontend. Status surface is now engine-agnostic, so a future TTS
swap won't need another rename cycle.

Verification
------------
- `cargo fmt --check` clean
- `cargo test --lib` 305/305 (was 299; +5 normalization, +1 sidecar
  readiness)
- `cargo test --lib huddle::tts` 24/24
- `cargo build --release --lib` clean in 23s
- `just desktop-tauri-check` clean
- `pnpm build` clean (frontend)
- Full unsigned `tauri build` (aarch64-apple-darwin) produces signed-ready
  Sprout.app + DMG; bundle inspection above confirms no ORT artifacts ship.

How to validate locally
-----------------------
1. Build + launch the desktop app.
2. Join a huddle. Watch stderr for `TTS warmup completed in {N}ms` —
   expect ≤1500ms on Apple Silicon with 8GB+. >1500ms suggests memory
   pressure / paging from concurrent STT+TTS init.
3. Confirm no `OfflineRecognizer::create returned None` in stderr (STT
   load path).
4. Activity Monitor: sprout RSS should settle <800 MB after huddle join.
   If >800 MB, the spawn order can be serialized (load TTS engine before
   spawning STT worker) as a one-line follow-up.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
tlongwell-block

This comment was marked as outdated.

Replace the anonymous reference WAV bundled with KevinAHM's Pocket TTS
ONNX export with the 'Mary (f, conversation)' preset from the Kyutai
TTS demo (https://kyutai.org/tts), which maps to vctk/p333_023_enhanced.wav
in kyutai/tts-voices.

Source pin:
- repo: huggingface.co/kyutai/tts-voices
- commit: 323332d33f997de8394f24a193e1a76df720e01a
- path: vctk/p333_023_enhanced.wav
- format: 16-bit mono PCM, 32 kHz, 639,084 bytes
- sha256: a35b0468382218e9f37a9a7494d1e4b74deaf18d7ced22265b4e325bb55c183f
- license: CC-BY-4.0 (VCTK base + ai-coustics enhancement)

The on-disk filename remains 'reference_sample.wav' so engine and bench
code stay voice-agnostic. sherpa-onnx resamples internally via
reference_sample_rate, so the 16 kHz → 32 kHz source change is
transparent to the synthesis pipeline (only the load_voice_style doc
comment in pocket.rs needed updating).

Changes:
- models.rs: new POCKET_REFERENCE_WAV_URL pin, hash swap, TTS_MODEL_VERSION
  1→2 (so existing dev installs re-download cleanly without hitting the
  hash-fail-then-refetch transient), expanded TTS_LICENSE_TEXT block with
  VCTK + ai-coustics attribution per CC-BY-4.0 §3(a)(1).
- pocket.rs: module-doc attribution entry + load_voice_style doc comment
  reflect the new 32 kHz source and Mary's provenance.
- check-file-sizes.mjs: models.rs override 930 → 950 (attribution block
  added ~22 lines).

Verified:
- cargo test --lib (huddle::): 48/48 pass, including
  tts_readiness_requires_license_sidecar.
- cargo test --lib (full): 305/305 pass.
- cargo check --lib: clean.
- check-file-sizes.mjs: clean.
- Live URL fetch: SHA-256 of downloaded file matches TTS_FILE_HASHES entry.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
@tlongwell-block tlongwell-block marked this pull request as ready for review May 18, 2026 14:07
tlongwell-block and others added 7 commits May 18, 2026 10:25
Pocket TTS' autoregressive LM has a stochastic sampler (temp=0.7, random
seed) and a hard 500-frame ceiling (~40s of audio at 12.5Hz Mimi frame
rate). On very short, unpunctuated, or lowercase inputs, the EOS logit
(threshold > -4) sometimes never fires within those 500 frames, so the
model produces tens of seconds of nonsensical "breathing" output.

Tyler hit this on the first 'yep' utterance after the Mary swap landed:
~30s of monster noise, then subsequent utterances were fine. Non-
deterministic by design — the next sampled trajectory escaped the trap.

Sherpa-onnx's C++ Pocket TTS impl does not run the prompt preparation
that upstream kyutai-labs/pocket-tts applies in Python. This commit
mirrors that recipe locally:

  prepare_pocket_prompt():
    1. Collapse interior whitespace.
    2. Capitalize the first letter.
    3. Append '.' if no terminal punctuation.
    4. If <=4 words: prepend 8 spaces; bump frames_after_eos 1 -> 3.
    5. Compute adaptive max_frames from word count (saturates at 500).

  synth_chunk():
    - Routes input through prepare_pocket_prompt before calling
      generate_with_config.
    - Plumbs frames_after_eos + max_frames via GenerationConfig.extra
      (HashMap<String, serde_json::Value>) — exposed by the sherpa-onnx
      1.12 Rust binding as a per-call escape hatch.

Source: pocket_tts.models.tts_model.prepare_text_prompt and
_estimate_max_gen_len in https://github.com/kyutai-labs/pocket-tts.
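Steps 1 to 4 of the recipe above can be sketched as below. This is an illustration under stated assumptions, not the code in pocket.rs: the function name, the struct fields, and the rule "append '.' when the last character is alphanumeric" are assumptions, and the `frames_after_eos` and `max_frames` plumbing is omitted.

```rust
const SHORT_WORD_LIMIT: usize = 4;
const SHORT_PAD: &str = "        "; // 8 spaces

/// Hypothetical result type; field names are assumptions for this sketch.
#[derive(Debug, PartialEq)]
struct PreparedPrompt {
    text: String,
    is_short: bool,
}

fn prepare_prompt_sketch(raw: &str) -> Option<PreparedPrompt> {
    // 1. Collapse interior whitespace (this also trims both ends).
    let words: Vec<&str> = raw.split_whitespace().collect();
    if words.is_empty() {
        return None; // empty input: nothing to synthesize
    }
    let collapsed = words.join(" ");

    // 2. Capitalize the first letter (Unicode-aware, so 'д' becomes 'Д').
    let mut chars = collapsed.chars();
    let first = chars.next()?;
    let mut text: String = first.to_uppercase().collect();
    text.push_str(chars.as_str());

    // 3. Append '.' if there is no terminal punctuation.
    //    Assumption: "no terminal punctuation" means the last char is alphanumeric.
    if text.chars().last().map_or(false, |c| c.is_alphanumeric()) {
        text.push('.');
    }

    // 4. Short prompts (<= 4 words) get 8 leading spaces.
    let is_short = words.len() <= SHORT_WORD_LIMIT;
    if is_short {
        text.insert_str(0, SHORT_PAD);
    }
    Some(PreparedPrompt { text, is_short })
}
```

Under these assumptions the literal 'yep' input becomes eight spaces followed by 'Yep.', and a five-word prompt is capitalized and punctuated but not padded.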

12 new unit tests cover: empty input, the literal 'yep' case (must
become '        Yep.'), threshold inclusivity at 4 vs 5 words, preserved
existing punctuation, whitespace collapsing, no-double-capitalize,
non-ASCII first letter (Cyrillic д -> Д), max_frames tightness/clamping/
monotonicity.

Test results:
  cargo test --lib huddle::pocket  -> 12/12 pass
  cargo test --lib                 -> 317/317 pass (was 305 + 12 new)
  cargo fmt --check                -> clean
  cargo clippy huddle::pocket      -> clean (preexisting doc-list
                                            overindent warning at line
                                            41 not introduced here)
  node check-file-sizes.mjs        -> clean (pocket.rs override added)

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
…mpts

The earlier prep-prompt fix (commit 773a2a1) bounded the runaway 'monster
breathing' bug on short inputs, but did so by forcing
`frames_after_eos = 1` on every prompt with ≥5 words. That's *lower*
than the sherpa-onnx upstream default of 3, and it clipped the leading
audio of multi-clause sentences:

  Input:  'Yep, I can hear you. What can I help with?'
  Output: a quick burst of static where 'Yep, I can hear you.' should be.

The sentence splitter emits 'Yep, I can hear you.' (5 words, comma not a
boundary) as the first chunk, which under 773a2a1 hit the long-input
branch and got frames_after_eos=1 — too few trailing LM frames after EOS
fires for the codec to settle, hence the static-burst.

Re-reading offline-tts-pocket-impl.h confirmed two things:

  1. The upstream default for frames_after_eos is **3**, not 1. My
     773a2a1 docs had this backwards.
  2. The 500-frame max_frames default (~40s) is fine for any reasonable
     prompt. We only need a tighter cap on the short-input path where
     the EOS-never-fires runaway bug originally manifested.

This commit:

  - Removes the per-call frames_after_eos override entirely. sherpa-onnx
    keeps its default of 3 for every prompt, including short ones (which
    upstream pocket_tts.py also bumps to 3 — so the default is right).
  - Keeps the max_frames override, but only on short (≤4-word) padded
    inputs, set to a generous 100 frames (~8s) — enough slack to never
    truncate a legitimate short reply.
  - Removes estimate_max_frames() (no callers; the new short cap is a
    fixed constant, not adaptive).
  - Refactors the extra-HashMap builder into build_generation_extra()
    so we can structurally regression-test it.
  - Adds property test build_extra_never_lowers_frames_after_eos_for_any_word_count
    that sweeps a range of prompt lengths and fails CI if anyone ever
    reintroduces a frames_after_eos override below the upstream default.
  - Adds build_extra_long_prompt_is_none which pins down the specific
    'Yep, I can hear you.' regression from this report.
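The behaviour listed above can be sketched as follows. The signature, the map key strings, and the `lowers_frames_after_eos` helper are assumptions for illustration; the real map holds `serde_json::Value`, replaced here by plain `i64` so the sketch has no external dependencies.

```rust
use std::collections::HashMap;

const SHORT_WORD_LIMIT: usize = 4;
const SHORT_MAX_FRAMES: i64 = 100; // ~8 s at the 12.5 Hz frame rate
const UPSTREAM_FRAMES_AFTER_EOS: i64 = 3;

/// Per-call overrides: only short prompts get a tighter `max_frames`.
/// Long prompts return None so every sherpa-onnx default stays in effect.
fn build_generation_extra(word_count: usize) -> Option<HashMap<String, i64>> {
    if word_count > SHORT_WORD_LIMIT {
        return None;
    }
    let mut extra = HashMap::new();
    extra.insert("max_frames".to_string(), SHORT_MAX_FRAMES);
    Some(extra)
}

/// The invariant the property test guards: no `frames_after_eos`
/// override below the upstream default may ever appear in the map.
fn lowers_frames_after_eos(extra: &Option<HashMap<String, i64>>) -> bool {
    extra
        .as_ref()
        .and_then(|m| m.get("frames_after_eos"))
        .map_or(false, |v| *v < UPSTREAM_FRAMES_AFTER_EOS)
}
```

A sweep over word counts can then assert that `lowers_frames_after_eos` is false everywhere and that prompts of five or more words yield `None`.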

Tests: cargo test --lib huddle::pocket → 12/12 pass.

Refs commit 773a2a1 (initial prep-prompt fix).
Bumps pocket.rs line-size override 560 → 620 (refactored helper + new
tests).

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Symptom (reported 2026-05-18): "the first little sound or two in a
sentence is kind of getting skipped over." Affects every sentence,
not just the first one, and is independent of sentence length.

Root cause: `apply_fades` applied an 8 ms linear fade-in (192 samples
at 24 kHz) to the start of every synthesised sentence. Probing Pocket
TTS output across four prompts (Y/H/W/T onsets — see
`examples/pocket_onset_probe.rs`) shows real audio energy inside the
first millisecond:

  prompt                       | samples[0] | peak@1ms  rms@1ms
  ─────────────────────────────────────────────────────────────
  Yep, I can hear you.         |   0.00193  |  0.0331   0.0235
  Hello there friend.          |   0.00185  |  0.0288   0.0191
  What can I help with?        |   0.00180  |  0.0358   0.0242
  Try this experiment now.     |   0.00189  |  0.0197   0.0139

For 'Yep' and 'What' the first-1ms RMS is *equal to or greater than*
the first-5ms RMS — the consonant attack peaks inside the very window
the fade was nuking. A 0→1 linear ramp attenuated those onset samples
by ≥6 dB over the first 4 ms, which is exactly what Tyler heard as
"swallowed sounds".

`samples[0]` ≈ 0.0019 (≈ −54 dBFS) is far below any audible
DC-jump-click threshold, so removing the fade-in does not introduce
clicks. Fade-out is retained because end-of-sentence cuts *do* create
audible clicks when a non-zero waveform terminates abruptly.
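The arithmetic behind these figures can be checked with a small sketch. The helper names are hypothetical and are not taken from `pocket_onset_probe.rs`; the only assumptions are mono f32 samples and a linear 0→1 ramp.

```rust
/// Converts a linear amplitude (1.0 = full scale) to dBFS.
fn to_dbfs(amplitude: f32) -> f32 {
    20.0 * amplitude.abs().log10()
}

/// Peak and RMS over the first `window_ms` of a mono buffer.
fn onset_stats(samples: &[f32], sample_rate: u32, window_ms: u32) -> (f32, f32) {
    let n = ((sample_rate as usize * window_ms as usize) / 1000).min(samples.len());
    if n == 0 {
        return (0.0, 0.0);
    }
    let window = &samples[..n];
    let peak = window.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    let rms = (window.iter().map(|s| s * s).sum::<f32>() / n as f32).sqrt();
    (peak, rms)
}

/// Gain of a linear 0→1 fade-in lasting `fade_ms`, evaluated at `t_ms`.
fn linear_fade_gain(t_ms: f32, fade_ms: f32) -> f32 {
    (t_ms / fade_ms).clamp(0.0, 1.0)
}
```

With an 8 ms ramp, the gain at 4 ms is 0.5, which is about −6 dB, and every earlier sample is attenuated more than that. An amplitude of 0.0019 works out to roughly −54 dBFS.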

Changes (3 files, +101/−24):

- `huddle/tts.rs`:
  - Rename `apply_fades` → `apply_fade_out`. Body removes the leading
    fade loop and operates on `&mut [f32]` instead of `&mut Vec<f32>`.
  - New const `FIRST_APPEND_LEAD_IN_SAMPLES = 480` (20 ms) and a
    single-shot `player.append(zeros)` at the `first_append` site, so
    the OS audio device / rodio mixer gets a quiet ramp-up window
    *without* scaling any real synthesis samples. Applied once per
    utterance — sentence boundaries continue to use
    `INTER_SENTENCE_SILENCE` (100 ms) and don't stack on this cushion.
  - New regression test
    `apply_fade_out_does_not_touch_leading_samples` locks in that
    `samples[0..FADE_OUT_SAMPLES]` stay byte-equal to the input. Will
    fail loudly if anyone ever reintroduces a leading fade.
  - `first_append_lead_in_is_sane` pins the 20 ms × 24 kHz = 480
    constant and documents why that range is reasonable.
  - Existing `apply_fades_*` tests renamed and updated; +2 net tests
    (24 → 26 in tts.rs; 317 → 319 lib-wide).

- `examples/pocket_onset_probe.rs` (new, 137 lines): synthesises the
  four probe prompts, dumps per-prompt onset stats (samples[0],
  peak/RMS @ 1ms/5ms/20ms), and writes raw WAVs to /tmp for offline
  inspection. Documents the measurement that justifies removing the
  fade-in; runs against the same `/tmp/pocket-tts-bench` model
  directory `pocket_bench` uses.

- `desktop/scripts/check-file-sizes.mjs`: bump `tts.rs` override
  1130 → 1210 with updated description.
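A fade-out-only helper operating on `&mut [f32]` might look like the sketch below. The value of `FADE_OUT_SAMPLES` is an assumption (8 ms at 24 kHz), as is the exact ramp shape; the text above only states that the leading fade loop was removed.

```rust
/// Assumed length for this sketch: 8 ms at 24 kHz. The real constant
/// in tts.rs is not shown in this commit message.
const FADE_OUT_SAMPLES: usize = 192;

/// Applies a linear fade-out to the tail of `samples`.
/// Leading samples are never scaled, so consonant onsets stay intact.
fn apply_fade_out(samples: &mut [f32]) {
    let n = FADE_OUT_SAMPLES.min(samples.len());
    let start = samples.len() - n;
    for (i, s) in samples[start..].iter_mut().enumerate() {
        // Gain runs from just under 1.0 down to 0.0 on the final sample.
        let gain = (n - 1 - i) as f32 / n as f32;
        *s *= gain;
    }
}
```

Taking a slice rather than a `Vec` makes it explicit that the helper cannot change the buffer length, only scale the tail.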

Verification before push:

- `cargo test --lib` (full) → 319/319 pass.
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.
- Manual A/B not yet done from the worktree — Tyler will hear the
  result on `cargo run` after pull.

Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts
(initial diagnosis, Max's review of approach, probe data).

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Follow-up to f570ec0 addressing Max's PR review feedback: the
`FIRST_APPEND_LEAD_IN_SAMPLES` cushion is correctly gated by
`if first_append` in the worker loop, but nothing in the test suite
catches the one bad variant of that pad: a future refactor moving
the `if first_append` check inside the per-sentence loop would
silently stack 20 ms on top of `INTER_SENTENCE_SILENCE` at every
sentence boundary and audibly slow multi-sentence utterances.

Refactor the append decision into a pure helper and pin the
invariant:

- New `build_sentence_append_plan(first_append, boosted, silence_len)
  -> Vec<Vec<f32>>`: returns [lead_in, audio, inter_silence] on the
  first call (flipping `first_append` to false), or [audio,
  inter_silence] on every subsequent call. The worker loop now calls
  this and iterates the returned buffers, instead of conditionally
  appending inline.

- Three new tests:
  - `lead_in_pad_fires_exactly_once_per_utterance` — pumps 5
    sentences through the plan builder, counts the lead-in buffers,
    asserts exactly 1. The regression test Max specifically asked
    for.
  - `build_sentence_append_plan_flips_first_append` — pins the
    flag-mutation contract.
  - `first_sentence_leading_silence_is_exactly_lead_in` — asserts
    the lead-in is the only leading-silence buffer (no
    inter-sentence silence is emitted before the first audio
    buffer).

The worker-loop call site is now ~7 lines shorter and harder to
break: `was_first` snapshots the flag for the `tts_active.store`
gate, the plan builder owns the rest.
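The helper's contract can be sketched as below. The parameter types are assumptions inferred from the description (in particular that `first_append` is passed as `&mut bool`); the actual signature in tts.rs may differ.

```rust
const FIRST_APPEND_LEAD_IN_SAMPLES: usize = 480; // 20 ms at 24 kHz

/// Returns the buffers to append for one sentence, in playback order.
/// The zero lead-in is emitted only while `first_append` is true, and
/// the flag is flipped so later sentences never get a second lead-in.
fn build_sentence_append_plan(
    first_append: &mut bool,
    boosted: Vec<f32>,
    silence_len: usize,
) -> Vec<Vec<f32>> {
    let mut plan = Vec::with_capacity(3);
    if *first_append {
        plan.push(vec![0.0; FIRST_APPEND_LEAD_IN_SAMPLES]);
        *first_append = false;
    }
    plan.push(boosted);
    plan.push(vec![0.0; silence_len]); // inter-sentence silence
    plan
}
```

Because the function is pure apart from the flag, a test can pump several sentences through it and count the lead-in buffers without needing an audio device.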

Verification:

- `cargo test --lib` → 322/322 pass (was 319 → +3 new).
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.

Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts,
specifically Max's messages [11]/[12]/[13] requesting the
once-per-utterance test before merge.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Pocket TTS' FlowLM has an autoregressive cold-start: the first 2-3
generation steps run without audio context in the KV cache, occasionally
smearing or dropping the first phoneme of short utterances. Tyler
reproduced this on 'I'm happy.' rendering as 'm happy', and on other
'I'm X' constructions across random seeds. The bug is documented
upstream as kyutai-labs/pocket-tts #91 (8 comments, 2 collaborators
acknowledged) and #70, with collateral discussion at sherpa-onnx #3180.

Earlier commits in this branch reduced but did not eliminate the
failure: 773a2a1 added 8-space padding; 1dbfa2c restored sherpa's
`frames_after_eos` default of 3 (fixing a separate static-burst
regression); f570ec0 dropped the leading fade-in. Empirical study at
production settings (silence_scale=0, frames_after_eos=3 default)
confirmed that temperature, silence_scale, seed, and pad tweaks are
all insufficient — the model's stochastic sampling lands on a bad
trajectory often enough to be perceptible on short prompts.

This commit applies the upstream-documented sacrificial-word workaround
(ikidd in kyutai-labs/pocket-tts #70) with two refinements:

  1. Sacrificial prefix '. . ' (two periods + space) instead of a word.
     The pair was empirically the only variant in our probe that
     produced a usable post-sacrificial silence gap on every random
     seed in the 8-seed × 8-variant matrix (`sacrificial_probe`,
     iterated locally during investigation); a single period failed on
     seed=99999. Periods render as low-amplitude breath rather than
     spoken audio.

  2. Post-synth trim: scan from t=30ms looking for the first run of
     samples below 0.02 lasting >= 50 ms — that's the sacrificial→main
     boundary. `Vec::drain` everything before the gap-end. If no gap
     is found or the boundary lies beyond 1.2 s (production max-drop
     bound), bail out and emit the raw buffer rather than corrupt the
     audio. We don't insert a zero lead-in here because tts.rs's
     existing FIRST_APPEND_LEAD_IN_SAMPLES already provides the
     OS-device warm-up cushion on the first append of an utterance,
     and subsequent sentences are buffered by INTER_SENTENCE_SILENCE.
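The post-synth trim can be sketched as follows. This is an illustration under stated assumptions: a 24 kHz mono buffer, constant names chosen for the sketch, and the choice to treat a silence run that reaches the end of the buffer as "no gap found". The real `trim_leading_cold_start` in pocket.rs may handle these details differently.

```rust
const SAMPLE_RATE: usize = 24_000;
const SCAN_START: usize = SAMPLE_RATE * 30 / 1000; // 30 ms
const MIN_GAP: usize = SAMPLE_RATE * 50 / 1000; // 50 ms
const MAX_DROP: usize = SAMPLE_RATE * 1200 / 1000; // 1.2 s
const SILENCE_THRESHOLD: f32 = 0.02;

/// Drops everything before the end of the first >= 50 ms quiet run found
/// after 30 ms. Returns false and leaves the buffer untouched when no such
/// run exists or when its end lies beyond the 1.2 s max-drop bound.
fn trim_leading_cold_start(samples: &mut Vec<f32>) -> bool {
    if samples.len() <= SCAN_START {
        return false; // too short to scan
    }
    let mut run_start: Option<usize> = None;
    for i in SCAN_START..samples.len() {
        if samples[i].abs() < SILENCE_THRESHOLD {
            run_start.get_or_insert(i); // remember where the quiet run began
        } else if let Some(start) = run_start.take() {
            // `i` is the first loud sample after a quiet run.
            if i - start >= MIN_GAP {
                if i > MAX_DROP {
                    return false; // boundary too late: keep the raw audio
                }
                samples.drain(..i);
                return true;
            }
        }
    }
    false
}
```

Returning a flag instead of silently trimming lets the caller log or count the bail-out cases, which is where the raw buffer is emitted unchanged.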

Both the prefix and the trim are gated on PreparedPrompt::is_short
(<= 4 words after preprocessing, matches upstream's
pad_with_spaces_for_short_inputs predicate). Long prompts pass through
unchanged: the first phoneme of a long utterance has enough downstream
context to avoid the smear, and a natural early pause like the comma in
'Hello, how can I help you?' would otherwise be misdetected by the
trimmer as the sacrificial gap (Max caught this in review — thanks).

Also: bump TARGET_PEAK in tts.rs from -6 dBFS (0.501) to -3 dBFS
(0.708) per Tyler. This is a ceiling on per-sentence loudness
normalization, not a floor — quieter Pocket utterances under MAX_GAIN=8
will still land below the ceiling (bench-typical peak 0.076 lands at
0.608, ~-4.3 dBFS). Comment updated to reflect that nuance.

Probe data (see examples/prod_probe.rs; production GenerationConfig:
silence_scale=0.0, frames_after_eos default 3, max_frames=100 short).
Tested 5 prompts × 5 seeds with the new code path:

  Short prompts ('I'm happy', 'I'm sorry', 'I'm ready', 'Yep',
  'I see you') with sacrificial prefix:
    25/25 produced a >=50ms silence gap in the 30-340ms range.
    Trim drops 47-339ms; final audio 270-748ms.

  Long prompts without sacrificial (regression check):
    'Hello, how can I help you today?' and 'Yes, that works. Let me
    try again.' generate normally; comma pauses preserved.

Tyler ear-confirmed the trimmed short-prompt output:
  > these are much better! I like this!

Max reviewed twice — first flagging a silence_scale mismatch between
probe (silence_scale=1.0) and production (0.0), then flagging the
destructive-edge hazard if trim ran on un-sacrificed long prompts.
Both are addressed: prod_probe mirrors production GenerationConfig
exactly (silence_scale=0.0, no frames_after_eos override per 1dbfa2c),
and the trim is gated on is_short with a 1.2s max-drop bound as
belt-and-suspenders against the destructive edge case.

Tests added (in pocket.rs):
  - prepare_prompt_inserts_sacrificial_prefix_only_for_short:
    pins the exact ordering (pad + '. . ' + cleaned).
  - prepare_prompt_threshold_is_inclusive_at_four_words extended to
    assert is_short and SACRIFICIAL_PREFIX absence on long input.
  - trim_strips_sacrificial_and_keeps_only_speech: feed a synthetic
    sacrificial+gap+speech buffer; assert leading sample is speech.
  - trim_is_noop_when_no_long_silence_gap_exists
  - trim_is_noop_when_gap_is_shorter_than_threshold
  - trim_is_noop_when_gap_is_beyond_max_drop_bound: guards the
    destructive-edge case Max flagged.
  - trim_is_noop_on_buffer_smaller_than_scan_start: no panic.
  - trim_constants_use_sane_units: pins millisecond meanings.

Tests added (in tts.rs):
  - normalize_for_playback_clamps_at_max_gain_below_target: new
    behaviour under the -3 dBFS ceiling for bench-typical peaks.
  - normalize_for_playback_hits_target_on_quiet_buffer updated for
    new MAX_GAIN saturation point (0.0885) on the input side.

All 330 cargo test --lib pass. cargo fmt --check and
desktop/scripts/check-file-sizes.mjs are green. pocket.rs cap 620 →
900, tts.rs cap 1335 → 1380.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Max noted that prod_probe's header advertised _sac variants as 'the new
path' but the WAVs written were actually the raw engine output, not the
post-trim audio that synth_chunk returns to tts.rs. Anyone listening to
those files would have heard the sacrificial breath at the start —
misleading for ear-testing the fix.

Mirror trim_leading_cold_start (and its constants) inline in the probe.
For short prompts with a sacrificial prefix, write both files:

  /tmp/prod_<label>_s<seed>_raw.wav      — raw engine output
  /tmp/prod_<label>_s<seed>_trimmed.wav  — what production actually plays

Long prompts (no sacrificial, no trim) only get the _raw variant since
that's what synth_chunk returns for them in production.

Header rewritten to match. Sample data after the change:

  imhappy_sac_s99999:  raw 472ms (gap 50..144ms) → trim 328ms
  yep_sac_s42:         raw 270ms (gap 30..141ms) → trim 129ms
  imhappy_sac_s314159: raw 730ms (gap 43..339ms) → trim 391ms

(Trim length == raw_len - gap_end_ms, matching expectations.)

The inline trim is a deliberate copy of huddle::pocket — the example
sits in desktop/src-tauri/examples which can't reach into the private
huddle module. Comment at the top of the constants block flags the
'keep in sync' contract.

All 330 cargo test --lib still pass; file-sizes still green.

Non-blocking cleanup from Max's review of 61d064d.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
@tlongwell-block tlongwell-block merged commit c0eb5af into main May 18, 2026
14 of 15 checks passed
@tlongwell-block tlongwell-block deleted the switch-tts-to-pocket branch May 18, 2026 18:28