refactor(huddle): switch local TTS to Pocket TTS (Kyutai/sherpa-onnx) #607
Merged
Replace the Kokoro-via-`ort` TTS path with Pocket TTS (Kyutai Labs,
Apache-2.0) running through the same `sherpa-onnx` static library already
used by STT. This removes the entire `ort`/`ort-sys`/`ndarray` direct
dependency tree (and the runtime `libonnxruntime.dylib` it shipped) in
exchange for one statically-linked onnxruntime shared with STT.
Headline wins
-------------
* No more `libonnxruntime.dylib` in the macOS bundle — eliminates the
ad-hoc-signed dylib + gatekeeper-quarantine class of "damaged app" failures
on older Macs. Verified by `find … '*onnx*'`, `find … '*.dylib'`, and
`otool -L Contents/MacOS/Sprout` on a full unsigned `tauri build` artifact:
zero hits, no dynamic ORT link.
* `cargo tree -i ort` / `-i ort-sys` / `-i ndarray` all return "package ID
specification did not match any packages." Cargo.lock shrinks by 334 lines
(those crates + their now-orphaned support deps).
* Net -735 LOC across the codebase (973-line `huddle/kokoro.rs` deleted, 238
added in `huddle/pocket.rs`).
Tradeoff: CoreML
----------------
Pocket runs CPU-only through sherpa's statically-linked onnxruntime — the
upstream k2-fsa static archive does not bundle the CoreML execution
provider, and building one ourselves would re-introduce the same dylib
distribution problem we are deleting. Bench on M1 8GB shows ~5× realtime
warm synthesis (~680ms for a 75-char sentence) and ~289ms one-time engine
load, which is acceptable for huddle TTS without CoreML.
Voice
-----
Bundled reference is the Kyutai-provided American-male sample. Previous
`af_heart` (Kokoro, American-female) is gone. Per-sentence peak-normalized
to −6 dBFS in `tts.rs::normalize_for_playback` with `MAX_GAIN=8.0` to cap
over-amplification of near-silent buffers.
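A minimal sketch of the per-sentence peak normalization described above. The function name, `MAX_GAIN=8.0`, and the −6 dBFS target come from this message; the exact signature, the returned gain, and the handling of an all-silent buffer are assumptions made for illustration, not the code in `tts.rs`.

```rust
/// Target peak for playback: -6 dBFS as a linear amplitude, 10^(-6/20).
const TARGET_PEAK: f32 = 0.501;
/// Cap on the applied gain so near-silent buffers are not boosted into noise.
const MAX_GAIN: f32 = 8.0;

/// Scale `samples` in place so the loudest sample reaches TARGET_PEAK,
/// unless that would need more than MAX_GAIN. Returns the gain applied.
/// (Returning the gain is an assumption of this sketch.)
fn normalize_for_playback(samples: &mut [f32]) -> f32 {
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak <= f32::EPSILON {
        // All-silent buffer: nothing to scale (assumed behaviour).
        return 1.0;
    }
    let gain = (TARGET_PEAK / peak).min(MAX_GAIN);
    for s in samples.iter_mut() {
        *s *= gain;
    }
    gain
}
```

With these constants, any buffer whose peak is below `TARGET_PEAK / MAX_GAIN` (about 0.063) hits the gain cap and lands under the target instead of on it.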
Model bundle
------------
~289 MB pre-download at first huddle (was ~187 MB for Kokoro): five ONNX
models (text+duration+prosody+depth+acoustic), two JSON tables
(vocab + token_scores), one voice WAV, the upstream LICENSE, and a Sprout
`MODEL_LICENSE.txt` attribution sidecar. All files SHA-256-pinned in
`huddle/models.rs`; readiness now fails-closed without the sidecar
(`tts_readiness_requires_license_sidecar` test).
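The fail-closed readiness rule can be sketched roughly as follows. This is an illustration only: the file list is a hypothetical subset (the ONNX and JSON names are not given here), the SHA-256 check is left out to keep the sketch dependency-free, and the real check lives in `huddle/models.rs`.

```rust
use std::path::Path;

/// Hypothetical subset of the bundle; the real list (five ONNX models,
/// two JSON tables, ...) and its SHA-256 pins are in `huddle/models.rs`.
const REQUIRED_FILES: &[&str] = &[
    "reference_sample.wav",
    "LICENSE",
    "MODEL_LICENSE.txt", // attribution sidecar
];

/// Ready only if every required file exists. A missing sidecar therefore
/// makes the whole bundle "not ready" (fail closed).
fn is_tts_ready(model_dir: &Path) -> bool {
    REQUIRED_FILES
        .iter()
        .all(|name| model_dir.join(name).is_file())
}
```

Treating the sidecar as just another required file keeps the attribution requirement on the same code path as the model files, so it cannot be skipped by accident.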
API rename
----------
`VoiceModelStatus.kokoro` → `.tts`; `ModelManager.kokoro_*` →
`ModelManager.tts_*`; `start_kokoro_download` → `start_tts_download`;
`is_kokoro_ready` → `is_tts_ready`; `modelStatus.kokoro` → `modelStatus.tts`
in the frontend. Status surface is now engine-agnostic, so a future TTS
swap won't need another rename cycle.
Verification
------------
- `cargo fmt --check` clean
- `cargo test --lib` 305/305 (was 299; +5 normalization, +1 sidecar
readiness)
- `cargo test --lib huddle::tts` 24/24
- `cargo build --release --lib` clean in 23s
- `just desktop-tauri-check` clean
- `pnpm build` clean (frontend)
- Full unsigned `tauri build` (aarch64-apple-darwin) produces signed-ready
Sprout.app + DMG; bundle inspection above confirms no ORT artifacts ship.
How to validate locally
-----------------------
1. Build + launch the desktop app.
2. Join a huddle. Watch stderr for `TTS warmup completed in {N}ms` —
expect ≤1500ms on Apple Silicon with 8GB+. >1500ms suggests memory
pressure / paging from concurrent STT+TTS init.
3. Confirm no `OfflineRecognizer::create returned None` in stderr (STT
load path).
4. Activity Monitor: sprout RSS should settle <800 MB after huddle join.
If >800 MB, the spawn order can be serialized (load TTS engine before
spawning STT worker) as a one-line follow-up.
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Replace the anonymous reference WAV bundled with KevinAHM's Pocket TTS
ONNX export with the 'Mary (f, conversation)' preset from the Kyutai TTS
demo (https://kyutai.org/tts), which maps to vctk/p333_023_enhanced.wav
in kyutai/tts-voices.
Source pin:
- repo: huggingface.co/kyutai/tts-voices
- commit: 323332d33f997de8394f24a193e1a76df720e01a
- path: vctk/p333_023_enhanced.wav
- format: 16-bit mono PCM, 32 kHz, 639,084 bytes
- sha256: a35b0468382218e9f37a9a7494d1e4b74deaf18d7ced22265b4e325bb55c183f
- license: CC-BY-4.0 (VCTK base + ai-coustics enhancement)
The on-disk filename remains 'reference_sample.wav' so engine and bench
code stay voice-agnostic. sherpa-onnx resamples internally via
reference_sample_rate, so the 16 kHz → 32 kHz source change is
transparent to the synthesis pipeline (only the load_voice_style doc
comment in pocket.rs needed updating).
Changes:
- models.rs: new POCKET_REFERENCE_WAV_URL pin, hash swap,
TTS_MODEL_VERSION 1→2 (so existing dev installs re-download cleanly
without hitting the hash-fail-then-refetch transient), expanded
TTS_LICENSE_TEXT block with VCTK + ai-coustics attribution per
CC-BY-4.0 §3(a)(1).
- pocket.rs: module-doc attribution entry + load_voice_style doc comment
reflect the new 32 kHz source and Mary's provenance.
- check-file-sizes.mjs: models.rs override 930 → 950 (attribution block
added ~22 lines).
Verified:
- cargo test --lib (huddle::): 48/48 pass, including
tts_readiness_requires_license_sidecar.
- cargo test --lib (full): 305/305 pass.
- cargo check --lib: clean.
- check-file-sizes.mjs: clean.
- Live URL fetch: SHA-256 of downloaded file matches TTS_FILE_HASHES
entry.
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Pocket TTS' autoregressive LM has a stochastic sampler (temp=0.7, random
seed) and a hard 500-frame ceiling (~40s of audio at 12.5Hz Mimi frame
rate). On very short, unpunctuated, or lowercase inputs, the EOS logit
(threshold > -4) sometimes never fires within those 500 frames, so the
model produces tens of seconds of nonsensical "breathing" output.
Tyler hit this on the first 'yep' utterance after the Mary swap landed:
~30s of monster noise, then subsequent utterances were fine. Non-
deterministic by design — the next sampled trajectory escaped the trap.
Sherpa-onnx's C++ Pocket TTS impl does not run the prompt preparation
that upstream kyutai-labs/pocket-tts applies in Python. This commit
mirrors that recipe locally:
prepare_pocket_prompt():
1. Collapse interior whitespace.
2. Capitalize the first letter.
3. Append '.' if no terminal punctuation.
4. If <=4 words: prepend 8 spaces; bump frames_after_eos 1 -> 3.
5. Compute adaptive max_frames from word count (saturates at 500).
synth_chunk():
- Routes input through prepare_pocket_prompt before calling
generate_with_config.
- Plumbs frames_after_eos + max_frames via GenerationConfig.extra
(HashMap<String, serde_json::Value>) — exposed by the sherpa-onnx
1.12 Rust binding as a per-call escape hatch.
Source: pocket_tts.models.tts_model.prepare_text_prompt and
_estimate_max_gen_len in https://github.com/kyutai-labs/pocket-tts.
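Steps 1 to 4 of the recipe above can be sketched as follows. This is not the code in `pocket.rs`: the `PreparedPrompt` fields, the `None` result for empty input, and the set of characters treated as terminal punctuation are assumptions, and step 5 (adaptive max_frames) is omitted because its formula is not given here.

```rust
/// Result of prompt preparation. The field layout is an assumption.
#[derive(Debug, PartialEq)]
struct PreparedPrompt {
    text: String,
    is_short: bool,
}

const SHORT_WORD_LIMIT: usize = 4;
const SHORT_PAD: &str = "        "; // 8 spaces

fn prepare_pocket_prompt(raw: &str) -> Option<PreparedPrompt> {
    // 1. Collapse interior whitespace (this also trims both ends).
    let words: Vec<&str> = raw.split_whitespace().collect();
    if words.is_empty() {
        return None; // assumed behaviour for empty input
    }
    let collapsed = words.join(" ");

    // 2. Capitalize the first letter, Unicode-aware.
    let mut chars = collapsed.chars();
    let first = chars.next()?;
    let mut text: String = first.to_uppercase().collect();
    text.push_str(chars.as_str());

    // 3. Append '.' when there is no terminal punctuation.
    //    Assumption: '.', '!' and '?' count as terminal.
    if !matches!(text.chars().last(), Some('.' | '!' | '?')) {
        text.push('.');
    }

    // 4. Short prompts (<= 4 words) get 8 leading spaces.
    let is_short = words.len() <= SHORT_WORD_LIMIT;
    if is_short {
        text.insert_str(0, SHORT_PAD);
    }
    Some(PreparedPrompt { text, is_short })
}
```

Counting words before padding keeps the short/long decision independent of the spaces that the padding itself adds.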
12 new unit tests cover: empty input, the literal 'yep' case (must
become '        Yep.', i.e. 8 leading spaces), threshold inclusivity at
4 vs 5 words, preserved
existing punctuation, whitespace collapsing, no-double-capitalize,
non-ASCII first letter (Cyrillic д -> Д), max_frames tightness/clamping/
monotonicity.
Test results:
cargo test --lib huddle::pocket -> 12/12 pass
cargo test --lib -> 317/317 pass (was 305 + 12 new)
cargo fmt --check -> clean
cargo clippy huddle::pocket -> clean (preexisting doc-list
overindent warning at line
41 not introduced here)
node check-file-sizes.mjs -> clean (pocket.rs override added)
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
…mpts
The earlier prep-prompt fix (commit 773a2a1) bounded the runaway
'monster breathing' bug on short inputs, but did so by forcing
`frames_after_eos = 1` on every prompt with ≥5 words. That's *lower*
than the sherpa-onnx upstream default of 3, and it clipped the leading
audio of multi-clause sentences:
Input: 'Yep, I can hear you. What can I help with?'
Output: a quick burst of static where 'Yep, I can hear you.' should be.
The sentence splitter emits 'Yep, I can hear you.' (5 words, comma not a
boundary) as the first chunk, which under 773a2a1 hit the long-input
branch and got frames_after_eos=1: too few trailing LM frames after EOS
fires for the codec to settle, hence the static-burst.
Re-reading offline-tts-pocket-impl.h confirmed two things:
1. The upstream default for frames_after_eos is **3**, not 1. My 773a2a1
docs had this backwards.
2. The 500-frame max_frames default (~40s) is fine for any reasonable
prompt. We only need a tighter cap on the short-input path where the
EOS-never-fires runaway bug originally manifested.
This commit:
- Removes the per-call frames_after_eos override entirely. sherpa-onnx
keeps its default of 3 for every prompt, including short ones (which
upstream pocket_tts.py also bumps to 3, so the default is right).
- Keeps the max_frames override, but only on short (≤4-word) padded
inputs, set to a generous 100 frames (~8s), enough slack to never
truncate a legitimate short reply.
- Removes estimate_max_frames() (no callers; the new short cap is a
fixed constant, not adaptive).
- Refactors the extra-HashMap builder into build_generation_extra() so
we can structurally regression-test it.
- Adds property test
build_extra_never_lowers_frames_after_eos_for_any_word_count that
sweeps a range of prompt lengths and fails CI if anyone ever
reintroduces a frames_after_eos override below the upstream default.
- Adds build_extra_long_prompt_is_none which pins down the specific
'Yep, I can hear you.' regression from this report.
Tests: cargo test --lib huddle::pocket → 12/12 pass.
Refs commit 773a2a1 (initial prep-prompt fix).
Bumps pocket.rs line-size override 560 → 620 (refactored helper + new
tests).
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
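The override policy from this commit can be sketched as a small pure function. Assumptions: the `is_short` parameter, the `i64` value type (the binding's `extra` map is described above as `HashMap<String, serde_json::Value>`), and the exact key strings are chosen for illustration and may not match `build_generation_extra()` in `pocket.rs`.

```rust
use std::collections::HashMap;

/// Fixed cap for short prompts: 100 frames, ~8 s at the 12.5 Hz frame rate.
const SHORT_MAX_FRAMES: i64 = 100;

/// Per-call overrides. `None` means "use the sherpa-onnx defaults".
fn build_generation_extra(is_short: bool) -> Option<HashMap<String, i64>> {
    if !is_short {
        // Long prompts: no overrides at all.
        return None;
    }
    let mut extra = HashMap::new();
    extra.insert("max_frames".to_string(), SHORT_MAX_FRAMES);
    // Deliberately no "frames_after_eos" entry, so the upstream default
    // of 3 applies to every prompt.
    Some(extra)
}
```

Returning `None` for long prompts makes "no override" a structural property that a test can assert on, instead of relying on a map that merely happens to be empty.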
Symptom (reported 2026-05-18): "the first little sound or two in a
sentence is kind of getting skipped over." Affects every sentence,
not just the first one, and is independent of sentence length.
Root cause: `apply_fades` applied an 8 ms linear fade-in (192 samples
at 24 kHz) to the start of every synthesised sentence. Probing Pocket
TTS output across four prompts (Y/H/W/T onsets — see
`examples/pocket_onset_probe.rs`) shows real audio energy inside the
first millisecond:
prompt                   | samples[0] | peak@1ms  rms@1ms
──────────────────────────────────────────────────────────
Yep, I can hear you.     | 0.00193    | 0.0331    0.0235
Hello there friend.      | 0.00185    | 0.0288    0.0191
What can I help with?    | 0.00180    | 0.0358    0.0242
Try this experiment now. | 0.00189    | 0.0197    0.0139
For 'Yep' and 'What' the first-1ms RMS is *equal to or greater than*
the first-5ms RMS — the consonant attack peaks inside the very window
the fade was nuking. A 0→1 linear ramp attenuated those onset samples
by ≥6 dB over the first 4 ms, which is exactly what Tyler heard as
"swallowed sounds".
`samples[0]` ≈ 0.0019 (≈ −54 dBFS) is far below any audible
DC-jump-click threshold, so removing the fade-in does not introduce
clicks. Fade-out is retained because end-of-sentence cuts *do* create
audible clicks when a non-zero waveform terminates abruptly.
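The dB figures above can be checked with a few lines of arithmetic. This sketch assumes the removed fade-in multiplied sample `i` by `i / len` (a plain 0→1 linear ramp over 192 samples); the helper names are made up for illustration.

```rust
/// Gain in dB applied by a 0→1 linear fade-in at sample index `i`,
/// for a ramp that is `len` samples long. Assumes gain(i) = i / len.
fn ramp_gain_db(i: usize, len: usize) -> f32 {
    let gain = i as f32 / len as f32;
    20.0 * gain.log10()
}

/// Level of a linear amplitude relative to full scale (1.0), in dBFS.
fn dbfs(amplitude: f32) -> f32 {
    20.0 * amplitude.abs().log10()
}
```

At 24 kHz, 4 ms is sample 96 of the 192-sample ramp, where the gain is 0.5 (about −6 dB); every earlier sample is attenuated more, and at 1 ms (sample 24) the gain is 0.125, roughly −18 dB. `dbfs(0.0019)` gives the ≈ −54 dBFS quoted for `samples[0]`.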
Changes (3 files, +101/−24):
- `huddle/tts.rs`:
- Rename `apply_fades` → `apply_fade_out`. Body removes the leading
fade loop and operates on `&mut [f32]` instead of `&mut Vec<f32>`.
- New const `FIRST_APPEND_LEAD_IN_SAMPLES = 480` (20 ms) and a
single-shot `player.append(zeros)` at the `first_append` site, so
the OS audio device / rodio mixer gets a quiet ramp-up window
*without* scaling any real synthesis samples. Applied once per
utterance — sentence boundaries continue to use
`INTER_SENTENCE_SILENCE` (100 ms) and don't stack on this cushion.
- New regression test
`apply_fade_out_does_not_touch_leading_samples` locks in that
`samples[0..FADE_OUT_SAMPLES]` is byte-equal to the input. Will
fail loudly if anyone ever reintroduces a leading fade.
- `first_append_lead_in_is_sane` pins the 20 ms × 24 kHz = 480
constant and documents why that range is reasonable.
- Existing `apply_fades_*` tests renamed and updated; +2 net tests
(24 → 26 in tts.rs; 317 → 319 lib-wide).
- `examples/pocket_onset_probe.rs` (new, 137 lines): synthesises the
four probe prompts, dumps per-prompt onset stats (samples[0],
peak/RMS @ 1ms/5ms/20ms), and writes raw WAVs to /tmp for offline
inspection. Documents the measurement that justifies removing the
fade-in; runs against the same `/tmp/pocket-tts-bench` model
directory `pocket_bench` uses.
- `desktop/scripts/check-file-sizes.mjs`: bump `tts.rs` override
1130 → 1210 with updated description.
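A minimal sketch of a fade-out-only helper in the spirit of the `apply_fade_out` change listed above. The value of `FADE_OUT_SAMPLES` (192, i.e. 8 ms at 24 kHz, the same length as the removed fade-in) and the linear ramp shape are assumptions; the real body lives in `huddle/tts.rs`.

```rust
/// Assumed length of the tail fade: 8 ms at 24 kHz.
const FADE_OUT_SAMPLES: usize = 192;

/// Fade the tail of `samples` linearly towards zero. Leading samples are
/// never touched, so consonant onsets keep their full level.
fn apply_fade_out(samples: &mut [f32]) {
    let n = FADE_OUT_SAMPLES.min(samples.len());
    let start = samples.len() - n;
    for (i, s) in samples[start..].iter_mut().enumerate() {
        // Gain falls from (n - 1) / n on the first faded sample to 0.0
        // on the last one.
        let gain = (n - 1 - i) as f32 / n as f32;
        *s *= gain;
    }
}
```

Taking `&mut [f32]` rather than `&mut Vec<f32>` matches the signature change described above and lets callers fade any sub-slice without reallocating.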
Verification before push:
- `cargo test --lib` (full) → 319/319 pass.
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.
- Manual A/B not yet done from the worktree — Tyler will hear the
result on `cargo run` after pull.
Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts
(initial diagnosis, Max's review of approach, probe data).
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Follow-up to f570ec0 addressing Max's PR review feedback: the
`FIRST_APPEND_LEAD_IN_SAMPLES` cushion is correctly gated by
`if first_append` in the worker loop, but nothing in the test suite
catches the only-bad-version of that pad. A future refactor moving the
`if first_append` check inside the per-sentence loop would silently
stack 20 ms on top of `INTER_SENTENCE_SILENCE` at every sentence
boundary and audibly slow multi-sentence utterances.
Refactor the append decision into a pure helper and pin the invariant:
- New `build_sentence_append_plan(first_append, boosted, silence_len)
-> Vec<Vec<f32>>`: returns [lead_in, audio, inter_silence] on the first
call (flipping `first_append` to false), or [audio, inter_silence] on
every subsequent call. The worker loop now calls this and iterates the
returned buffers, instead of conditionally appending inline.
- Three new tests:
- `lead_in_pad_fires_exactly_once_per_utterance`: pumps 5 sentences
through the plan builder, counts the lead-in buffers, asserts exactly
1. The regression test Max specifically asked for.
- `build_sentence_append_plan_flips_first_append`: pins the
flag-mutation contract.
- `first_sentence_leading_silence_is_exactly_lead_in`: asserts the
lead-in is the only leading-silence buffer (no inter-sentence silence
is emitted before the first audio buffer).
The worker-loop call site is now ~7 lines shorter and harder to break:
`was_first` snapshots the flag for the `tts_active.store` gate, the plan
builder owns the rest.
Verification:
- `cargo test --lib` → 322/322 pass (was 319 → +3 new).
- `cargo fmt --check` clean.
- `cargo check` (desktop tauri crate) clean.
- `pnpm check` (biome + file-sizes) clean.
Discussion: thread root c0f5988e in #sprout-desktop-lighter-tts,
specifically Max's messages [11]/[12]/[13] requesting the
once-per-utterance test before merge.
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
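The plan builder from this commit can be sketched as below. The name, the return type, and the flag-flipping contract come from the message above; the parameter types (`&mut bool`, an owned `Vec<f32>`, a sample count) are assumptions, and the worker loop that consumes the plan is not shown.

```rust
/// 20 ms of silence at 24 kHz, appended once per utterance.
const FIRST_APPEND_LEAD_IN_SAMPLES: usize = 480;

/// Returns the buffers to append for one sentence:
/// [lead_in, audio, inter_silence] on the first call, and
/// [audio, inter_silence] on every later call.
fn build_sentence_append_plan(
    first_append: &mut bool,
    boosted: Vec<f32>,
    silence_len: usize,
) -> Vec<Vec<f32>> {
    let mut plan = Vec::with_capacity(3);
    if *first_append {
        plan.push(vec![0.0; FIRST_APPEND_LEAD_IN_SAMPLES]);
        // Flip the flag here so the lead-in cannot fire twice.
        *first_append = false;
    }
    plan.push(boosted);
    plan.push(vec![0.0; silence_len]);
    plan
}
```

Because the helper is pure apart from the flag, the once-per-utterance invariant can be tested by calling it several times and counting the three-buffer plans, without spinning up an audio device.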
Pocket TTS' FlowLM has an autoregressive cold-start: the first 2-3
generation steps run without audio context in the KV cache, occasionally
smearing or dropping the first phoneme of short utterances. Tyler
reproduced this on 'I'm happy.' rendering as 'm happy', and on other
'I'm X' constructions across random seeds. The bug is documented
upstream as kyutai-labs/pocket-tts #91 (8 comments, 2 collaborators
acknowledged) and #70, with collateral discussion at sherpa-onnx #3180.
Earlier commits in this branch reduced but did not eliminate the
failure: 773a2a1 added 8-space padding; 1dbfa2c restored sherpa's
`frames_after_eos` default of 3 (fixing a separate static-burst
regression); f570ec0 dropped the leading fade-in. Empirical study at
production settings (silence_scale=0, frames_after_eos=3 default)
confirmed that temperature, silence_scale, seed, and pad tweaks are all
insufficient: the model's stochastic sampling lands on a bad trajectory
often enough to be perceptible on short prompts.
This commit applies the upstream-documented sacrificial-word workaround
(ikidd in kyutai-labs/pocket-tts #70) with two refinements:
1. Sacrificial prefix '. . ' (two periods + space) instead of a word.
The pair was empirically the only variant in our probe that produced a
usable post-sacrificial silence gap on every random seed in the
8-seed × 8-variant matrix (`sacrificial_probe`, iterated locally during
investigation); a single period failed on seed=99999. Periods render as
low-amplitude breath rather than spoken audio.
2. Post-synth trim: scan from t=30ms looking for the first run of
samples below 0.02 lasting >= 50 ms; that's the sacrificial→main
boundary. `Vec::drain` everything before the gap-end. If no gap is found
or the boundary lies beyond 1.2 s (production max-drop bound), bail out
and emit the raw buffer rather than corrupt the audio.
We don't insert a zero lead-in here because tts.rs's existing
FIRST_APPEND_LEAD_IN_SAMPLES already provides the OS-device warm-up
cushion on the first append of an utterance, and subsequent sentences
are buffered by INTER_SENTENCE_SILENCE.
Both the prefix and the trim are gated on PreparedPrompt::is_short
(<= 4 words after preprocessing, matches upstream's
pad_with_spaces_for_short_inputs predicate). Long prompts pass through
unchanged: the first phoneme of a long utterance has enough downstream
context to avoid the smear, and a natural early pause like the comma in
'Hello, how can I help you?' would otherwise be misdetected by the
trimmer as the sacrificial gap (Max caught this in review, thanks).
Also: bump TARGET_PEAK in tts.rs from -6 dBFS (0.501) to -3 dBFS (0.708)
per Tyler. This is a ceiling on per-sentence loudness normalization, not
a floor: quieter Pocket utterances under MAX_GAIN=8 will still land
below the ceiling (bench-typical peak 0.076 lands at 0.608, ~-4.3 dBFS).
Comment updated to reflect that nuance.
Probe data (see examples/prod_probe.rs; production GenerationConfig:
silence_scale=0.0, frames_after_eos default 3, max_frames=100 short).
Tested 5 prompts × 5 seeds with the new code path:
- Short prompts ('I'm happy', 'I'm sorry', 'I'm ready', 'Yep',
'I see you') with sacrificial prefix: 25/25 produced a >=50ms silence
gap in the 30-340ms range. Trim drops 47-339ms; final audio 270-748ms.
- Long prompts without sacrificial (regression check): 'Hello, how can I
help you today?' and 'Yes, that works. Let me try again.' generate
normally; comma pauses preserved.
Tyler ear-confirmed the trimmed short-prompt output:
> these are much better! I like this!
Max reviewed twice, first flagging a silence_scale mismatch between
probe (silence_scale=1.0) and production (0.0), then flagging the
destructive-edge hazard if trim ran on un-sacrificed long prompts. Both
are addressed: prod_probe mirrors production GenerationConfig exactly
(silence_scale=0.0, no frames_after_eos override per 1dbfa2c), and the
trim is gated on is_short with a 1.2s max-drop bound as
belt-and-suspenders against the destructive edge case.
Tests added (in pocket.rs):
- prepare_prompt_inserts_sacrificial_prefix_only_for_short: pins the
exact ordering (pad + '. . ' + cleaned).
- prepare_prompt_threshold_is_inclusive_at_four_words extended to assert
is_short and SACRIFICIAL_PREFIX absence on long input.
- trim_strips_sacrificial_and_keeps_only_speech: feed a synthetic
sacrificial+gap+speech buffer; assert leading sample is speech.
- trim_is_noop_when_no_long_silence_gap_exists
- trim_is_noop_when_gap_is_shorter_than_threshold
- trim_is_noop_when_gap_is_beyond_max_drop_bound: guards the
destructive-edge case Max flagged.
- trim_is_noop_on_buffer_smaller_than_scan_start: no panic.
- trim_constants_use_sane_units: pins millisecond meanings.
Tests added (in tts.rs):
- normalize_for_playback_clamps_at_max_gain_below_target: new behaviour
under the -3 dBFS ceiling for bench-typical peaks.
- normalize_for_playback_hits_target_on_quiet_buffer updated for new
MAX_GAIN saturation point (0.0885) on the input side.
All 330 cargo test --lib pass. cargo fmt --check and
desktop/scripts/check-file-sizes.mjs are green. pocket.rs cap 620 → 900,
tts.rs cap 1335 → 1380.
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
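The post-synth trim described in this commit can be sketched roughly as follows. The thresholds (30 ms scan start, 0.02 amplitude, 50 ms minimum gap, 1.2 s max drop) and the 24 kHz rate come from the commit messages; the constant names, the returned drop count, and the no-op when the quiet run reaches the end of the buffer are assumptions of this sketch, not necessarily what `trim_leading_cold_start` does.

```rust
const SAMPLE_RATE: usize = 24_000;
const SCAN_START_MS: usize = 30;
const MIN_GAP_MS: usize = 50;
const MAX_DROP_MS: usize = 1_200;
const SILENCE_THRESHOLD: f32 = 0.02;

const fn ms_to_samples(ms: usize) -> usize {
    ms * SAMPLE_RATE / 1000
}

/// Drop everything before the end of the first long quiet run.
/// Returns the number of samples removed (0 means no-op).
fn trim_leading_cold_start(samples: &mut Vec<f32>) -> usize {
    let scan_start = ms_to_samples(SCAN_START_MS);
    let min_gap = ms_to_samples(MIN_GAP_MS);
    let max_drop = ms_to_samples(MAX_DROP_MS);

    let mut run_start: Option<usize> = None;
    // An empty range (buffer shorter than scan_start) simply skips the loop.
    for i in scan_start..samples.len() {
        let quiet = samples[i].abs() < SILENCE_THRESHOLD;
        match (quiet, run_start) {
            (true, None) => run_start = Some(i),
            (false, Some(start)) => {
                if i - start >= min_gap {
                    // `i` is the gap end: first loud sample after the run.
                    if i > max_drop {
                        return 0; // too far in: leave the buffer alone
                    }
                    samples.drain(..i);
                    return i;
                }
                run_start = None; // run was too short; keep scanning
            }
            _ => {}
        }
    }
    0 // no qualifying gap followed by speech
}
```

Every failure path returns without touching the buffer, which is the "bail out and emit the raw buffer" behaviour the commit describes.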
Max noted that prod_probe's header advertised _sac variants as 'the new
path' but the WAVs written were actually the raw engine output, not the
post-trim audio that synth_chunk returns to tts.rs. Anyone listening to
those files would have heard the sacrificial breath at the start, which
is misleading for ear-testing the fix.
Mirror trim_leading_cold_start (and its constants) inline in the probe.
For short prompts with a sacrificial prefix, write both files:
/tmp/prod_<label>_s<seed>_raw.wav (raw engine output)
/tmp/prod_<label>_s<seed>_trimmed.wav (what production actually plays)
Long prompts (no sacrificial, no trim) only get the _raw variant since
that's what synth_chunk returns for them in production. Header rewritten
to match.
Sample data after the change:
imhappy_sac_s99999: raw 472ms (gap 50..144ms) → trim 328ms
yep_sac_s42: raw 270ms (gap 30..141ms) → trim 129ms
imhappy_sac_s314159: raw 730ms (gap 43..339ms) → trim 391ms
(Trim length == raw_len - gap_end_ms, matching expectations.)
The inline trim is a deliberate copy of huddle::pocket: the example sits
in desktop/src-tauri/examples which can't reach into the private huddle
module. Comment at the top of the constants block flags the 'keep in
sync' contract.
All 330 cargo test --lib still pass; file-sizes still green.
Non-blocking cleanup from Max's review of 61d064d.
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
This was referenced May 28, 2026
What
Swaps the local huddle TTS from Kokoro (loaded via `ort`) to Pocket TTS (Kyutai Labs, Apache-2.0) running through the same `sherpa-onnx` static library that already powers STT. This is a Kokoro→Pocket model swap, but the user-visible win is the removal of the `ort` direct dependency tree and its runtime `libonnxruntime.dylib`, the dylib that was the source of the gatekeeper-quarantine "damaged app" failures on older Macs.
Why this is the headline change
Before:
- `ort` (Rust crate) dlopen'd a prebuilt onnxruntime dylib at runtime.
- `sherpa-onnx-sys` linked a second onnxruntime statically into the same binary.
- `libonnxruntime.dylib` shipped in `Sprout.app/Contents/Frameworks/`, ad-hoc-signed by the Rust crate's build script, periodically quarantined by Gatekeeper on first launch.
After:
- No `*.dylib` in the macOS bundle. Verified with `find … '*onnx*' -o '*.dylib'` and `otool -L Contents/MacOS/Sprout` on a full unsigned `tauri build`.
- `cargo tree -i ort` / `-i ort-sys` / `-i ndarray` → "not found."
Tradeoff: CoreML
We lose access to CoreML acceleration. The upstream k2-fsa static archive doesn't ship a CoreML-enabled onnxruntime, and building one ourselves would re-introduce the same dylib distribution + signing pain we're deleting. Bench on M1 8GB: ~5× realtime warm synthesis (~680 ms for a 75-char sentence) and ~289 ms one-time engine load.
(No measured Kokoro baseline to compare against: the previous "2.6× realtime / 200ms TTFA" numbers in the codebase turned out to be an unverified doc-comment about batching strategy, not a benchmark. If anyone needs an apples-to-apples comparison, we can re-run Kokoro on the same rig as a follow-up.)
What's in the diff
- `huddle/pocket.rs` (new, 238 lines): sherpa-onnx Pocket TTS wrapper, mirrors `huddle/stt.rs` config pattern.
- `huddle/kokoro.rs` deleted (973 lines).
- `huddle/tts.rs`: engine import swap, voice path becomes `.wav`, new `normalize_for_playback(samples)` function (per-sentence peak normalization to −6 dBFS with `MAX_GAIN=8.0` to cap over-amplification of near-silent buffers; +5 unit tests).
- `huddle/models.rs`: Kokoro file list → Pocket bundle (5 ONNX + 2 JSON + voice WAV + LICENSE + MODEL_LICENSE.txt). Readiness fails closed without the attribution sidecar (+1 test).
- `Cargo.toml` / `Cargo.lock`: drops `ort` + `ndarray` direct deps; lockfile shrinks 334 lines.
- `kokoro` → `tts` across `mod.rs`, `pipeline.rs`, `lib.rs`, and the frontend `HuddleBar.tsx`. Status surface is now engine-agnostic.
- `desktop/scripts/check-file-sizes.mjs`: net −1 override entry: removes the now-deleted `kokoro.rs` budget (980); bumps `models.rs` (900→930) and `tts.rs` (1030→1130) to fit the normalize fn + new tests.
Net diff: 12 files, +700 / −1436 lines (most of it the kokoro.rs deletion).
Voice change
Bundled reference is now Kyutai's American-male sample (the previous voice was Kokoro's `af_heart`, American-female). Per-sentence peak normalization brings playback level to −6 dBFS regardless of the bundled-voice peak, with `MAX_GAIN=8.0` keeping near-silent buffers from being amplified to noise.
Verification
How to validate locally
- Join a huddle. Watch stderr for `TTS warmup completed in {N}ms`; expect ≤1500 ms on Apple Silicon with 8 GB+. >1500 ms suggests memory pressure / paging from concurrent STT+TTS init.
- Confirm no `OfflineRecognizer::create returned None` in stderr (STT load path).
Follow-ups (deliberately out of scope)
- `Creating a resampler: in=16000 out=24000` fires every synth call); upstream caching opportunity.
Co-investigation
This change was co-investigated with the Goosetown bot crew (Max, Mari, Perci, Quinn, Sami) in the original research/coordination thread. Reviews covered: bundle/signing verification end-to-end (Mari, Sami, Perci), TTS pipeline structural review (Perci, Max), per-sentence loudness normalization design (Quinn, Max), sidecar readiness test (Mari, Max), and PR-claim accuracy (Quinn).