chore(deps): update actions/upload-artifact digest to ea165f8 #70
Merged
wesbillman approved these changes Mar 16, 2026
tlongwell-block pushed a commit that referenced this pull request May 18, 2026
Pocket TTS' FlowLM has an autoregressive cold-start: the first 2-3 generation steps run without audio context in the KV cache, occasionally smearing or dropping the first phoneme of short utterances. Tyler reproduced this on 'I'm happy.' rendering as 'm happy', and on other 'I'm X' constructions across random seeds. The bug is documented upstream as kyutai-labs/pocket-tts #91 (8 comments, 2 collaborators acknowledged) and #70, with collateral discussion at sherpa-onnx #3180.

Earlier commits in this branch reduced but did not eliminate the failure: 773a2a1 added 8-space padding; 1dbfa2c restored sherpa's `frames_after_eos` default of 3 (fixing a separate static-burst regression); f570ec0 dropped the leading fade-in. Empirical study at production settings (silence_scale=0, frames_after_eos=3 default) confirmed that temperature, silence_scale, seed, and pad tweaks are all insufficient: the model's stochastic sampling lands on a bad trajectory often enough to be perceptible on short prompts.

This commit applies the upstream-documented sacrificial-word workaround (ikidd in kyutai-labs/pocket-tts #70) with two refinements:

1. Sacrificial prefix '. . ' (two periods + space) instead of a word. The pair was empirically the only variant in our probe that produced a usable post-sacrificial silence gap on every random seed in the 8-seed × 8-variant matrix (`sacrificial_probe`, iterated locally during investigation); a single period failed on seed=99999. Periods render as low-amplitude breath rather than spoken audio.

2. Post-synth trim: scan from t=30 ms looking for the first run of samples below 0.02 lasting >= 50 ms; that run is the sacrificial→main boundary. `Vec::drain` everything before the gap end. If no gap is found or the boundary lies beyond 1.2 s (production max-drop bound), bail out and emit the raw buffer rather than corrupt the audio.
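The trim rule in item 2 can be sketched roughly as follows. This is a minimal illustration and not the code in pocket.rs: the function name, the constant names, the use of absolute amplitude for the 0.02 comparison, and the sample-rate parameter are all assumptions made for the example.

```rust
// Hypothetical constants mirroring the numbers quoted in the commit message.
const SCAN_START_MS: usize = 30; // skip the first 30 ms before scanning
const MIN_GAP_MS: usize = 50; // a gap must last at least 50 ms
const MAX_DROP_MS: usize = 1200; // never drop more than 1.2 s
const SILENCE_THRESHOLD: f32 = 0.02; // assumed to apply to |sample|

/// Drops everything before the end of the first qualifying silence gap.
/// Returns `true` if the buffer was trimmed, `false` if it was left as-is.
fn trim_sacrificial_prefix(samples: &mut Vec<f32>, sample_rate: usize) -> bool {
    let scan_start = SCAN_START_MS * sample_rate / 1000;
    let min_gap = MIN_GAP_MS * sample_rate / 1000;
    let max_drop = MAX_DROP_MS * sample_rate / 1000;

    // Buffers shorter than the scan start are returned untouched.
    if samples.len() <= scan_start || min_gap == 0 {
        return false;
    }

    let mut run_start: Option<usize> = None;
    for i in scan_start..samples.len() {
        if samples[i].abs() < SILENCE_THRESHOLD {
            // Inside a quiet run: remember where it began.
            if run_start.is_none() {
                run_start = Some(i);
            }
        } else if let Some(start) = run_start.take() {
            // A quiet run just ended at index `i` (the gap end).
            if i - start >= min_gap {
                if i > max_drop {
                    // Boundary too late: bail out rather than risk cutting speech.
                    return false;
                }
                samples.drain(..i);
                return true;
            }
        }
    }
    // No gap followed by audio was found (assumption: trailing silence
    // alone does not count as a boundary), so the raw buffer is kept.
    false
}
```

Calling `trim_sacrificial_prefix(&mut buf, sample_rate)` on a buffer shaped like sacrificial audio, a 60 ms gap, then speech leaves only the speech; any buffer without a qualifying gap is returned unchanged, which is the "emit the raw buffer" fallback described above.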
The trim does not insert a zero lead-in, because tts.rs's existing FIRST_APPEND_LEAD_IN_SAMPLES already provides the OS-device warm-up cushion on the first append of an utterance, and subsequent sentences are buffered by INTER_SENTENCE_SILENCE.

Both the prefix and the trim are gated on PreparedPrompt::is_short (<= 4 words after preprocessing, matching upstream's pad_with_spaces_for_short_inputs predicate). Long prompts pass through unchanged: the first phoneme of a long utterance has enough downstream context to avoid the smear, and a natural early pause like the comma in 'Hello, how can I help you?' would otherwise be misdetected by the trimmer as the sacrificial gap (Max caught this in review, thanks).

Also: bump TARGET_PEAK in tts.rs from -6 dBFS (0.501) to -3 dBFS (0.708) per Tyler. This is a ceiling on per-sentence loudness normalization, not a floor: quieter Pocket utterances under MAX_GAIN=8 will still land below the ceiling (bench-typical peak 0.076 lands at 0.608, ~-4.3 dBFS). Comment updated to reflect that nuance.

Probe data (see examples/prod_probe.rs; production GenerationConfig: silence_scale=0.0, frames_after_eos default 3, max_frames=100 short). Tested 5 prompts × 5 seeds with the new code path:

- Short prompts ('I'm happy', 'I'm sorry', 'I'm ready', 'Yep', 'I see you') with sacrificial prefix: 25/25 produced a >=50ms silence gap in the 30-340ms range. Trim drops 47-339ms; final audio 270-748ms.
- Long prompts without sacrificial (regression check): 'Hello, how can I help you today?' and 'Yes, that works. Let me try again.' generate normally; comma pauses preserved.

Tyler ear-confirmed the trimmed short-prompt output:

> these are much better! I like this!

Max reviewed twice: first flagging a silence_scale mismatch between probe (silence_scale=1.0) and production (0.0), then flagging the destructive-edge hazard if trim ran on un-sacrificed long prompts.
Both review points are addressed: prod_probe mirrors production GenerationConfig exactly (silence_scale=0.0, no frames_after_eos override per 1dbfa2c), and the trim is gated on is_short with a 1.2s max-drop bound as belt-and-suspenders against the destructive edge case.

Tests added (in pocket.rs):

- prepare_prompt_inserts_sacrificial_prefix_only_for_short: pins the exact ordering (pad + '. . ' + cleaned).
- prepare_prompt_threshold_is_inclusive_at_four_words: extended to assert is_short and SACRIFICIAL_PREFIX absence on long input.
- trim_strips_sacrificial_and_keeps_only_speech: feed a synthetic sacrificial+gap+speech buffer; assert leading sample is speech.
- trim_is_noop_when_no_long_silence_gap_exists
- trim_is_noop_when_gap_is_shorter_than_threshold
- trim_is_noop_when_gap_is_beyond_max_drop_bound: guards the destructive-edge case Max flagged.
- trim_is_noop_on_buffer_smaller_than_scan_start: no panic.
- trim_constants_use_sane_units: pins millisecond meanings.

Tests added (in tts.rs):

- normalize_for_playback_clamps_at_max_gain_below_target: new behaviour under the -3 dBFS ceiling for bench-typical peaks.
- normalize_for_playback_hits_target_on_quiet_buffer: updated for the new MAX_GAIN saturation point (0.0885) on the input side.

All 330 `cargo test --lib` tests pass. `cargo fmt --check` and desktop/scripts/check-file-sizes.mjs are green. pocket.rs cap 620 → 900, tts.rs cap 1335 → 1380.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
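The TARGET_PEAK figures quoted in the commit message (0.076 landing at 0.608, and the 0.0885 saturation point) follow from a gain that is capped at MAX_GAIN. The sketch below reproduces that arithmetic only. `playback_gain` and `dbfs` are hypothetical helpers, and this `normalize_for_playback` is an assumed shape that borrows the name from the tests listed above; the real function in tts.rs may differ, for example in how it treats buffers that are already louder than the target.

```rust
// Values quoted in the commit message.
const TARGET_PEAK: f32 = 0.708; // about -3 dBFS
const MAX_GAIN: f32 = 8.0;

/// Gain that would bring `peak` up to TARGET_PEAK, capped at MAX_GAIN.
/// Hypothetical helper; a silent buffer gets unity gain.
fn playback_gain(peak: f32) -> f32 {
    if peak <= 0.0 {
        return 1.0;
    }
    (TARGET_PEAK / peak).min(MAX_GAIN)
}

/// Assumed shape of per-sentence normalization: scale by the capped gain.
fn normalize_for_playback(samples: &mut [f32]) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    let gain = playback_gain(peak);
    for s in samples.iter_mut() {
        *s *= gain;
    }
}

/// Linear amplitude to dBFS.
fn dbfs(amplitude: f32) -> f32 {
    20.0 * amplitude.log10()
}
```

Under these assumptions, inputs with a peak below TARGET_PEAK / MAX_GAIN = 0.0885 hit the gain cap and stay under the ceiling, while louder inputs land on the ceiling; that is why TARGET_PEAK acts as a ceiling rather than a floor.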
This PR contains the following updates:
6546280 → ea165f8

Configuration
📅 Schedule: Branch creation - Between 12:00 AM and 03:59 AM, only on Monday ( * 0-3 * * 1 ) (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.