
chore(deps): update actions/upload-artifact digest to ea165f8#70

Merged
wesbillman merged 1 commit into main from renovate/actions-upload-artifact-digest on Mar 16, 2026

@renovate renovate Bot commented Mar 16, 2026

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| actions/upload-artifact (changelog) | action | digest | 6546280 → ea165f8 |

Configuration

📅 Schedule: Branch creation - Between 12:00 AM and 03:59 AM, only on Monday ( * 0-3 * * 1 ) (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate Bot requested a review from wesbillman as a code owner March 16, 2026 01:19
@wesbillman wesbillman merged commit a0c4d1b into main Mar 16, 2026
8 checks passed
@wesbillman wesbillman deleted the renovate/actions-upload-artifact-digest branch March 16, 2026 15:17
tlongwell-block added a commit that referenced this pull request Mar 16, 2026
* origin/main:
  feat: agent users:write scope + system messages in chat (#73)
  chore(deps): update swatinem/rust-cache digest to e18b497 (#71)
  chore(deps): update actions/upload-artifact digest to ea165f8 (#70)
  feat: NIP-50 search, NIP-10 threads, NIP-17 DMs, Sprout DM discovery (#74)
tlongwell-block pushed a commit that referenced this pull request May 18, 2026
Pocket TTS' FlowLM has an autoregressive cold-start: the first 2-3
generation steps run without audio context in the KV cache, occasionally
smearing or dropping the first phoneme of short utterances. Tyler
reproduced this on 'I'm happy.' rendering as 'm happy', and on other
'I'm X' constructions across random seeds. The bug is documented
upstream as kyutai-labs/pocket-tts #91 (8 comments, 2 collaborators
acknowledged) and #70, with collateral discussion at sherpa-onnx #3180.

Earlier commits in this branch reduced but did not eliminate the
failure: 773a2a1 added 8-space padding; 1dbfa2c restored sherpa's
`frames_after_eos` default of 3 (fixing a separate static-burst
regression); f570ec0 dropped the leading fade-in. Empirical study at
production settings (silence_scale=0, frames_after_eos=3 default)
confirmed that temperature, silence_scale, seed, and pad tweaks are
all insufficient — the model's stochastic sampling lands on a bad
trajectory often enough to be perceptible on short prompts.

This commit applies the upstream-documented sacrificial-word workaround
(ikidd in kyutai-labs/pocket-tts #70) with two refinements:

  1. Sacrificial prefix '. . ' (two periods + space) instead of a word.
     The pair was empirically the only variant in our probe that
     produced a usable post-sacrificial silence gap on every random
     seed in the 8-seed × 8-variant matrix (`sacrificial_probe`,
     iterated locally during investigation); a single period failed on
     seed=99999. Periods render as low-amplitude breath rather than
     spoken audio.

  2. Post-synth trim: scan from t=30ms looking for the first run of
     samples below 0.02 lasting >= 50 ms — that's the sacrificial→main
     boundary. `Vec::drain` everything before the gap-end. If no gap
     is found or the boundary lies beyond 1.2 s (production max-drop
     bound), bail out and emit the raw buffer rather than corrupt the
     audio. We don't insert a zero lead-in here because tts.rs's
     existing FIRST_APPEND_LEAD_IN_SAMPLES already provides the
     OS-device warm-up cushion on the first append of an utterance,
     and subsequent sentences are buffered by INTER_SENTENCE_SILENCE.
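
For illustration only, the trim described in item 2 could look roughly like the sketch below. This is not the code in this commit: the constant and function names are hypothetical, and only the numeric values (30 ms scan start, 0.02 threshold, 50 ms minimum gap, 1.2 s max-drop bound) come from the description above.

```rust
// Hypothetical names; values taken from the commit description.
const SCAN_START_MS: usize = 30;
const SILENCE_THRESHOLD: f32 = 0.02;
const MIN_GAP_MS: usize = 50;
const MAX_DROP_MS: usize = 1200;

fn ms_to_samples(ms: usize, sample_rate: usize) -> usize {
    ms * sample_rate / 1000
}

/// Drops everything before the end of the first qualifying silence gap.
/// Returns the number of samples removed; 0 means the buffer was left as-is.
fn trim_sacrificial(samples: &mut Vec<f32>, sample_rate: usize) -> usize {
    let scan_start = ms_to_samples(SCAN_START_MS, sample_rate);
    let min_gap = ms_to_samples(MIN_GAP_MS, sample_rate);
    let max_drop = ms_to_samples(MAX_DROP_MS, sample_rate);

    // Buffer shorter than the scan start: nothing to do, and no panic.
    if samples.len() <= scan_start {
        return 0;
    }

    let mut run_start: Option<usize> = None;
    for i in scan_start..samples.len() {
        if samples[i].abs() < SILENCE_THRESHOLD {
            // Inside a quiet run; remember where it began.
            if run_start.is_none() {
                run_start = Some(i);
            }
        } else if let Some(start) = run_start.take() {
            // A quiet run just ended at `i`; is it long enough to be the gap?
            if i - start >= min_gap {
                if i > max_drop {
                    // Boundary too late: emit the raw buffer instead.
                    return 0;
                }
                samples.drain(..i);
                return i;
            }
        }
    }
    // No qualifying gap followed by audio was found.
    0
}
```

In this sketch a quiet run that reaches the end of the buffer is treated as "no gap found", since there would be no speech left to keep.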

Both the prefix and the trim are gated on PreparedPrompt::is_short
(<= 4 words after preprocessing, matches upstream's
pad_with_spaces_for_short_inputs predicate). Long prompts pass through
unchanged: the first phoneme of a long utterance has enough downstream
context to avoid the smear, and a natural early pause like the comma in
'Hello, how can I help you?' would otherwise be misdetected by the
trimmer as the sacrificial gap (Max caught this in review — thanks).
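
A minimal sketch of the gating, for readers who want the shape without opening pocket.rs. The struct fields, the `SHORT_PAD` and `SHORT_WORD_LIMIT` names, and the whitespace-based word count are assumptions for illustration; the ordering (pad + '. . ' + cleaned), the 8-space pad, and the <= 4 word threshold are from this message.

```rust
// SACRIFICIAL_PREFIX is named in the tests below; the other names are hypothetical.
const SACRIFICIAL_PREFIX: &str = ". . ";
const SHORT_PAD: &str = "        "; // 8 spaces, as added in 773a2a1
const SHORT_WORD_LIMIT: usize = 4;

struct PreparedPrompt {
    text: String,
    is_short: bool,
}

/// `cleaned` is assumed to be the prompt after preprocessing.
fn prepare_prompt(cleaned: &str) -> PreparedPrompt {
    // Assumption: words are counted by splitting on whitespace.
    let is_short = cleaned.split_whitespace().count() <= SHORT_WORD_LIMIT;
    let text = if is_short {
        // Ordering pinned by the tests: pad + prefix + cleaned text.
        format!("{}{}{}", SHORT_PAD, SACRIFICIAL_PREFIX, cleaned)
    } else {
        // Long prompts pass through unchanged.
        cleaned.to_string()
    };
    PreparedPrompt { text, is_short }
}
```

The caller would then run the post-synth trim only when `is_short` is true, so a natural comma pause in a long prompt is never mistaken for the sacrificial gap.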

Also: bump TARGET_PEAK in tts.rs from -6 dBFS (0.501) to -3 dBFS
(0.708) per Tyler. This is a ceiling on per-sentence loudness
normalization, not a floor — quieter Pocket utterances under MAX_GAIN=8
will still land below the ceiling (bench-typical peak 0.076 lands at
0.608, ~-4.3 dBFS). Comment updated to reflect that nuance.
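
The ceiling arithmetic can be checked with a small sketch. The function names are hypothetical and the gain rule (target over peak, clamped to MAX_GAIN) is an assumption about how tts.rs normalizes; the constants are the ones quoted above.

```rust
const TARGET_PEAK: f32 = 0.708; // -3 dBFS ceiling
const MAX_GAIN: f32 = 8.0;

/// Gain that would bring `peak` up to TARGET_PEAK, capped at MAX_GAIN.
/// Assumed rule; silent buffers are left at unity gain.
fn playback_gain(peak: f32) -> f32 {
    if peak <= 0.0 {
        return 1.0;
    }
    (TARGET_PEAK / peak).min(MAX_GAIN)
}

/// Linear amplitude to dBFS.
fn dbfs(amplitude: f32) -> f32 {
    20.0 * amplitude.log10()
}
```

With a bench-typical peak of 0.076 the gain saturates at 8, giving 0.076 × 8 = 0.608, or about -4.3 dBFS. The ceiling is only reached once the input peak is at least 0.708 / 8 ≈ 0.0885.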

Probe data (see examples/prod_probe.rs; production GenerationConfig:
silence_scale=0.0, frames_after_eos default 3, max_frames=100 short).
Tested 5 prompts × 5 seeds with the new code path:

  Short prompts ('I'm happy', 'I'm sorry', 'I'm ready', 'Yep',
  'I see you') with sacrificial prefix:
    25/25 produced a >=50ms silence gap in the 30-340ms range.
    Trim drops 47-339ms; final audio 270-748ms.

  Long prompts without sacrificial (regression check):
    'Hello, how can I help you today?' and 'Yes, that works. Let me
    try again.' generate normally; comma pauses preserved.

Tyler ear-confirmed the trimmed short-prompt output:
  > these are much better! I like this!

Max reviewed twice — first flagging a silence_scale mismatch between
probe (silence_scale=1.0) and production (0.0), then flagging the
destructive-edge hazard if trim ran on un-sacrificed long prompts.
Both are addressed: prod_probe mirrors production GenerationConfig
exactly (silence_scale=0.0, no frames_after_eos override per 1dbfa2c),
and the trim is gated on is_short with a 1.2s max-drop bound as
belt-and-suspenders against the destructive edge case.

Tests added (in pocket.rs):
  - prepare_prompt_inserts_sacrificial_prefix_only_for_short:
    pins the exact ordering (pad + '. . ' + cleaned).
  - prepare_prompt_threshold_is_inclusive_at_four_words extended to
    assert is_short and SACRIFICIAL_PREFIX absence on long input.
  - trim_strips_sacrificial_and_keeps_only_speech: feed a synthetic
    sacrificial+gap+speech buffer; assert leading sample is speech.
  - trim_is_noop_when_no_long_silence_gap_exists
  - trim_is_noop_when_gap_is_shorter_than_threshold
  - trim_is_noop_when_gap_is_beyond_max_drop_bound: guards the
    destructive-edge case Max flagged.
  - trim_is_noop_on_buffer_smaller_than_scan_start: no panic.
  - trim_constants_use_sane_units: pins millisecond meanings.

Tests added (in tts.rs):
  - normalize_for_playback_clamps_at_max_gain_below_target: new
    behaviour under the -3 dBFS ceiling for bench-typical peaks.
  - normalize_for_playback_hits_target_on_quiet_buffer updated for
    new MAX_GAIN saturation point (0.0885) on the input side.

All 330 `cargo test --lib` tests pass. `cargo fmt --check` and
desktop/scripts/check-file-sizes.mjs are green. File-size caps raised:
pocket.rs 620 → 900, tts.rs 1335 → 1380.

Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: npub1cc3ha7z055mu0rwwu7806t2wt8mj3pvu0uv5mfp2c50dahaqhczshdalg6 <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>