fix: remove kv padding from flash attention wrapper #1453
|
I've just tested the changes and they still don't fix the issue. It happens not only on short prompts but at any prompt length when using flash attention on Vulkan with the Ernie and Anima models. |
|
I’ve tried to fix it, and it’s working properly on my device now. @daniandtheweb Could you pull the latest commit and give it another try? Also, don’t forget to sync the ggml submodule. |
|
Does the simplest txt2img pipeline—like the one below—also cause issues on your side? |
This specific command works as expected. Here's the simplest reproduction of the issue that I've been able to achieve so far: This works: This doesn't work: But removing flash attention makes it work at 1024x1024 as well: |
|
@daniandtheweb I fixed this issue on the master branch after updating to the latest ggml. Could you try again? |
|
On my end it still breaks on this same command: Without |
|
I can’t reproduce this issue with the Vulkan backend on my device, which makes it difficult to investigate. This change does provide optimizations on certain paths, so I’ll go ahead and merge this PR for now. |
Brings the upstream src-layout reorg (leejet#1615 model/, core/, conditioning/, runtime/, extensions/), the new offload path (leejet#1601 pinned host buffer, leejet#1576 --stream-layers), vram-limit propagation (leejet#1583), APG/PiD/ideogram4, and the photomaker->generation-extension move (leejet#1618).

Conflict resolution (4 files):
- model.h / stable-diffusion.cpp: union the fork's LONGCAT_AVATAR version with upstream's PiD/Ideogram4; keep the avatar deferred-DiT-load + per-frame timestep zeroing, adopt upstream's alloc error-checks + generation-extensions alloc loop (pmid is now an extension); keep whisper-encoder alloc.
- conditioner.hpp: keep both set_keep_params_resident + set_stream_layers_enabled.
- ggml_extend.hpp: keep the fork's coherent offload system (lap-32 pinned alloc, lap-32.2 H2D pipelining, partial/all-param restore, umT5 free-then-reload null fix, lap-28 F16-KV/mask attention) and fold upstream's persistent_externals snapshot + observed_max_effective_budget reset alongside; flash_skip_kv_pad opt-out coexists with upstream leejet#1453's unconditional kv-pad removal.
- Repointed fork-only headers (longcat_avatar/audio, nava, nava example) at the new nested include paths.
- ggml_backend_is_cpu -> sd_backend_is_cpu (upstream leejet#1591 dynamic CPU backend), 7 call sites in ggml_extend.hpp + stable-diffusion.cpp.
- re-declare int kv_pad in ggml_ext_attention_ext (upstream leejet#1453 dropped it; the fork keeps the opt-out pad path via flash_skip_kv_pad, capturing kv_pad by-ref).
- nava example: ggml_backend_cpu_init -> sd_backend_cpu_init, and include model/vae/wan_vae.hpp (WanVAERunner extracted from wan.hpp upstream leejet#1614).

Builds clean: sd-cli + sd-server + nava all green.

Most backends already handle non-256 KV lengths internally or fall back via backend support checks. Avoid generating synthetic padding masks, which can trigger incorrect Vulkan flash attention output for short prompt lengths.
Fix #1431.
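
For readers unfamiliar with what the removed path did, here is a minimal sketch of KV padding in general, not the actual code from this PR. It assumes the wrapper rounded the KV length up to a multiple of 256 and masked the extra positions with an additive `-inf` mask. The helper names `kv_pad_for` and `make_pad_mask`, and the mask layout, are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: extra KV positions needed to round n_kv up to a
// multiple of `align` (256 is the alignment named in the PR description).
inline std::size_t kv_pad_for(std::size_t n_kv, std::size_t align = 256) {
    assert(align > 0);
    return (align - n_kv % align) % align;
}

// Hypothetical helper: the synthetic additive mask that such padding needs.
// Row-major layout [n_q][n_kv + pad]: 0 for real keys, -inf for padded keys,
// so padded positions receive zero weight after softmax.
// ASSUMPTION: the real wrapper's mask dtype and layout may differ.
inline std::vector<float> make_pad_mask(std::size_t n_q,
                                        std::size_t n_kv,
                                        std::size_t align = 256) {
    const std::size_t padded = n_kv + kv_pad_for(n_kv, align);
    std::vector<float> mask(n_q * padded, 0.0f);
    for (std::size_t q = 0; q < n_q; ++q) {
        for (std::size_t k = n_kv; k < padded; ++k) {
            mask[q * padded + k] = -std::numeric_limits<float>::infinity();
        }
    }
    return mask;
}
```

Under these assumptions, a 77-token prompt would be padded by 179 positions to 256, and every query row would carry 179 masked entries that exist only because of the padding. Dropping the padding removes both the extra KV positions and the synthetic mask.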