fix: remove kv padding from flash attention wrapper by leejet · Pull Request #1453 · leejet/stable-diffusion.cpp

leejet · 2026-04-22T18:00:34Z

Most backends already handle non-256 KV lengths internally or fall back via backend support checks. Avoid generating synthetic padding masks, which can trigger incorrect Vulkan flash attention output for short prompt lengths.
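For readers unfamiliar with the padding being removed, here is a minimal sketch of what rounding the KV length up to a multiple of 256 and building a synthetic padding mask could look like. The helper names `kv_pad_amount` and `make_kv_pad_mask`, the row-major mask layout, and the use of negative infinity for padded positions are all assumptions made for illustration; this is not the wrapper's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: number of extra KV positions needed to round kv_len
// up to the next multiple of `multiple` (256 is the value named in the PR text).
inline int64_t kv_pad_amount(int64_t kv_len, int64_t multiple = 256) {
    assert(kv_len >= 0 && multiple > 0);
    return (multiple - kv_len % multiple) % multiple;
}

// Hypothetical helper: additive attention mask of shape [q_len, kv_len + pad],
// stored row-major. Real KV positions get 0, padded positions get -inf so that
// softmax assigns them zero weight. The layout is an assumption of this sketch.
inline std::vector<float> make_kv_pad_mask(int64_t q_len, int64_t kv_len, int64_t multiple = 256) {
    const int64_t pad    = kv_pad_amount(kv_len, multiple);
    const int64_t padded = kv_len + pad;
    std::vector<float> mask(static_cast<std::size_t>(q_len * padded), 0.0f);
    for (int64_t q = 0; q < q_len; ++q) {
        for (int64_t k = kv_len; k < padded; ++k) {
            mask[static_cast<std::size_t>(q * padded + k)] = -std::numeric_limits<float>::infinity();
        }
    }
    return mask;
}
```

With this sketch, a KV length of 77 would be padded by 179 positions to reach 256. Removing the padding means no such mask is generated, and the backend's own handling of arbitrary KV lengths (or its support-check fallback) is relied on instead.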

Fix #1431.

daniandtheweb · 2026-04-23T00:08:39Z

I've just tested the changes, and they still don't fix the issue. It is not limited to short prompts: it happens at any prompt length when using flash attention on Vulkan with the Ernie and Anima models.

leejet · 2026-04-23T14:08:43Z

I’ve tried to fix it, and it’s working properly on my device now. @daniandtheweb Could you pull the latest commit and give it another try? Also, don’t forget to sync the ggml submodule.

git submodule sync --recursive
git submodule update --init --recursive --force

leejet · 2026-04-23T14:16:22Z

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\ernie-image-UD-Q4_K_M.gguf --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\ministral-3-3b.safetensors -p "a lovely cat" --cfg-scale 5.0 -v --offload-to-cpu --diffusion-fa

before ggml update

after ggml update

daniandtheweb · 2026-04-23T14:26:49Z

I've done a clean build using this branch and the issue is still there. My current prompt is taken from Civitai:

./sd-cli -M img_gen -p "year 2023, year 2024, year 2025, highres,masterpiece, best quality, score_7, score_8, score_9, @miclot, safe, a group of five anime girls and a small dog posing for a selfie in a snowy landscape, the girl in the foreground has long pink hair and purple eyes, wearing a teal beanie with white stripes and a white puffer jacket, smiling widely and making a peace sign with her left hand, the girl to the left has short brown hair and glasses, wearing a maroon beanie and a light blue jacket, raising her right hand in a peace sign, the girl in the center has black hair and brown eyes, wearing a black beanie and a dark jacket, looking at the camera, the girl to the right has short purple hair and purple eyes, wearing a striped scarf and a green jacket, looking at the camera, the girl in the back has blonde hair and green eyes, wearing a white beanie and a yellow jacket, making a peace sign with her left hand, the dog has brown and white fur, sticking its tongue out, the background features a clear blue sky and snow-covered mountains, the scene is bright and sunny with natural lighting, the colors are vibrant with a mix of cool and warm tones, the composition is a close-up shot with the characters filling most of the frame, the focus is on the group's happy expressions and the snowy environment. 
<lora:anima-turbo-lora-v0.1:1>" -n "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, sepia,watermark, mosaic censoring, bar censor," --sampling-method euler --steps 8 -W 1024 -H 1024 -b 1 --cfg-scale 1 -s 1327417454 --clip-skip -1 --embd-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/embeddings/ --lora-model-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/loras/ -t 0 --rng cuda --sampler-rng cuda --lora-apply-mode auto -o /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1765151592.png --diffusion-model /home/daniandtheweb/Workspace/sd.cpp-webui/models/unet/anima-preview3-base.safetensors --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/qwen_image_vae.safetensors --llm /home/daniandtheweb/Workspace/sd.cpp-webui/models/text_encoders/qwen_3_06b_base.safetensors --scheduler simple --vae-tile-overlap 0.5 --vae-tile-size 32x32 --preview proj --preview-path /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1765151592_preview.png --preview-interval 1 --vae-tiling --fa --vae-conv-direct --mmap --color

Here's the progression of the preview, in case it helps solve the issue:

[Table of preview images: 1 step through 8 steps]

Without flash attention the resulting image comes out just fine:

The same issue still remains on Ernie on my end.

This was tested on Linux with a Radeon RX 7800 XT, using both the official Mesa drivers and the git ones. I also tried disabling cooperative matrix and integer dot product acceleration for Vulkan, but the result is the same: with flash attention, generation breaks down.

leejet · 2026-04-23T14:54:17Z

Does the simplest txt2img pipeline, like the one below, also cause issues on your side?

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\anima-preview.safetensors --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_06b_base.safetensors  -p "a lovely cat holding a sign says 'anima.cpp'" --cfg-scale 6.0 --sampling-method euler -v --offload-to-cpu

daniandtheweb · 2026-04-23T15:50:04Z

> Does the simplest txt2img pipeline, like the one below, also cause issues on your side?
>
> .\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\anima-preview.safetensors --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_06b_base.safetensors  -p "a lovely cat holding a sign says 'anima.cpp'" --cfg-scale 6.0 --sampling-method euler -v --offload-to-cpu

This specific command works as expected.

Here's the simplest reproduction of the issue that I've managed so far:

This works:

./sd-cli -p "a cat" --sampling-method euler --steps 8 --cfg-scale 1 --diffusion-model /home/daniandtheweb/Workspace/sd.cpp-webui/models/unet/ernie-image-turbo-Q8_0.gguf --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/flux2-vae.safetensors --llm /home/daniandtheweb/Workspace/sd.cpp-webui/models/text_encoders/Ministral-3-3B-Instruct-2512-UD-Q8_K_XL.gguf --fa --vae-conv-direct

This doesn't work:

./sd-cli -p "a cat" --sampling-method euler --steps 8 -W 1024 -H 1024 --cfg-scale 1 --diffusion-model /home/daniandtheweb/Workspace/sd.cpp-webui/models/unet/ernie-image-turbo-Q8_0.gguf --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/flux2-vae.safetensors --llm /home/daniandtheweb/Workspace/sd.cpp-webui/models/text_encoders/Ministral-3-3B-Instruct-2512-UD-Q8_K_XL.gguf --fa --vae-conv-direct

But removing flash attention makes it work at 1024x1024 as well:

./sd-cli -p "a cat" --sampling-method euler --steps 8 -W 1024 -H 1024 --cfg-scale 1 --diffusion-model /home/daniandtheweb/Workspace/sd.cpp-webui/models/unet/ernie-image-turbo-Q8_0.gguf --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/flux2-vae.safetensors --llm /home/daniandtheweb/Workspace/sd.cpp-webui/models/text_encoders/Ministral-3-3B-Instruct-2512-UD-Q8_K_XL.gguf --vae-conv-direct

leejet · 2026-05-18T17:31:20Z

@daniandtheweb I fixed this issue on the master branch after updating to the latest ggml. Could you try again?

daniandtheweb · 2026-05-18T17:52:12Z

On my end it still breaks on this same command:

./sd-cli -p "a cat" --sampling-method euler --steps 8 -W 1024 -H 1024 --cfg-scale 1 --diffusion-model /home/daniandtheweb/Workspace/sd.cpp-webui/models/unet/ernie-image-turbo-Q8_0.gguf --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/flux2-vae.safetensors --llm /home/daniandtheweb/Workspace/sd.cpp-webui/models/text_encoders/Ministral-3-3B-Instruct-2512-UD-Q8_K_XL.gguf --fa --vae-conv-direct

stable-diffusion.cpp version master-625-f683c88-4-gcaa823a, commit caa823a

Without --fa it works. The issue appears only above a certain resolution: at 512x512 it works just fine with --fa.

leejet · 2026-05-31T15:22:31Z

I can’t reproduce this issue with the Vulkan backend on my device, which makes it difficult to investigate. This change does provide optimizations on certain paths, so I’ll go ahead and merge this PR for now.

Brings the upstream src-layout reorg (leejet#1615 model/, core/, conditioning/, runtime/, extensions/), the new offload path (leejet#1601 pinned host buffer, leejet#1576 --stream-layers), vram-limit propagation (leejet#1583), APG/PiD/ideogram4, and the photomaker->generation-extension move (leejet#1618).

Conflict resolution (4 files):
- model.h / stable-diffusion.cpp: union the fork's LONGCAT_AVATAR version with upstream's PiD/Ideogram4; keep the avatar deferred-DiT-load + per-frame timestep zeroing, adopt upstream's alloc error-checks + generation-extensions alloc loop (pmid is now an extension); keep whisper-encoder alloc.
- conditioner.hpp: keep both set_keep_params_resident + set_stream_layers_enabled.
- ggml_extend.hpp: keep the fork's coherent offload system (lap-32 pinned alloc, lap-32.2 H2D pipelining, partial/all-param restore, umT5 free-then-reload null fix, lap-28 F16-KV/mask attention) and fold upstream's persistent_externals snapshot + observed_max_effective_budget reset alongside; flash_skip_kv_pad opt-out coexists with upstream leejet#1453's unconditional kv-pad removal.
- Repointed fork-only headers (longcat_avatar/audio, nava, nava example) at the new nested include paths.

- ggml_backend_is_cpu -> sd_backend_is_cpu (upstream leejet#1591 dynamic CPU backend), 7 call sites in ggml_extend.hpp + stable-diffusion.cpp.
- re-declare int kv_pad in ggml_ext_attention_ext (upstream leejet#1453 dropped it; the fork keeps the opt-out pad path via flash_skip_kv_pad, capturing kv_pad by-ref).
- nava example: ggml_backend_cpu_init -> sd_backend_cpu_init, and include model/vae/wan_vae.hpp (WanVAERunner extracted from wan.hpp upstream leejet#1614).

Builds clean: sd-cli + sd-server + nava all green.
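The fork's notes above describe keeping an opt-out pad path alongside upstream's unconditional removal. A rough sketch of how such a switch might be structured follows. Only the names `flash_skip_kv_pad` and `kv_pad` come from the commit message; the `KvPadPlan` type, the `plan_kv_pad` function, the multiple of 256, and the assumption that `true` means "do not pad" are all hypothetical and may not match the fork's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch; not a type from the fork.
struct KvPadPlan {
    int     kv_pad;      // number of padded KV positions (0 when padding is skipped)
    int64_t padded_len;  // kv_len + kv_pad
};

// Hypothetical helper. `flash_skip_kv_pad` is the flag name from the commit
// message; its semantics here (true = no padding) are an assumption.
inline KvPadPlan plan_kv_pad(int64_t kv_len, bool flash_skip_kv_pad, int64_t multiple = 256) {
    assert(kv_len >= 0 && multiple > 0);
    int kv_pad = 0;  // declared in the enclosing scope so the lambda can update it

    // Captures kv_pad by reference, mirroring the "capturing kv_pad by-ref" note.
    auto apply_pad = [&kv_pad, kv_len, multiple]() {
        kv_pad = static_cast<int>((multiple - kv_len % multiple) % multiple);
    };

    if (!flash_skip_kv_pad) {
        apply_pad();
    }
    return {kv_pad, kv_len + kv_pad};
}
```

Under these assumptions, `plan_kv_pad(300, false)` pads a KV length of 300 by 212 positions up to 512, while `plan_kv_pad(300, true)` leaves it at 300, which is the behavior upstream now applies unconditionally.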

fix: remove kv padding from flash attention wrapper

a5dde30

update ggml

a7c56d3

fix: install SPIR-V headers for Vulkan builds

4ef76b6

wbruna mentioned this pull request May 31, 2026

Wan 2.2 I2V: OOM on Unified Memory (Jetson/Grace) — VAE buffer not freed before diffusion #1587

Open

Merge branch 'master' into remove-kv-pad-for-flash-attn

a6daeb9

leejet merged commit 20901f6 into master May 31, 2026
14 checks passed

leejet deleted the remove-kv-pad-for-flash-attn branch May 31, 2026 17:45

wbruna mentioned this pull request May 31, 2026

[Bug] Vulkan Flash Attention not working #1431

Closed

leejet merged 4 commits into master from remove-kv-pad-for-flash-attn
