Finally, temporal tiling 🥳

Looks like @stduhpf's https://huggingface.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small would be ideal here. @leejet your example command contains both

Does it support LoRA loading?
Of course it does.

Audio decoding is messed up (it sounds sped up and high-pitched, and it repeats twice). Other than that it works nicely so far. Great job as always @leejet!
Edit: I believe the audio issue might just be due to the arrangement of stereo samples in the webm file (planar vs. interleaved).
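For context on the planar vs. interleaved hypothesis: if a stereo buffer is stored planar (all left samples, then all right samples) but is read as interleaved, each output channel gets every other sample, which would sound sped up, high-pitched, and played twice. Below is a minimal sketch of the conversion, assuming float samples in a flat buffer. `planar_to_interleaved` is a hypothetical helper for illustration, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): convert planar audio
// (all samples of channel 0, then all samples of channel 1, ...)
// into interleaved frames (L0 R0 L1 R1 ...).
std::vector<float> planar_to_interleaved(const std::vector<float>& planar,
                                         std::size_t channels) {
    // The buffer must hold a whole number of frames.
    assert(channels > 0 && planar.size() % channels == 0);
    const std::size_t frames = planar.size() / channels;

    std::vector<float> interleaved(planar.size());
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t f = 0; f < frames; ++f) {
            // Sample f of channel c moves to slot c of frame f.
            interleaved[f * channels + c] = planar[c * frames + f];
        }
    }
    return interleaved;
}
```

With two channels and the planar input `{1, 2, 3, 10, 20, 30}`, the result is `{1, 10, 2, 20, 3, 30}`.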
This comment was marked as resolved.

Very nice to see this is being worked on, LTX 2.3 is highly anticipated! Tangentially, not sure if you've seen this (likely vibecoded) ltx.cpp implementation, https://github.com/audiohacking/ltx.cpp, which is also ggml-based. It's probably not too helpful, but it might be a useful reference, seeing as they also use the same ComfyUI GGUFs.
I hadn’t looked at this project before, but it does not seem to be a complete implementation. Its documentation says that LTX 2.3 uses T5-XXL, which I don’t quite understand.
13 new upstream commits since previous sync at 0b82969.

The big one is leejet#1500 (module backend assignment): ~1.5k LOC churn that splits backend code into a new ggml_extend_backend.{h,cpp} pair and replaces every runner's (backend_t backend, bool offload_params_to_cpu) constructor arg with (backend_t runtime, backend_t params). New CLI flags --backend te=cpu,vae=cuda0,... and --params-backend te=cpu,vae=cpu,...

Other notable upstream changes folded in:
  3633072 module backend assignment (leejet#1500)
  38b14ad --max-vram -1 auto-detect (leejet#1498)
  67dda3f LTX 2.3 architecture (leejet#1463)
  06accf2 LTXAV latent2rgb projection
  9d68341 Euler/DDIM unification (leejet#1474)
  cde20d5 stereo handling in sd_audio
  d7ecbe1 T5 EOS dedup in Anima
  bd17f53 / 0c1ca17 / 839f6a9 / 3b4d26f ROCm/docs/CI
  db08b84 GCC 16 build fix
  686856e fake-VAE log demotion
  0b82969 / 381e0df PR template + CONTRIBUTING.md

Conflicts:
- examples/common/common.cpp, include/stable-diffusion.h: kept our offload_config alongside upstream's new backend/params_backend strings. sd_ctx_params_t now carries both axes.
- src/lora.hpp: dropped our enable_offload bool. The new params_backend argument expresses the same intent (CPU = offload).
- src/hidream_o1.hpp: kept params_prefix member, switched constructor to upstream's (backend, params_backend) signature.
- src/stable-diffusion.cpp: every runner-construction site took upstream's backend_for(MODULE) / params_backend_for(MODULE) lookups. Removed the dead cond_stage/diffusion/vae_offload_to_cpu local-bool derivation; replaced with calls to a new SDBackendManager::force_module_params_backend(MODULE, "cpu") helper that mutates params_assignment_ after init_backend() runs. The offload_config-driven escalations now land in the same data structure upstream's --params-backend writes to.

Post-merge fixups surfaced by retesting HiDream O1 streaming:
- src/llm.hpp: TextModel.forward_final_norm now casts to LLMRMSNorm, not RMSNorm. Upstream changed the "norm" block's concrete type; our pre-merge cast returned nullptr and crashed on first forward().
- src/hidream_o1.hpp: Stage 1 of compute_streaming_true scales inputs_embeds by sqrt(hidden_size) when params.llm.normalize_input, matching what forward_embeds does. No-op for HiDream O1 today but keeps the streaming path drift-free if a future arch flips it.

Smoke-tested on 12 GB GPU:
  Z-Image-Turbo Q8 layer_streaming -> 4.32 s
  HiDream O1 bf16 dev layer_streaming -> 17.44 s (4 steps, 1024x1024)
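To illustrate the shape of the new `--backend te=cpu,vae=cuda0,...` style arguments, here is a minimal sketch of how such a `module=backend` list could be parsed into a lookup table. This is not the upstream parser: `parse_backend_spec` and `lookup_backend` are hypothetical names, and the handling of malformed entries (silently skipped) is an assumption.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not the upstream implementation): parse a spec such as
// "te=cpu,vae=cuda0" into a {module -> backend name} map.
std::map<std::string, std::string> parse_backend_spec(const std::string& spec) {
    std::map<std::string, std::string> assignment;
    std::stringstream ss(spec);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const std::string::size_type eq = item.find('=');
        // Assumption: entries without '=', or with an empty key or value,
        // are skipped rather than reported as errors.
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            continue;
        }
        assignment[item.substr(0, eq)] = item.substr(eq + 1);
    }
    return assignment;
}

// Hypothetical lookup: return the backend assigned to `module`,
// or `fallback` when the spec did not mention it.
std::string lookup_backend(const std::map<std::string, std::string>& assignment,
                           const std::string& module,
                           const std::string& fallback) {
    assert(!fallback.empty());
    const auto it = assignment.find(module);
    return it == assignment.end() ? fallback : it->second;
}
```

A per-module map like this is one way a runtime-backend axis and a params-backend axis can share the same data structure, which is the property the stable-diffusion.cpp conflict resolution above relies on.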
Brings in the upstream files that LTX-2.3 (leejet/stable-diffusion.cpp leejet#1463) depends on, without merging unrelated commits from upstream master:

- src/tokenizers/ directory restructure (replaces src/tokenize_util.* and src/vocab/), adds Gemma 3 tokenizer (gemma_tokenizer.*, gemma_merges.hpp, gemma_vocab.hpp)
- src/llm.hpp updated to include the GEMMA3_12B architecture path used by LTX-2.3's text encoder
- src/conditioner.hpp adds LTXAVEmbedder and LTXAVTextProjectionRunner
- src/common_dit.hpp adds patchify3d / unpatchify3d
- src/denoiser.hpp adds LTX2Scheduler and the LTX2_SCHEDULER enum value
- src/diffusion_model.hpp adds LTXAVDiffusionExtra + video_positions / frame_rate fields to DiffusionExtraParams
- src/ggml_extend.hpp adds force_prec_f32 to Conv3d for the VAE encoder block that overflows in F16
- src/ltx_vae.hpp (NEW) spatiotemporal Video-VAE encoder/decoder
- src/ltxv.hpp replaces the 72-line LTX-Video v1 stub with full LTX-2.3 DiT
- src/ltx_audio_vae.h (present for completeness; audio remains unwired and out of scope for this fork's video-only target)
- src/vae.hpp adds VERSION_LTXAV scale factor branch
- src/wan.hpp patchify/unpatchify helpers exposed as static for reuse
- src/model.* adds LTX-2 architecture detection

This is the staging layer. The next commit cherry-picks the LTX-2.3 wiring (stable-diffusion.cpp, CLI, headers) on top.
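As background for the patchify3d / unpatchify3d entry: 3D patchification generally turns a video latent into a sequence of tokens, one per (time, height, width) patch, and the inverse restores the latent. The sketch below shows the idea on a flat row-major [C, T, H, W] buffer. It is an illustration only: the function names, the memory layout, and the channel-major ordering inside each token are assumptions, and the real helpers in src/common_dit.hpp operate on ggml tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dims  { std::size_t C, T, H, W; };   // latent shape, row-major [C, T, H, W]
struct Patch { std::size_t pt, ph, pw; };   // patch size along T, H, W

// Map a latent coordinate (c, t, h, w) to its position in the token buffer.
// Tokens are ordered by (t-patch, h-patch, w-patch); inside a token the
// values are ordered by (c, t offset, h offset, w offset). Assumed layout.
static std::size_t token_offset(Dims d, Patch p, std::size_t c, std::size_t t,
                                std::size_t h, std::size_t w) {
    const std::size_t nh = d.H / p.ph, nw = d.W / p.pw;
    const std::size_t token_dim = d.C * p.pt * p.ph * p.pw;
    const std::size_t token = ((t / p.pt) * nh + h / p.ph) * nw + w / p.pw;
    const std::size_t inner = ((c * p.pt + t % p.pt) * p.ph + h % p.ph) * p.pw + w % p.pw;
    return token * token_dim + inner;
}

// Shared loop: copy latent -> tokens (forward) or tokens -> latent (inverse).
static std::vector<float> remap(const std::vector<float>& in, Dims d, Patch p,
                                bool forward) {
    assert(in.size() == d.C * d.T * d.H * d.W);
    assert(p.pt > 0 && p.ph > 0 && p.pw > 0);
    assert(d.T % p.pt == 0 && d.H % p.ph == 0 && d.W % p.pw == 0);
    std::vector<float> out(in.size());
    for (std::size_t c = 0; c < d.C; ++c)
        for (std::size_t t = 0; t < d.T; ++t)
            for (std::size_t h = 0; h < d.H; ++h)
                for (std::size_t w = 0; w < d.W; ++w) {
                    const std::size_t src = ((c * d.T + t) * d.H + h) * d.W + w;
                    const std::size_t dst = token_offset(d, p, c, t, h, w);
                    if (forward) out[dst] = in[src];
                    else         out[src] = in[dst];
                }
    return out;
}

// [C, T, H, W] -> [num_tokens, C * pt * ph * pw]
std::vector<float> patchify3d_sketch(const std::vector<float>& x, Dims d, Patch p) {
    return remap(x, d, p, true);
}

// [num_tokens, C * pt * ph * pw] -> [C, T, H, W]
std::vector<float> unpatchify3d_sketch(const std::vector<float>& tokens, Dims d, Patch p) {
    return remap(tokens, d, p, false);
}
```

Because both directions share one index mapping, `unpatchify3d_sketch(patchify3d_sketch(x, d, p), d, p)` returns `x` unchanged for any shape whose T, H, and W are divisible by the patch size.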
…t#1463)

Cherry-picks upstream leejet/stable-diffusion.cpp commit 67dda3f ("feat: add ltx2.3 support") on top of the staged infrastructure.

Conflicts resolved:

* .gitmodules: kept only the ggml submodule; the LTX-2.3 video build does not need the upstream sdcpp-webui frontend, libwebp or libwebm submodules.
* ggml: bumped to leejet/ggml 7f4ab364 (re-init, depth-1 clone) which carries the IM2COL_3D / PAD ops the Video-VAE requires.
* stable-diffusion.cpp: taken from upstream wholesale. The fork had no fork-specific changes here; all 27 hunks were upstream's own refactor between the fork's master and the LTX-2.3 commit. The file now contains the LTXAVModel and LTXVideoVAE construction paths, the LTXAVEmbedder conditioner branch, process_ltxav_video_timesteps(), the LTX I2V latent preparation, and the new bool generate_video() signature with sd_image_t** and sd_audio_t** out-parameters.
* include/stable-diffusion.h: taken wholesale (new generate_video signature, embeddings_connectors_path, audio_vae_path, LTX2_SCHEDULER, sd_audio_t).
* examples/cli/main.cpp, CMakeLists.txt, examples/cli/CMakeLists.txt, examples/cli/README.md, examples/server/README.md: upstream wholesale (LTX-2 CLI flags, WebP/WebM detection, tokenizers CMake).
* examples/common/: brought in the upstream split (common.cpp, common.h, media_io.cpp, media_io.h, log.cpp, log.h, resource_owners.hpp) plus examples/server/async_jobs.cpp and src/tensor_ggml.hpp.
* docs/ltx2.md: merged my long-form guide with the upstream real CLI examples and weight URLs so a reviewer can build-to-first-video from the README.
* src/ltx2_api.cpp: updated to the new generate_video() signature, dropped the HAVE_LTX2_SYNC guards now that LTX2_SCHEDULER is in scheduler_t. Calls the real LTX2_SCHEDULER + Euler sampler + flow_shift 2.37. Audio output is freed and discarded (video-only build).

clang -std=c++17 -fsyntax-only on src/ltx2_api.cpp passes.
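For readers unfamiliar with a flow shift parameter such as the 2.37 mentioned above: in many flow-matching samplers, the shift remaps a uniform sigma schedule so that more steps are spent at high noise. The sketch below uses one common form of that remapping, sigma' = shift * sigma / (1 + (shift - 1) * sigma). Whether LTX2_SCHEDULER applies this exact formula is not stated in the commit message, so treat both the formula and the helper names as assumptions.

```cpp
#include <cassert>
#include <vector>

// Assumed form of the time shift used by many flow-matching samplers.
// Maps sigma in [0, 1] to [0, 1]; shift > 1 pushes values toward 1.
double shift_sigma(double sigma, double shift) {
    assert(sigma >= 0.0 && sigma <= 1.0 && shift > 0.0);
    return shift * sigma / (1.0 + (shift - 1.0) * sigma);
}

// Hypothetical helper: build steps + 1 sigmas running from 1 down to 0,
// starting from a uniform grid and applying the shift to each point.
std::vector<double> shifted_schedule(int steps, double shift) {
    assert(steps > 0);
    std::vector<double> sigmas;
    sigmas.reserve(static_cast<std::size_t>(steps) + 1);
    for (int i = 0; i <= steps; ++i) {
        const double uniform = 1.0 - static_cast<double>(i) / steps;
        sigmas.push_back(shift_sigma(uniform, shift));
    }
    return sigmas;
}
```

Under this assumed formula, a shift of 2.37 moves the midpoint of the uniform grid (0.5) up to roughly 0.70, while the endpoints 1 and 0 stay fixed.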
The cherry-pick of leejet#1463 missed three files in examples/common/ (log.cpp, log.h, resource_owners.hpp) and two in examples/cli/ (image_metadata.cpp, image_metadata.h). The new examples/cli/CMakeLists.txt explicitly references these sources, so cmake configuration failed without them.

Also removes examples/common/common.hpp which was replaced upstream by the common.cpp + common.h split.

CMake now configures cleanly with the Metal backend on Apple Silicon (ggml 7f4ab36, BLAS via Accelerate). Build verified with:

    cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
        -DSD_METAL=ON -DSD_WEBP=OFF -DSD_WEBM=OFF \
        -DBUILD_SHARED_LIBS=OFF
…on M3 Metal

The cherry-pick of leejet#1463 left the fork in a mixed-version src/ state: some files at 67dda3f, others at the older fork master (c8fb3d2). That broke the include graph (clip.hpp included src/tokenize_util.h, ggml_extend.hpp included ggml_extend_backend.h, model.cpp expected model_io/, denoiser.hpp referenced ER_SDE / EULER_CFG_PP sample-method enums missing from the public header, etc.).

This commit fully aligns src/, examples/, and include/ to 67dda3f:

* src/: bulk-checkout every src/*.cpp/hpp/h from 67dda3f, then keep src/ltx2_api.cpp (fork-specific). Adds model_io/, ggml_graph_cut.{cpp,h}, ggml_extend_backend.{cpp,h}, sample-cache.{cpp,h}, condition_cache_utils.hpp, auto_encoder_kl.hpp, convert.cpp, ernie_image.hpp, hidream_o1.hpp, rng.hpp (relocated), spectrum.hpp, tensor.hpp, upscaler.h. Removes stale src/gguf_reader.hpp (replaced by model_io/gguf_io.*).
* include/stable-diffusion.h: bumped to 67dda3f, adds the new sample methods (ER_SDE_SAMPLE_METHOD, EULER_CFG_PP_SAMPLE_METHOD, EULER_A_CFG_PP_SAMPLE_METHOD) the LTXAV denoiser depends on.
* src/ltx2_api.cpp: untouched. The post-sync edits committed earlier already match the new generate_video() signature.

Verified locally on Apple M3 (Metal 4 + CPU NEON + BLAS Accelerate):

    cmake -G Ninja -DSD_METAL=ON -DSD_WEBP=OFF -DSD_WEBM=OFF \
        -DBUILD_SHARED_LIBS=OFF ..
    ninja sd-cli

Produces a 35 MB sd-cli binary that recognises every LTX-2 flag:

    --mode vid_gen, --diffusion-model, --vae, --llm, --embeddings-connectors, --audio-vae, --init-img, --end-img, --video-frames, --fps, --temporal-tiling, --sampling-method euler --scheduler ltx2

Net diff vs fork master: 95 files, well under the 174 files of 64johnlee's unmergeable mega-merge. "Minor modifications before merge" acceptance criterion stays intact.
LTX-2.3 dev T2V
t2v.webm
LTX-2.3 dev I2V
i2v.webm