Commit Graph

389 Commits

Author SHA1 Message Date
be95871adc feat: Gemma4 text generation support (CORE-30) (#13376)
* initial gemma4 support

* parity with reference implementation

outputs can 100% match transformers with same sdpa flags, checkpoint this and then optimize

* Cleanup, video fixes

* cleanup, enable fused rms norm by default

* update comment

* Cleanup

* Update sd.py

* Various fixes

* Add fp8 scaled embedding support

* small fixes

* Translate think tokens

* Fix image encoder attention mask type

So it works with basic attention

* Handle thinking tokens different only for Gemma4

* Code cleanup

* Update nodes_textgen.py

* Use embed scale class instead of buffer

Slight difference to HF, but technically more accurate and simpler code

* Default to fused rms_norm

* Update gemma4.py
2026-05-02 22:46:15 -04:00
a164c82913 Add high quality preview support for Flux2 latents (#13496) 2026-04-29 19:37:30 -04:00
5eeae3f1d8 Cogvideox (#13402)
---------

Co-authored-by: kijai <40791699+kijai@users.noreply.github.com>
Co-authored-by: Talmaj Marinc <talmaj@comfy.org>
2026-04-29 19:30:08 -04:00
eb22225387 Support standalone LTXV audio VAEs (#13499) 2026-04-21 10:46:37 -07:00
ad94d47221 Make the ltx audio vae more native. (#13486) 2026-04-21 11:02:42 -04:00
31283d2892 Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
b615af1c65 Add support for small flux.2 decoder (#13314) 2026-04-07 03:44:18 -04:00
e2ddf28d78 Fix some fp8 scaled checkpoints no longer working. (#13239) 2026-03-31 14:27:17 -07:00
3f77450ef1 Fix #13214 (#13216) 2026-03-28 22:35:59 -04:00
b353a7c863 Integrate RAM cache with model RAM management (#13173) 2026-03-27 21:34:16 -04:00
404d7b9978 feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
82b868a45a Fix VRAM leak in tiler fallback in video VAEs (#13073)
* sd: soft_empty_cache on tiler fallback

This doesnt cost a lot and creates the expected VRAM reduction in
resource monitors when you fallback to tiler.

* wan: vae: Don't recursion in local fns (move run_up)

Moved Decoder3d’s recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.

* ltx: vae: Don't recursion in local fns (move run_up)

Mov the recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.
2026-03-19 22:30:27 -04:00
fabed694a2 ltx: vae: implement chunked encoder + CPU IO chunking (Big VRAM reductions) (#13062)
* ltx: vae: add cache state to downsample block

* ltx: vae: Add time stride awareness to causal_conv_3d

* ltx: vae: Automate truncation for encoder

Other VAEs just truncate without error. Do the same.

* sd/ltx: Make chunked_io a flag in its own right

Taking this bi-direcitonal, so make it a for-purpose named flag.

* ltx: vae: implement chunked encoder + CPU IO chunking

People are doing things with big frame counts in LTX including V2V
flows. Implement the time-chunked encoder to keep the VRAM down, with
the converse of the new CPU pre-allocation technique, where the chunks
are brought from the CPU JIT.

* ltx: vae-encode: round chunk sizes more strictly

Only powers of 2 and multiple of 8 are valid due to cache slicing.
2026-03-19 09:58:47 -07:00
9fff091f35 Further Reduce LTX VAE decode peak RAM usage (#13052) 2026-03-18 18:32:26 -04:00
68d542cc06 Fix case where pixel space VAE could cause issues. (#13030) 2026-03-17 20:46:22 -04:00
735a0465e5 Inplace VAE output processing to reduce peak RAM consumption. (#13028) 2026-03-17 20:20:49 -04:00
c711b8f437 Add --fp16-intermediates to use fp16 for intermediate values between nodes (#12953)
This is an experimental WIP option that might not work in your workflow but
should lower memory usage if it does.

Currently only the VAE and the load image node will output in fp16 when
this option is turned on.
2026-03-14 19:18:19 -04:00
535c16ce6e Widen OOM_EXCEPTION to AcceleratorError form (#12835)
Pytorch only filters for OOMs in its own allocators however there are
paths that can OOM on allocators made outside the pytorch allocators.
These manifest as an AllocatorError as pytorch does not have universal
error translation to its OOM type on exception. Handle it. A log I have
for this also shows a double report of the error async, so call the
async discarder to cleanup and make these OOMs look like OOMs.
2026-03-10 00:41:02 -04:00
43c64b6308 Support the LTXAV 2.3 model. (#12773) 2026-03-04 20:06:20 -05:00
0a7446ade4 Pass tokens when loading text gen model for text generation (#12755)
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
2026-03-04 08:59:56 -08:00
5f41584e96 Disable dynamic_vram when weight hooks applied (#12653)
* sd: add support for clip model reconstruction

* nodes: SetClipHooks: Demote the dynamic model patcher

* mp: Make dynamic_disable more robust

The backup need to not be cloned. In addition add a delegate object
to ModelPatcherDynamic so that non-cloning code can do
ModelPatcherDynamic demotion

* sampler_helpers: Demote to non-dynamic model patcher when hooking

* code rabbit review comments
2026-02-28 16:50:18 -05:00
ac4412d0fa Native LongCat-Image implementation (#12597) 2026-02-27 23:04:34 -05:00
907e5dcbbf initial FlowRVS support (#12637) 2026-02-25 23:38:46 -05:00
3ebe1ac22e Disable dynamic_vram when using torch compiler (#12612)
* mp: attach re-construction arguments to model patcher

When making a model-patcher from a unet or ckpt, attach a callable
function that can be called to replay the model construction. This
can be used to deep clone model patcher WRT the actual model.

Originally written by Kosinkadink
f4b99bc623

* mp: Add disable_dynamic clone argument

Add a clone argument that lets a caller clone a ModelPatcher but disable
dynamic to demote the clone to regular MP. This is useful for legacy
features where dynamic_vram support is missing or TBD.

* torch_compile: disable dynamic_vram

This is a bigger feature. Disable for the interim to preserve
functionality.
2026-02-24 19:13:46 -05:00
0301ccf745 Small cleanup and try to get qwen 3 work with the text gen. (#12537) 2026-02-19 22:42:28 -05:00
6d11cc7354 feat: Add basic text generation support with native models, initially supporting Gemma3 (#12392) 2026-02-18 20:49:43 -05:00
f719f9c062 sd: delay VAE dtype archive until after override (#12388)
VAEs have host specific dtype logic that should override the dynamic
_model_dtype. Defer the archiving of model dtypes until after.
2026-02-10 13:37:46 -05:00
35183543e0 Add VAE tiled decode node for audio. (#12299) 2026-02-05 01:12:04 -05:00
fe2511468d Support the 4B ace step 1.5 lm model. (#12257)
Can be used as an alternative to the 1.7B
2026-02-03 19:01:38 -05:00
b8315e66cb Fix tiled vae for ace step 1.5 (#12253) 2026-02-03 14:40:45 -05:00
3c1a1a2df8 Basic support for the ace step 1.5 model. (#12237) 2026-02-03 00:06:18 -05:00
f8acd9c402 Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading (#11845) 2026-02-01 01:01:11 -05:00
a97c98068f [Weight-adapter/Trainer] Bypass forward mode in Weight adapter system (#11958)
* Add API of bypass forward module

* bypass implementation

* add bypass fwd into nodes list/trainer
2026-01-24 22:56:22 -05:00
3365ad18a5 Support LTX2 tiny vae (taeltx_2) (#11929) 2026-01-21 23:03:51 -05:00
abe2ec26a6 Support the Anima model. (#12012) 2026-01-21 19:44:28 -05:00
3b832231bb Flux2 Klein support. (#11890) 2026-01-15 10:33:15 -05:00
cd912963f1 Fix issue with t5 text encoder in fp4. (#11794) 2026-01-10 17:31:31 -05:00
50d6e1caf4 Tweak ltxv vae mem estimation. (#11722) 2026-01-07 23:07:05 -05:00
25bc1b5b57 Add memory estimation function to ltxav text encoder. (#11716) 2026-01-07 20:11:22 -05:00
f2b002372b Support the LTXV 2 model. (#11632) 2026-01-05 01:58:59 -05:00
4c432c11ed Implement Jina CLIP v2 and NewBie dual CLIP (#11415)
* Implement Jina CLIP v2

* Support quantized Gemma in NewBie dual CLIP
2025-12-20 00:57:22 -05:00
6a2678ac65 Trim/pad channels in VAE code. (#11406) 2025-12-18 18:22:38 -05:00
e4fb3a3572 Support loading Wan/Qwen VAEs with different in/out channels. (#11405) 2025-12-18 17:45:33 -05:00
e711aaf1a7 Lower VAE loading requirements:Create a new branch for GPU memory calculations in qwen-image vae (#11199) 2025-12-10 22:02:26 -05:00
e136b6dbb0 dequantization offload accounting (fixes Flux2 OOMs - incl TEs) (#11171)
* make setattr safe for non existent attributes

Handle the case where the attribute doesnt exist by returning a static
sentinel (distinct from None). If the sentinel is passed in as the set
value, del the attr.

* Account for dequantization and type-casts in offload costs

When measuring the cost of offload, identify weights that need a type
change or dequantization and add the size of the conversion result
to the offload cost.

This is mutually exclusive with lowvram patches which already has
a large conservative estimate and wont overlap the dequant cost so\
dont double count.

* Set the compute type on CLIP MPs

So that the loader can know the size of weights for dequant accounting.
2025-12-08 23:21:31 -05:00
fd109325db Kandinsky5 model support (#10988)
* Add Kandinsky5 model support

lite and pro T2V tested to work

* Update kandinsky5.py

* Fix fp8

* Fix fp8_scaled text encoder

* Add transformer_options for attention

* Code cleanup, optimizations, use fp32 for all layers originally at fp32

* ImageToVideo -node

* Fix I2V, add necessary latent post process nodes

* Support text to image model

* Support block replace patches (SLG mostly)

* Support official LoRAs

* Don't scale RoPE for lite model as that just doesn't work...

* Update supported_models.py

* Rever RoPE scaling to simpler one

* Fix typo

* Handle latent dim difference for image model in the VAE instead

* Add node to use different prompts for clip_l and qwen25_7b

* Reduce peak VRAM usage a bit

* Further reduce peak VRAM consumption by chunking ffn

* Update chunking

* Update memory_usage_factor

* Code cleanup, don't force the fp32 layers as it has minimal effect

* Allow for stronger changes with first frames normalization

Default values are too weak for any meaningful changes, these should probably be exposed as advanced node options when that's available.

* Add image model's own chat template, remove unused image2video template

* Remove hard error in ReplaceVideoLatentFrames -node

* Update kandinsky5.py

* Update supported_models.py

* Fix typos in prompt template

They were now fixed in the original repository as well

* Update ReplaceVideoLatentFrames

Add tooltips
Make source optional
Better handle negative index

* Rename NormalizeVideoLatentFrames -node

For bit better clarity what it does

* Fix NormalizeVideoLatentStart node out on non-op
2025-12-05 22:20:22 -05:00
6fd463aec9 Fix regression when text encoder loaded directly on GPU. (#11129) 2025-12-05 15:33:16 -05:00
43071e3de3 Make old scaled fp8 format use the new mixed quant ops system. (#11000) 2025-12-05 14:35:42 -05:00
9bc893c5bb sd: bump HY1.5 VAE estimate (#11107)
Im able to push vram above estimate on partial unload. Bump the
estimate. This is experimentally determined with a 720P and 480P
datapoint calibrating for 24GB VRAM total.
2025-12-04 09:50:36 -08:00
f4bdf5f830 sd: revise hy VAE VRAM (#11105)
This was recently collapsed down to rolling VAE through temporal. Clamp
The time dimension.
2025-12-04 09:50:04 -08:00