Commit Graph

2132 Commits

Author SHA1 Message Date
404d7b9978 feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
5ebb0c2e0b FP8 bwd training (#13121) 2026-03-24 20:39:04 -04:00
e87858e974 feat: LTX2: Support reference audio (ID-LoRA) (#13111) 2026-03-23 18:22:24 -04:00
d49420b3c7 LongCat-Image edit (#13003) 2026-03-21 23:51:05 -04:00
25b6d1d629 wan: vae: Fix light/color change (#13101)
There was an issue where the resample split was too early and dropped one
of the rolling convolutions a frame early. This is most noticable as a
lighting/color change between pixel frames 5->6 (latent 2->3), or as a
lighting change between the first and last frame in an FLF wan flow.
2026-03-21 18:44:35 -04:00
11c15d8832 Fix fp16 intermediates giving different results. (#13100) 2026-03-21 17:53:25 -04:00
b5d32e6ad2 Fix sampling issue with fp16 intermediates. (#13099) 2026-03-21 17:47:42 -04:00
87cda1fc25 Move inline comfy.context_windows imports to top-level in model_base.py (#13083)
The recent PR that added resize_cond_for_context_window methods to
model classes used inline 'import comfy.context_windows' in each
method body. This moves that import to the top-level import section,
replacing 4 duplicate inline imports with a single top-level one.
2026-03-20 20:03:42 -04:00
589228e671 Add slice_cond and per-model context window cond resizing (#12645)
* Add slice_cond and per-model context window cond resizing

* Fix cond_value.size() call in context window cond resizing

* Expose additional advanced inputs for ContextWindowsManualNode

Necessary for WanAnimate context windows workflow, which needs cond_retain_index_list = 0 to work properly with its reference input.

---------
2026-03-19 20:42:42 -07:00
f49856af57 ltx: vae: Fix missing init variable (#13074)
Forgot to push this ammendment. Previous test results apply to this.
2026-03-19 22:34:58 -04:00
82b868a45a Fix VRAM leak in tiler fallback in video VAEs (#13073)
* sd: soft_empty_cache on tiler fallback

This doesnt cost a lot and creates the expected VRAM reduction in
resource monitors when you fallback to tiler.

* wan: vae: Don't recursion in local fns (move run_up)

Moved Decoder3d’s recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.

* ltx: vae: Don't recursion in local fns (move run_up)

Mov the recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage of tensors which in turn delays
VRAM release before tiled fallback.
2026-03-19 22:30:27 -04:00
8458ae2686 Revert "fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#…" (#13070)
This reverts commit b941913f1d.
2026-03-19 15:27:55 -04:00
fd0261d2bc Reduce tiled decode peak memory (#13050) 2026-03-19 13:29:34 -04:00
ab14541ef7 memory: Add more exclusion criteria to pinned read (#13067) 2026-03-19 10:03:20 -07:00
fabed694a2 ltx: vae: implement chunked encoder + CPU IO chunking (Big VRAM reductions) (#13062)
* ltx: vae: add cache state to downsample block

* ltx: vae: Add time stride awareness to causal_conv_3d

* ltx: vae: Automate truncation for encoder

Other VAEs just truncate without error. Do the same.

* sd/ltx: Make chunked_io a flag in its own right

Taking this bi-direcitonal, so make it a for-purpose named flag.

* ltx: vae: implement chunked encoder + CPU IO chunking

People are doing things with big frame counts in LTX including V2V
flows. Implement the time-chunked encoder to keep the VRAM down, with
the converse of the new CPU pre-allocation technique, where the chunks
are brought from the CPU JIT.

* ltx: vae-encode: round chunk sizes more strictly

Only powers of 2 and multiple of 8 are valid due to cache slicing.
2026-03-19 09:58:47 -07:00
f6b869d7d3 fp16 intermediates doen't work for some text enc models. (#13056) 2026-03-18 19:42:28 -04:00
56ff88f951 Fix regression. (#13053) 2026-03-18 18:35:25 -04:00
9fff091f35 Further Reduce LTX VAE decode peak RAM usage (#13052) 2026-03-18 18:32:26 -04:00
dcd659590f Make more intermediate values follow the intermediate dtype. (#13051) 2026-03-18 18:14:18 -04:00
b941913f1d fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#12809)
On Apple Silicon, `vram_state` is set to `VRAMState.SHARED` because
CPU and GPU share unified memory. However, `text_encoder_device()`
only checked for `HIGH_VRAM` and `NORMAL_VRAM`, causing all text
encoders to fall back to CPU on MPS devices.

Adding `VRAMState.SHARED` to the condition allows non-quantized text
encoders (e.g. bf16 Gemma 3 12B) to run on the MPS GPU, providing
significant speedup for text encoding and prompt generation.

Note: quantized models (fp4/fp8) that use float8_e4m3fn internally
will still fall back to CPU via the `supports_cast()` check in
`CLIP.__init__()`, since MPS does not support fp8 dtypes.
2026-03-17 21:21:32 -04:00
cad24ce262 cascade: remove dead weight init code (#13026)
This weight init process is fully shadowed be the weight load and
doesnt work in dynamic_vram were the weight allocation is deferred.
2026-03-17 20:59:10 -04:00
68d542cc06 Fix case where pixel space VAE could cause issues. (#13030) 2026-03-17 20:46:22 -04:00
735a0465e5 Inplace VAE output processing to reduce peak RAM consumption. (#13028) 2026-03-17 20:20:49 -04:00
035414ede4 Reduce WAN VAE VRAM, Save use cases for OOM/Tiler (#13014)
* wan: vae: encoder: Add feature cache layer that corks singles

If a downsample only gives you a single frame, save it to the feature
cache and return nothing to the top level. This increases the
efficiency of cacheability, but also prepares support for going two
by two rather than four by four on the frames.

* wan: remove all concatentation with the feature cache

The loopers are now responsible for ensuring that non-final frames are
processes at least two-by-two, elimiating the need for this cat case.

* wan: vae: recurse and chunk for 2+2 frames on decode

Avoid having to clone off slices of 4 frame chunks and reduce the size
of the big 6 frame convolutions down to 4. Save the VRAMs.

* wan: encode frames 2x2.

Reduce VRAM usage greatly by encoding frames 2 at a time rather than
4.

* wan: vae: remove cloning

The loopers now control the chunking such there is noever more than 2
frames, so just cache these slices directly and avoid the clone
allocations completely.

* wan: vae: free consumer caller tensors on recursion

* wan: vae: restyle a little to match LTX
2026-03-17 17:34:39 -04:00
1a157e1f97 Reduce LTX VAE VRAM usage and save use cases from OOMs/Tiler (#13013)
* ltx: vae: scale the chunk size with the users VRAM

Scale this linearly down for users with low VRAM.

* ltx: vae: free non-chunking recursive intermediates

* ltx: vae: cleanup some intermediates

The conv layer can be the VRAM peak and it does a torch.cat. So cleanup
the pieces of the cat. Also clear our the cache ASAP as each layer detect
its end as this VAE surges in VRAM at the end due to the ended padding
increasing the size of the final frame convolutions off-the-books to
the chunker. So if all the earlier layers free up their cache it can
offset that surge.

Its a fragmentation nightmare, and the chance of it having to recache the
pyt allocator is very high, but you wont OOM.
2026-03-17 17:32:43 -04:00
8cc746a864 fix: disable SageAttention for Hunyuan3D v2.1 DiT (#12772) 2026-03-16 22:27:27 -04:00
ca17fc8355 Fix potential issue. (#13009) 2026-03-16 21:38:40 -04:00
20561aa919 [Trainer] FP4, 8, 16 training by native dtype support and quant linear autograd function (#12681) 2026-03-16 21:31:50 -04:00
7a16e8aa4e Add --enable-dynamic-vram options to force enable it. (#13002) 2026-03-16 16:50:13 -04:00
b202f842af Skip running model finalizers at exit (#12994) 2026-03-16 16:00:42 -04:00
3814bf4454 Enable Pytorch Attention for gfx1150 (#12973) 2026-03-15 12:45:30 -07:00
e84a200a3c ops: opt out of deferred weight init if subclassed (#12967)
If a subclass BYO _load_from_state_dict and doesnt call the super() the
needed default init of these weights is missed and can lead to problems
for uninitialized weights.
2026-03-15 11:49:49 -07:00
0904cc3fe5 LTXV: Accumulate VAE decode results on intermediate_device (#12955) 2026-03-14 18:09:09 -07:00
c711b8f437 Add --fp16-intermediates to use fp16 for intermediate values between nodes (#12953)
This is an experimental WIP option that might not work in your workflow but
should lower memory usage if it does.

Currently only the VAE and the load image node will output in fp16 when
this option is turned on.
2026-03-14 19:18:19 -04:00
1c5db7397d feat: Support mxfp8 (#12907) 2026-03-14 18:36:29 -04:00
7810f49702 comfy aimdo 0.2.11 + Improved RAM Pressure release strategies - Windows speedups (#12925)
* Implement seek and read for pins

Source pins from an mmap is pad because its its a CPU->CPU copy that
attempts to fully buffer the same data twice. Instead, use seek and
read which avoids the mmap buffering while usually being a faster
read in the first place (avoiding mmap faulting etc).

* pinned_memory: Use Aimdo pinner

The aimdo pinner bypasses pytorches CPU allocator which can leak
windows commit charge.

* ops: bypass init() of weight for embedding layer

This similarly consumes large commit charge especially for TEs. It can
cause a permanement leaked commit charge which can destabilize on
systems close to the commit ceiling and generally confuses the RAM
stats.

* model_patcher: implement pinned memory counter

Implement a pinned memory counter for better accounting of what volume
of memory pins have.

* implement touch accounting

Implement accounting of touching mmapped tensors.

* mm+mp: add residency mmap getter

* utils: use the aimdo mmap to load sft files

* model_management: Implement tigher RAM pressure semantics

Implement a pressure release on entire MMAPs as windows does perform
faster when mmaps are unloaded and model loads free ramp into fully
unallocated RAM.

Make the concept of freeing for pins a completely separate concept.
Now that pins are loadable directly from original file and don' touch
the mmap, tighten the freeing budget to just the current loaded model
- what you have left over. This still over-frees pins, but its a lot
better than before.

So after the pins are freed with that algorithm, bounce entire MMAPs
to free RAM based on what the model needs, deducting off any known
resident-in-mmap tensors to the free quota to keep it as tight as
possible.

* comfy-aimdo 0.2.11

Comfy aimdo 0.2.11

* mm: Implement file_slice path for QT

* ruff

* ops: put meta-tensors in place to allow custom nodes to check geo
2026-03-13 22:18:08 -04:00
3fa8c5686d fix: use frontend-compatible format for Float gradient_stops (#12789)
Co-authored-by: guill <jacob.e.segal@gmail.com>
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
2026-03-12 10:14:28 -07:00
44f1246c89 Support flux 2 klein kv cache model: Use the FluxKVCache node. (#12905) 2026-03-12 11:30:50 -04:00
f6274c06b4 Fix issue with batch_size > 1 on some models. (#12892) 2026-03-11 16:37:31 -04:00
4f4f8659c2 fix: guard torch.AcceleratorError for compatibility with torch < 2.8.0 (#12874)
* fix: guard torch.AcceleratorError for compatibility with torch < 2.8.0

torch.AcceleratorError was introduced in PyTorch 2.8.0. Accessing it
directly raises AttributeError on older versions. Use a try/except
fallback at module load time, consistent with the existing pattern used
for OOM_EXCEPTION.


* fix: address review feedback for AcceleratorError compat

- Fall back to RuntimeError instead of type(None) for ACCELERATOR_ERROR,
  consistent with OOM_EXCEPTION fallback pattern and valid for except clauses
- Add "out of memory" message introspection for RuntimeError fallback case
- Use RuntimeError directly in discard_cuda_async_error except clause
---------
2026-03-11 10:04:13 -07:00
9642e4407b Add pre attention and post input patches to qwen image model. (#12879) 2026-03-11 00:09:35 -04:00
3ad36d6be6 Allow model patches to have a cleanup function. (#12878)
The function gets called after sampling is finished.
2026-03-10 20:09:12 -04:00
535c16ce6e Widen OOM_EXCEPTION to AcceleratorError form (#12835)
Pytorch only filters for OOMs in its own allocators however there are
paths that can OOM on allocators made outside the pytorch allocators.
These manifest as an AllocatorError as pytorch does not have universal
error translation to its OOM type on exception. Handle it. A log I have
for this also shows a double report of the error async, so call the
async discarder to cleanup and make these OOMs look like OOMs.
2026-03-10 00:41:02 -04:00
a912809c25 model_detection: deep clone pre edited edited weights (#12862)
Deep clone these weights as needed to avoid segfaulting when it tries
to touch the original mmap.
2026-03-09 23:50:10 -04:00
c4fb0271cd Add a way for nodes to add pre attn patches to flux model. (#12861) 2026-03-09 23:37:58 -04:00
06f85e2c79 Fix text encoder lora loading for wrapped models (#12852) 2026-03-09 16:08:51 -04:00
29b24cb517 refactor(assets): modular architecture + async two-phase scanner & background seeder (#12621) 2026-03-07 20:37:25 -05:00
bcf1a1fab1 mm: reset_cast_buffers: sync compute stream before free (#12822)
Sync the compute stream before freeing the cast buffers. This can cause
use after free issues when the cast stream frees the buffer while the
compute stream is behind enough to still needs a casted weight.
2026-03-07 09:38:08 -08:00
d69d30819b Don't run TE on cpu when dynamic vram enabled. (#12815) 2026-03-06 19:11:16 -05:00
f466b06601 Fix fp16 audio encoder models (#12811)
* mp: respect model_defined_dtypes in default caster

This is needed for parametrizations when the dtype changes between sd
and model.

* audio_encoders: archive model dtypes

Archive model dtypes to stop the state dict load override the dtypes
defined by the core for compute etc.
2026-03-06 18:20:07 -05:00