Commit Graph

2153 Commits

Author SHA1 Message Date
5b6726d6b7 fix: use Parameter assignment for Stable_Zero123 cc_projection weights (fixes #13492) (#13518)
On Windows with aimdo enabled, disable_weight_init.Linear uses lazy
initialization that sets weight and bias to None to avoid unnecessary
memory allocation. This caused a crash when copy_() was called on the
None weight attribute in Stable_Zero123.__init__.

Replace copy_() with direct torch.nn.Parameter assignment, which works
correctly on both Windows (aimdo enabled) and other platforms.
2026-04-23 17:18:59 -04:00
b41ab53b6f Use ErnieTEModel_ not ErnieTEModel. (#13431) 2026-04-16 10:11:58 -04:00
1de83f91c3 Fix OOM regression in _apply() for quantized models during inference (#13372)
Skip unnecessary clone of inference-mode tensors when already inside
torch.inference_mode(), matching the existing guard in set_attr_param.
The unconditional clone introduced in 20561aa9 caused transient VRAM
doubling during model movement for FP8/quantized models.
2026-04-15 02:10:36 -07:00
cb0bbde402 Fix ernie on devices that don't support fp64. (#13414) 2026-04-14 22:54:47 -04:00
722bc73319 Make text generation work with ministral model. (#13395)
Needs template before it works properly.
2026-04-13 20:43:57 -04:00
402ff1cdb7 Fix issue with ernie image. (#13393) 2026-04-13 16:38:42 -04:00
c2657d5fb9 Fix typo. (#13382) 2026-04-12 23:37:13 -04:00
31283d2892 Implement Ernie Image model. (#13369) 2026-04-11 22:29:31 -04:00
55ebd287ee Add a supports_fp64 function. (#13368) 2026-04-11 21:06:36 -04:00
a134423890 SDPose: resize input always (#13349) 2026-04-10 11:26:55 -10:00
b615af1c65 Add support for small flux.2 decoder (#13314) 2026-04-07 03:44:18 -04:00
40862c0776 Support Ace Step 1.5 XL model. (#13317) 2026-04-07 03:13:47 -04:00
0c63b4f6e3 Remove dead code. (#13251) 2026-04-01 20:22:06 -04:00
e2ddf28d78 Fix some fp8 scaled checkpoints no longer working. (#13239) 2026-03-31 14:27:17 -07:00
8d723d2caa Fix/tweak pinned memory accounting (#13221)
* mm: Lower windows pin threshold

Some workflows have more extraneous use of shared GPU memory than is
accounted for in the 5% pin headroom. Lower this for safety.

* mm: Remove pin count clearing threshold.

TOTAL_PINNED_MEMORY is shared between the legacy and aimdo pinning
systems, however this catch-all assumes only the legacy system exists.
Remove the catch-all as the PINNED_MEMORY buffer is coherent already.
2026-03-29 16:43:24 -07:00
a500f1edac CORE-13 feat: Support RT-DETRv4 detection model (#12748) 2026-03-28 23:34:10 -04:00
3f77450ef1 Fix #13214 (#13216) 2026-03-28 22:35:59 -04:00
b353a7c863 Integrate RAM cache with model RAM management (#13173) 2026-03-27 21:34:16 -04:00
3a56201da5 Allow flux conditioning without a pooled output. (#13198) 2026-03-27 20:36:26 -04:00
b0fd65e884 fix: regression in text generate with LTXAV model (#13170) 2026-03-26 09:55:05 -07:00
2a1f402601 Make Qwen 8B work with TextGenerate node. (#13160) 2026-03-25 23:21:44 -04:00
404d7b9978 feat: Support Qwen3.5 text generation models (#12771) 2026-03-25 22:48:28 -04:00
5ebb0c2e0b FP8 bwd training (#13121) 2026-03-24 20:39:04 -04:00
e87858e974 feat: LTX2: Support reference audio (ID-LoRA) (#13111) 2026-03-23 18:22:24 -04:00
d49420b3c7 LongCat-Image edit (#13003) 2026-03-21 23:51:05 -04:00
25b6d1d629 wan: vae: Fix light/color change (#13101)
There was an issue where the resample split was too early and dropped one
of the rolling convolutions a frame early. This is most noticeable as a
lighting/color change between pixel frames 5->6 (latent 2->3), or as a
lighting change between the first and last frame in an FLF wan flow.
2026-03-21 18:44:35 -04:00
11c15d8832 Fix fp16 intermediates giving different results. (#13100) 2026-03-21 17:53:25 -04:00
b5d32e6ad2 Fix sampling issue with fp16 intermediates. (#13099) 2026-03-21 17:47:42 -04:00
87cda1fc25 Move inline comfy.context_windows imports to top-level in model_base.py (#13083)
The recent PR that added resize_cond_for_context_window methods to
model classes used inline 'import comfy.context_windows' in each
method body. This moves that import to the top-level import section,
replacing 4 duplicate inline imports with a single top-level one.
2026-03-20 20:03:42 -04:00
589228e671 Add slice_cond and per-model context window cond resizing (#12645)
* Add slice_cond and per-model context window cond resizing

* Fix cond_value.size() call in context window cond resizing

* Expose additional advanced inputs for ContextWindowsManualNode

Necessary for WanAnimate context windows workflow, which needs cond_retain_index_list = 0 to work properly with its reference input.

2026-03-19 20:42:42 -07:00
f49856af57 ltx: vae: Fix missing init variable (#13074)
Forgot to push this amendment. Previous test results apply to this.
2026-03-19 22:34:58 -04:00
82b868a45a Fix VRAM leak in tiler fallback in video VAEs (#13073)
* sd: soft_empty_cache on tiler fallback

This doesn't cost a lot and creates the expected VRAM reduction in
resource monitors when you fall back to the tiler.

* wan: vae: Don't recurse in local fns (move run_up)

Moved Decoder3d’s recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage collection of tensors, which in turn
delays VRAM release before tiled fallback.

* ltx: vae: Don't recurse in local fns (move run_up)

Move the recursive run_up out of forward into a class
method to avoid nested closure self-reference cycles. This avoids
cyclic garbage that delays garbage collection of tensors, which in turn
delays VRAM release before tiled fallback.
2026-03-19 22:30:27 -04:00
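The closure cycle described in 82b868a45a can be reproduced without any VAE code. The following is a minimal sketch, assuming CPython's reference counting; `Payload`, `Decoder`, `forward_with_closure`, and `compare` are hypothetical names for illustration and are not taken from the actual Decoder3d implementation.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large tensor held by a local variable."""


def forward_with_closure():
    payload = Payload()
    ref = weakref.ref(payload)

    def run_up(depth):
        # run_up looks itself up through its enclosing scope, so the
        # function object and its closure cell reference each other.
        if depth > 0:
            run_up(depth - 1)
        return payload

    run_up(2)
    return ref


class Decoder:
    def run_up(self, payload, depth):
        # A method recurses through self, so no closure cell is created.
        if depth > 0:
            self.run_up(payload, depth - 1)
        return payload

    def forward(self):
        payload = Payload()
        ref = weakref.ref(payload)
        self.run_up(payload, 2)
        return ref


def compare():
    """Return (closure alive before gc, closure alive after gc, method alive)."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running mid-experiment
    try:
        closure_ref = forward_with_closure()
        alive_before = closure_ref() is not None
        gc.collect()
        alive_after = closure_ref() is not None
        method_ref = Decoder().forward()
        method_alive = method_ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after, method_alive
```

In the closure variant the payload survives until the cyclic collector runs, while in the method variant it is released as soon as `forward` returns. With real tensors, that delay is what would postpone the VRAM release.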
8458ae2686 Revert "fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#…" (#13070)
This reverts commit b941913f1d.
2026-03-19 15:27:55 -04:00
fd0261d2bc Reduce tiled decode peak memory (#13050) 2026-03-19 13:29:34 -04:00
ab14541ef7 memory: Add more exclusion criteria to pinned read (#13067) 2026-03-19 10:03:20 -07:00
fabed694a2 ltx: vae: implement chunked encoder + CPU IO chunking (Big VRAM reductions) (#13062)
* ltx: vae: add cache state to downsample block

* ltx: vae: Add time stride awareness to causal_conv_3d

* ltx: vae: Automate truncation for encoder

Other VAEs just truncate without error. Do the same.

* sd/ltx: Make chunked_io a flag in its own right

Taking this bidirectional, so make it a for-purpose named flag.

* ltx: vae: implement chunked encoder + CPU IO chunking

People are doing things with big frame counts in LTX including V2V
flows. Implement the time-chunked encoder to keep the VRAM down, with
the converse of the new CPU pre-allocation technique, where the chunks
are brought from the CPU JIT.

* ltx: vae-encode: round chunk sizes more strictly

Only powers of 2 and multiples of 8 are valid due to cache slicing.
2026-03-19 09:58:47 -07:00
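The last item of fabed694a2 restricts chunk sizes to powers of 2 and multiples of 8. One possible reading of that rule, sketched below, is to round a requested size down to the nearest multiple of 8 when it is at least 8, and otherwise down to the nearest power of two. The function name `round_chunk_size` and the round-down direction are assumptions for illustration; the commit message does not say how the real encoder rounds.

```python
def round_chunk_size(requested: int) -> int:
    """Round a requested chunk size down to an assumed-valid size.

    Assumed-valid sizes here are 1, 2, 4 and every multiple of 8.
    """
    if requested < 1:
        raise ValueError("chunk size must be at least 1")
    if requested >= 8:
        # Drop the remainder to land on a multiple of 8.
        return requested - requested % 8
    # Below 8, keep only the highest set bit (largest power of two).
    return 1 << (requested.bit_length() - 1)
```

For example, under this reading a request of 27 frames would become 24, and a request of 7 would become 4.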
f6b869d7d3 fp16 intermediates don't work for some text enc models. (#13056) 2026-03-18 19:42:28 -04:00
56ff88f951 Fix regression. (#13053) 2026-03-18 18:35:25 -04:00
9fff091f35 Further Reduce LTX VAE decode peak RAM usage (#13052) 2026-03-18 18:32:26 -04:00
dcd659590f Make more intermediate values follow the intermediate dtype. (#13051) 2026-03-18 18:14:18 -04:00
b941913f1d fix: run text encoders on MPS GPU instead of CPU for Apple Silicon (#12809)
On Apple Silicon, `vram_state` is set to `VRAMState.SHARED` because
CPU and GPU share unified memory. However, `text_encoder_device()`
only checked for `HIGH_VRAM` and `NORMAL_VRAM`, causing all text
encoders to fall back to CPU on MPS devices.

Adding `VRAMState.SHARED` to the condition allows non-quantized text
encoders (e.g. bf16 Gemma 3 12B) to run on the MPS GPU, providing
significant speedup for text encoding and prompt generation.

Note: quantized models (fp4/fp8) that use float8_e4m3fn internally
will still fall back to CPU via the `supports_cast()` check in
`CLIP.__init__()`, since MPS does not support fp8 dtypes.
2026-03-17 21:21:32 -04:00
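The condition change described in b941913f1d (later reverted by 8458ae2686) can be illustrated with a simplified stand-in. This is a sketch only: the `VRAMState` enum below is a reduced copy for illustration, `pick_text_encoder_device` and its `include_shared` flag are hypothetical, and the real `text_encoder_device()` takes more factors into account.

```python
from enum import Enum


class VRAMState(Enum):
    # Reduced stand-in for illustration, not the project's full enum.
    LOW_VRAM = 1
    NORMAL_VRAM = 2
    HIGH_VRAM = 3
    SHARED = 4


def pick_text_encoder_device(vram_state: VRAMState, include_shared: bool) -> str:
    """Return "gpu" or "cpu" for the text encoder (hypothetical helper)."""
    gpu_states = {VRAMState.HIGH_VRAM, VRAMState.NORMAL_VRAM}
    if include_shared:
        # The change described above: unified-memory devices also qualify.
        gpu_states.add(VRAMState.SHARED)
    return "gpu" if vram_state in gpu_states else "cpu"
```

With `include_shared=False`, a `SHARED` state falls through to the CPU, which matches the behaviour the commit message describes for MPS devices before the change.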
cad24ce262 cascade: remove dead weight init code (#13026)
This weight init process is fully shadowed by the weight load and
doesn't work in dynamic_vram, where the weight allocation is deferred.
2026-03-17 20:59:10 -04:00
68d542cc06 Fix case where pixel space VAE could cause issues. (#13030) 2026-03-17 20:46:22 -04:00
735a0465e5 Inplace VAE output processing to reduce peak RAM consumption. (#13028) 2026-03-17 20:20:49 -04:00
035414ede4 Reduce WAN VAE VRAM, Save use cases for OOM/Tiler (#13014)
* wan: vae: encoder: Add feature cache layer that corks singles

If a downsample only gives you a single frame, save it to the feature
cache and return nothing to the top level. This increases the
efficiency of cacheability, but also prepares support for going two
by two rather than four by four on the frames.

* wan: remove all concatenation with the feature cache

The loopers are now responsible for ensuring that non-final frames are
processed at least two-by-two, eliminating the need for this cat case.

* wan: vae: recurse and chunk for 2+2 frames on decode

Avoid having to clone off slices of 4 frame chunks and reduce the size
of the big 6 frame convolutions down to 4. Save the VRAMs.

* wan: encode frames 2x2.

Reduce VRAM usage greatly by encoding frames 2 at a time rather than
4.

* wan: vae: remove cloning

The loopers now control the chunking such that there are never more than 2
frames, so just cache these slices directly and avoid the clone
allocations completely.

* wan: vae: free consumer caller tensors on recursion

* wan: vae: restyle a little to match LTX
2026-03-17 17:34:39 -04:00
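The two-frames-at-a-time chunking described in 035414ede4 can be sketched as a plain generator. This is an illustrative simplification under stated assumptions: `iter_frame_chunks` is a hypothetical name, frames are modelled as list items, and any special handling the real WAN VAE applies to particular frames is not represented.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_frame_chunks(frames: Sequence[T], chunk: int = 2) -> Iterator[List[T]]:
    """Yield consecutive chunks of `chunk` frames; only the last may be shorter."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    for start in range(0, len(frames), chunk):
        yield list(frames[start:start + chunk])
```

Smaller chunks mean each convolution sees fewer frames at once, which is the trade the commit describes: more iterations in exchange for a lower VRAM peak.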
1a157e1f97 Reduce LTX VAE VRAM usage and save use cases from OOMs/Tiler (#13013)
* ltx: vae: scale the chunk size with the user's VRAM

Scale this linearly down for users with low VRAM.

* ltx: vae: free non-chunking recursive intermediates

* ltx: vae: cleanup some intermediates

The conv layer can be the VRAM peak and it does a torch.cat. So clean up
the pieces of the cat. Also clear out the cache ASAP as each layer detects
its end, as this VAE surges in VRAM at the end due to the ended padding
increasing the size of the final frame convolutions off-the-books to
the chunker. So if all the earlier layers free up their cache it can
offset that surge.

It's a fragmentation nightmare, and the chance of it having to recache the
pyt allocator is very high, but you won't OOM.
2026-03-17 17:32:43 -04:00
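The first item of 1a157e1f97 scales the chunk size linearly down for low-VRAM users. A minimal sketch of such a rule follows. The function name, the reference VRAM figure, and the clamping bounds are all assumptions for illustration; the commit message does not give the actual constants or formula.

```python
def scaled_chunk_size(
    free_vram_bytes: int,
    reference_vram_bytes: int,
    max_chunk: int,
    min_chunk: int = 1,
) -> int:
    """Scale max_chunk linearly with available VRAM (hypothetical helper).

    At or above reference_vram_bytes the full max_chunk is used; below it
    the chunk shrinks proportionally, but never under min_chunk.
    """
    if reference_vram_bytes <= 0:
        raise ValueError("reference_vram_bytes must be positive")
    fraction = min(1.0, free_vram_bytes / reference_vram_bytes)
    return max(min_chunk, int(max_chunk * fraction))
```

Under these assumed numbers, a device with half the reference VRAM would get half the chunk size, and a device with more than the reference amount would still be capped at `max_chunk`.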
8cc746a864 fix: disable SageAttention for Hunyuan3D v2.1 DiT (#12772) 2026-03-16 22:27:27 -04:00
ca17fc8355 Fix potential issue. (#13009) 2026-03-16 21:38:40 -04:00
20561aa919 [Trainer] FP4, 8, 16 training by native dtype support and quant linear autograd function (#12681) 2026-03-16 21:31:50 -04:00
7a16e8aa4e Add --enable-dynamic-vram options to force enable it. (#13002) 2026-03-16 16:50:13 -04:00