Commit Graph

2227 Commits

Author SHA1 Message Date
85a403d1ea Disable sage attention in stable audio dit and VAE. (#14148) 2026-05-27 20:35:03 -04:00
987a937658 Support context window for PiD and fix lq_latent rounding (#14136) 2026-05-27 12:08:06 -07:00
e75a92c1b6 Add memory usage factor for lens model. (#14124) 2026-05-26 18:06:51 -07:00
d8d860a588 Closer memory usage factors for PID (#14123) 2026-05-26 18:04:55 -07:00
28f4ef277c feat: Support NVIDIA PixelDiT and PiD (CORE-201) (#14103) 2026-05-26 17:50:14 -07:00
f9f54cae42 Lens: some cleanup (#14112)
* Lens: remove redundant memory optimization
2026-05-26 10:32:53 +03:00
41812fa0ac feat: Microsoft Lens support (CORE-248) (#14077) 2026-05-25 23:01:51 -07:00
57414dadfe fix: cross-attention AdaLN scale, shift, sigma parameters calculation (#14097) 2026-05-25 20:07:09 -07:00
da49b7d0b6 Remove useless annotations imports. (#14105) 2026-05-25 19:23:29 -07:00
0a2dd86e78 MultiGPU Work Units For Accelerated Sampling (CORE-184) (#7063) 2026-05-25 18:26:40 -07:00
b30e980a20 cache-ram: lower thresholds (#14089)
Use the RAM right up to the wire, as the community is a bit accustomed to.

This trades off headroom for the case where large chunky intermediates
arrive and potentially hit pagefile/swap, but a lot of people have
"it just fits" workflows out there, so strike a compromise with
75->90%.

Disable the inactive cache for all but the very high RAM users.
2026-05-24 15:26:50 -07:00
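The headroom trade-off in the 75->90% change above can be put in numbers. This is only a minimal sketch: `headroom_bytes` is a hypothetical helper written for illustration, not a function from the repository, and it assumes the threshold is a simple percentage of total system RAM.

```python
def headroom_bytes(total_ram_bytes: int, threshold_percent: int) -> int:
    """Bytes left free once the cache is allowed to fill RAM up to the threshold.

    Hypothetical helper: integer math is used so the result is exact.
    """
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    return total_ram_bytes * (100 - threshold_percent) // 100


GIB = 1024 ** 3
old_headroom = headroom_bytes(64 * GIB, 75)  # a quarter of RAM is left free
new_headroom = headroom_bytes(64 * GIB, 90)  # a tenth of RAM is left free
```

On an assumed 64 GiB machine, the higher threshold leaves less room for a large intermediate to arrive before the system starts paging, which is the compromise the commit message describes.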
39f963b4b0 mark loads to pins as cold immediately (#14088)
This does the posix_fadvise to kick pins out of the disk cache (to
avoid a double copy in RAM).
2026-05-24 15:25:59 -07:00
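The `posix_fadvise` hint mentioned above can be sketched with Python's standard library. This is an illustration under assumptions, not the repository's code: `read_then_drop_cache` is a hypothetical name, and the sketch assumes the goal is to tell the kernel that the file's cached pages are no longer needed once the data has been copied elsewhere (for example into pinned memory).

```python
import os


def read_then_drop_cache(path: str) -> bytes:
    """Read a whole file, then advise the kernel to drop its page-cache copy.

    Hypothetical helper. os.posix_fadvise is only available on Unix, so the
    hint is skipped on platforms that lack it.
    """
    with open(path, "rb") as f:
        data = f.read()
        if hasattr(os, "posix_fadvise"):
            # offset=0 with length=0 covers the file from start to end.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return data
```

The call is advisory: the kernel may keep the pages anyway, so it reduces the chance of holding two copies in RAM rather than guaranteeing it.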
08d809d128 Fix --use-flash-attention ignored when xformers installed. (#14083) 2026-05-23 17:44:28 -07:00
d80fcafee7 Remove dead code. (#14072) 2026-05-22 19:56:36 -07:00
03e511862e Fix reshaping lora application (#14031)
* ModelPatcherDynamic: purge stale vbar allocs on force cast

* ModelPatcherDynamic: restore backups before load

If doing a clean reload, mutative changes (lora application) could be
applied on top of the already loaded weight. Restore from backup
unconditionally so that the new load is clean.
2026-05-21 09:47:16 -07:00
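The "restore backups before load" fix above can be illustrated with a toy model. `TinyPatcher` is a hypothetical stand-in, not the project's `ModelPatcherDynamic`, and it assumes a patch is a simple additive delta on a scalar weight.

```python
class TinyPatcher:
    """Hypothetical stand-in for a model patcher that applies additive patches."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.backup = {}   # original values of weights that are currently patched
        self.patches = {}  # key -> additive delta

    def add_patch(self, key, delta):
        self.patches[key] = delta

    def restore_backups(self):
        # Put every patched weight back to its original value.
        for key, original in self.backup.items():
            self.weights[key] = original
        self.backup.clear()

    def load(self):
        # Restore unconditionally so a reload never stacks a patch
        # on top of an already patched weight.
        self.restore_backups()
        for key, delta in self.patches.items():
            self.backup[key] = self.weights[key]
            self.weights[key] += delta


patcher = TinyPatcher({"w": 1.0})
patcher.add_patch("w", 0.5)
patcher.load()
patcher.load()  # a second load still yields the original weight plus one delta
```

Without the unconditional restore in `load`, the second call would add the delta again.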
aab41a9ddb fix(lanczos): correct dimension transposition for single-channel tensors (#12679) 2026-05-21 23:47:20 +08:00
5aa5ccc9e0 Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) (#13802)
* model_management: disable non-dynamic smart memory

Disable smart memory outright for non dynamic models.

This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.

This is needed for estimate-free model development, where new models
can opt out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers.

This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.

* pinned_memory: implement with aimdo growable buffer

Use a single growable buffer so we can do threaded pre-warming on
pinned memory.

* mm: use aimdo to do transfer from disk to pin

Aimdo implements a faster threaded loader.

* Add stream host pin buffer for AIMDO casts

Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.

Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).

* remove old pin path

* Implement JIT pinned memory pressure

Replace the predictive pin pressure mechanism with JIT PIN memory
pressure.

* LowVRAMPatch: change to two-phase visit

* lora: re-implement as inplace swiss-army-knife operation

* prepare for multiple pin sets

* implement pinned loras

* requirements: comfy-aimdo 0.4.0

* ops: remove unused arg

This was defeatured in aimdo iteration

* ops: sync the CPU with only the offload stream activity

This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.

* pins: implement freeing intermediate for pinned memory

Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.

* execution: implement pin eviction on RAM pressure

Add back proper pin freeing on RAM pressure.

* implement pin registration swaps

Uncap the Windows pins from 50% by extending the pool and have a pressure
mechanism to move the pin reservations on demand.

This unfortunately implies a GPU sync to do the freeing so significant
hysteresis needs to be added to consolidate these pressure events.

* cli_args/execution: Implement lower background cache-ram threshold

Limit the amount of RAM background intermediates can use, so that
switching workflows doesn't degrade performance too much.

* make default

* bump aimdo

* model-patcher: force-cast tiny weights

Flux 2 gets crazy stalls due to a mix of tiny and giant weights
creating lopsided stream buffer rotations.

* ops: refactor in prep for chunking

* mm: delegate pin-on-the-way to aimdo

Aimdo is able to chunk and slice this on the way for better CPU->GPU
overlap. The main advantage is the ability to shorten the bus contention
window between previous weight transfer and the next weights vbar
fault.

* bump aimdo

* pinning updates

* specify hostbuf max allocation size

There are signs of virtual memory exhaustion on some Linux systems when
throwing 128GB for every little piece. Pass the actual size to save aimdo
from over-estimates.

* tests: update execution tests for caching

The default caching changed to ram-cache so update these tests
accordingly.

Remove the LRU 0 test as this also falls through to RAM cache.
2026-05-20 17:03:58 -07:00
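The "pin registration swaps" item in the commit above notes that each free implies a GPU sync, so hysteresis is needed to consolidate pressure events. A minimal sketch of that idea follows. `simulate_pressure` and its high/low watermarks are hypothetical and are not taken from the repository or from aimdo.

```python
def simulate_pressure(allocations, high, low):
    """Count expensive free events under a high/low watermark policy.

    Hypothetical model: when usage rises above `high`, free down to `low`
    in a single event. With low == high there is no hysteresis.
    """
    usage = 0
    free_events = 0
    for size in allocations:
        usage += size
        if usage > high:
            usage = low  # one consolidated free (one sync) instead of many
            free_events += 1
    return usage, free_events


allocations = [10] * 20
no_hysteresis = simulate_pressure(allocations, high=100, low=100)
with_hysteresis = simulate_pressure(allocations, high=100, low=50)
```

In this toy run, freeing well below the trigger point cuts the number of free events from ten to two, at the cost of evicting more than strictly necessary each time.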
f9c84c94b4 Support Stable Audio 3 model. (#14010) 2026-05-20 11:34:22 -04:00
78b5dec6b6 fix: Hunyuan3D 2.1 batch size crashes in attention and forward pass (#13699) 2026-05-20 19:58:49 +08:00
626b082838 Fix typo in ops.py (#11925) 2026-05-20 05:45:04 +08:00
a4382e056e Use temporal downscale to make empty audio latent nodes more reusable. (#13975) 2026-05-19 00:14:30 -04:00
990a7ae7f2 Initial work to make downscale_ratio_temporal work. (#13972) 2026-05-18 23:01:43 -04:00
187e5237e1 Fix BiRefNet issue (#13966) 2026-05-19 05:03:22 +08:00
16f862f02a implement dynamic clip saving (#13959)
Fix clip saving by doing the same patching process as diffusion
models.
2026-05-18 11:46:40 -07:00
971c9e3518 HiDream-O1: support area conditioning (#13944) 2026-05-18 01:17:05 -04:00
b39af210d0 Fix Qwen3.5 text generation with multiple input images (#13943) 2026-05-18 01:16:42 -04:00
f48d2a017e Log which quant ops are enabled/emulated. (#13946) 2026-05-17 16:30:54 -04:00
d3607a8e6d feat: Add downscaled IC-LoRA support to LTXVAddGuide (CORE-102) (#13896) 2026-05-16 15:02:57 +08:00
5d5a4554e1 Remove useless option and clarify what lowvram does. (#13922) 2026-05-15 17:59:02 -07:00
33ce449c8b Reduce LTX2.3 peak VRAM when guide_mask is in use (CORE-166) (#13735)
- Reduce peak VRAM by handling self_attn_mask more efficiently
- Fallback to SDPA when self_attention_mask is used
2026-05-16 00:02:27 +03:00
77e2ed5e01 feat: Support MoGe (CORE-168) (#13878) 2026-05-15 10:34:56 +08:00
74c17a25e5 Fix void failing with RuntimeError: start (0) + length (464) exceeds dimension size (461). (#13873) 2026-05-13 12:37:30 -07:00
2bd65f2091 Better Hidream O1 mem usage factor for non dynamic vram. (#13864) 2026-05-12 20:55:38 -07:00
0155ddcbe3 Fix dtype issue with hidream o1 (#13849) 2026-05-11 20:53:13 -07:00
8e53f001a4 feat: Support HiDream-O1-Image (CORE-187) (#13817)
* Initial HiDream01-image support

* Cleanup nodes

* Cleaner handling of empty placeholder models

* Remove snap_to_predefined, prefer tooltip for the trained resolutions

* Add model and block wrappers

* Fix shift tooltip

* Add node to work around the patch tile issue

Experimental, runs multiple passes with the patch grid offset and blends with various different methods.

* Qwen35 vision rotary_pos_emb cast fix

* Fix embedding layout type

* Some small optimizations

* Cleanup, don't need this fallback

* Prefix KV cache, cleanup

Bit of speed, reduce redundant code

* Get rid of redundant custom sampler, refactor noise scaling

Our existing lcm sampler is mathematically the same, so just added the missing options to it instead, plus a node to control them. Refactored the noise scaling and fixed it for the stochastic samplers; added a generic node to control the initial noise scale.

* Update nodes_hidream_o1.py

* Fix some cache validation cases

* Keep existing sampling params

* Remove redundant video vision path

* Replace some numpy ops with torch

* Fix RoPE index for batch size > 1

* Prefer torch preprocessing

* Rename block_type to be compatible with existing patch nodes

* Fixes and tweaks
2026-05-11 20:35:53 -07:00
0a7d2ffd68 Support anima TE lora kohya format. (#13847) 2026-05-11 20:01:52 -07:00
20e439419c model_patcher: Fix safetensors saving of fp8 (#13835)
This was missing proper weight scale casting in the saving path.
2026-05-11 12:48:10 -07:00
f505cb4070 chore: remove extra word in comment (#13826) 2026-05-11 11:05:09 +08:00
3200f28e3a Support Wan-Dancer (#13813)
* initial WanDancer support

* nodes_wandancer: Add list form of chunker.

Create an alternate list form of the node so the chunk gens can be
trivially looped by the comfy executor.

* Closer match to original soxr resampling

* Remove librosa node

* Cleanup

---------

Co-authored-by: Rattus <rattus128@gmail.com>
2026-05-09 14:02:56 -07:00
66669b2ded I don't think there was any because nobody complained. (#13807) 2026-05-08 17:32:14 -07:00
c5ecd231a2 fix: Fix bug when mask not on same device (CORE-181) (#13801) 2026-05-08 23:06:29 +08:00
d3c18c1636 Add support for BiRefNet background remove model (CORE-46) (#12747) 2026-05-08 17:59:24 +08:00
bac6fc35fb Fix typos (#10986) 2026-05-08 17:14:45 +08:00
ef8f25601a Add I2V for causal forcing model. (#13719) 2026-05-07 18:38:36 -07:00
8dc3f3f209 Improve SAM3 large input handling (#13767) 2026-05-07 17:18:28 -07:00
cd8c7a2306 Throttle dynamic VRAM prepare logging (#13704) 2026-05-07 10:41:13 +08:00
78b3096bf3 Void model - pass 1 & 2 (CORE-38) (#13403) 2026-05-05 19:59:04 -07:00
e5369c0eec feat: Context windows - add causal_window_fix to improve blending of context windows (CORE-100) (#13563)
* Context windows: add causal_window_fix toggle

* Fix slice_cond to correctly handle causal anchor index for temporal offsets
2026-05-05 16:40:53 -07:00
1655f8089a Add temporal_downscale_ratio to LatentFormat (#13702)
Co-authored-by: ozbayb <17261091+ozbayb@users.noreply.github.com>
Co-authored-by: Alexis Rolland <alexisrolland@hotmail.com>
Co-authored-by: Jukka Seppänen <40791699+kijai@users.noreply.github.com>
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
2026-05-05 16:30:00 -07:00
fed8d5efa6 feat: Auto-regressive video generation (CORE-25) (#13082) 2026-05-04 21:01:22 -07:00