Commit Graph

332 Commits

Author SHA1 Message Date
6e4b1f9d00 pythorch_attn_by_def_on_gfx1200 (#11793) 2026-01-10 16:51:05 -05:00
0f11869d55 Better detection if AMD torch compiled with efficient attention. (#11745) 2026-01-08 17:16:58 -05:00
2c03884f5f Skip fp4 matrix mult on devices that don't support it. (#11677) 2026-01-06 18:07:26 -05:00
6da00dd899 Initial ops changes to use comfy_kitchen: Initial nvfp4 checkpoint support. (#11635)
---------

Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
2026-01-05 21:48:58 -05:00
d157c3299d Refactor module_size function. (#11637) 2026-01-05 03:48:31 -05:00
53e762a3af Print memory summary on OOM to help with debugging. (#11613) 2026-01-03 22:28:38 -05:00
0e6221cc79 Add some warnings for pin and unpin errors. (#11561) 2025-12-29 18:26:42 -05:00
9ca7e143af mm: discard async errors from pinning failures (#10738)
Pretty much every error cudaHostRegister can throw also queues the same
error on the async GPU queue. This was fixed for repinning error case,
but there is the bad mmap and just enomem cases that are harder to
detect.

Do some dummy GPU work to clean the error state.
2025-12-29 18:19:34 -05:00
2943093a53 Enable async offload by default for AMD. (#11534) 2025-12-27 18:54:15 -05:00
cc4ddba1b6 Allow enabling use of MIOpen by setting COMFYUI_ENABLE_MIOPEN=1 as an env var (#11366) 2025-12-19 17:01:50 -05:00
50ca97e776 Speed up lora compute and lower memory usage by doing it in fp16. (#11161) 2025-12-06 18:36:20 -05:00
0ff0457892 mm: wrap the raw stream in context manager (#10958)
The documentation of torch.foo.Stream being usable with with: suggests
it starts at version 2.7. Use the old API for backwards compatibility.
2025-11-28 16:38:12 -05:00
f55c98a89f Disable offload stream when torch compile. (#10961) 2025-11-28 16:16:46 -05:00
9d8a817985 Enable async offloading by default on Nvidia. (#10953)
Add --disable-async-offload to disable it.

If this causes OOMs that go away when you --disable-async-offload please
report it.
2025-11-27 17:46:12 -05:00
f17251bec6 Account for the VRAM cost of weight offloading (#10733)
* mm: default to 0 for NUM_STREAMS

Dont count the compute stream as an offload stream. This makes async
offload accounting easier.

* mm: remove 128MB minimum

This is from a previous offloading system requirement. Remove it to
make behaviour of the loader and partial unloader consistent.

* mp: order the module list by offload expense

Calculate an approximate offloading temporary VRAM cost to offload a
weight and primary order the module load list by that. In the simple
case this is just the same as the module weight, but with Loras, a
weight with a lora consumes considerably more VRAM to do the Lora
application on-the-fly.

This will slightly prioritize lora weights, but is really for
proper VRAM offload accounting.

* mp: Account for the VRAM cost of weight offloading

when checking the VRAM headroom, assume that the weight needs to be
offloaded, and only load if it has space for both the load and offload
 * the number of streams.

As the weights are ordered from largest to smallest by offload cost
this is guaranteed to fit in VRAM (tm), as all weights that follow
will be smaller.

Make the partial unload aware of this system as well by saving the
budget for offload VRAM to the model state and accounting accordingly.
Its possible that partial unload increases the size of the largest
offloaded weights, and thus needs to unload a little bit more than
asked to accomodate the bigger temp buffers.

Honor the existing codes floor on model weight loading of 128MB by
having the patcher honor this separately withough regard to offloading.
Otherwise when MM specifies its 128MB minimum, MP will see the biggest
weights, and budget that 128MB to only offload buffer and load nothing
which isnt the intent of these minimums. The same clamp applies in
case of partial offload of the currently loading model.
2025-11-27 01:03:03 -05:00
b6805429b9 Allow pinning quantized tensors. (#10873) 2025-11-25 02:48:20 -05:00
18e7d6dba5 mm/mp: always unload re-used but modified models (#10724)
The partial unloader path in model re-use flow skips straight to the
actual unload without any check of the patching UUID. This means that
if you do an upscale flow with a model patch on an existing model, it
will not apply your patchings.

Fix by delaying the partial_unload until after the uuid checks. This
is done by making partial_unload a model of partial_load where extra_mem
is -ve.
2025-11-12 16:19:53 -05:00
1199411747 Don't pin tensor if not a torch.nn.parameter.Parameter (#10718) 2025-11-11 19:33:30 -05:00
dea899f221 Unload weights if vram usage goes up between runs. (#10690) 2025-11-09 18:51:33 -05:00
a1a70362ca Only unpin tensor if it was pinned by ComfyUI (#10677) 2025-11-07 11:15:05 -05:00
cf97b033ee mm: guard against double pin and unpin explicitly (#10672)
As commented, if you let cuda be the one to detect double pin/unpinning
it actually creates an asyc GPU error.
2025-11-06 21:20:48 -05:00
09dc24c8a9 Pinned mem also seems to work on AMD. (#10658) 2025-11-05 19:11:15 -05:00
1d69245981 Enable pinned memory by default on Nvidia. (#10656)
Removed the --fast pinned_memory flag.

You can use --disable-pinned-memory to disable it. Please report if it
causes any issues.
2025-11-05 18:08:13 -05:00
7f3e4d486c Limit amount of pinned memory on windows to prevent issues. (#10638) 2025-11-04 17:37:50 -05:00
ab7ab5be23 Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is nessacary as pytorch will queue cuda async frees on the
same stream as created to tensor. In the case of async offload, this
will be on the offload stream.

Weights and biases can go out of scope in python which then
triggers the pytorch garbage collector to queue the free operation on
the offload stream possible before the compute stream has used the
weight. This causes a use after free on weight data leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight
has been used so the free has to wait for the weight to be used.

The cast_bias_weight is extended in a backwards compatible way with
the new behaviour opt-in on a defaulted parameter. This handles
custom node packs calling cast_bias_weight and defeatures
async-offload for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(... , offloadable=True) #This might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is nessacary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesnt allow a lot of
parallelization, as then end result is you can only get one weight load
ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute streams position.
2025-10-29 17:17:46 -04:00
3fa7a5c04a Speed up offloading using pinned memory. (#10526)
To enable this feature use: --fast pinned_memory
2025-10-29 00:21:01 -04:00
098a352f13 Add warning for torch-directml usage (#10482)
Added a warning message about the state of torch-directml.
2025-10-25 20:05:22 -04:00
426cde37f1 Remove useless function (#10472) 2025-10-24 19:56:51 -04:00
9cdc64998f Only disable cudnn on newer AMD GPUs. (#10437) 2025-10-21 19:15:23 -04:00
2c2aa409b0 Log message for cudnn disable on AMD. (#10418) 2025-10-20 15:43:24 -04:00
5b80addafd Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
1c10b33f9b gfx942 doesn't support fp8 operations. (#10348) 2025-10-15 00:21:11 -04:00
c8674bc6e9 Enable RDNA4 pytorch attention on ROCm 7.0 and up. (#10332) 2025-10-13 21:19:03 -04:00
a125cd84b0 Improve AMD performance. (#10302)
I honestly have no idea why this improves things but it does.
2025-10-12 00:28:01 -04:00
c8d2117f02 Fix memory leak by properly detaching model finalizer (#9979)
When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
linear memory consumption increase over time as models are repeatedly
loaded and unloaded.

This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.
2025-09-24 22:35:12 -04:00
8d6653fca6 Enable fp8 ops by default on gfx1200 (#9926) 2025-09-18 19:50:37 -04:00
fb763d4333 Fix amd_min_version crash when cpu device. (#9754) 2025-09-07 21:16:29 -04:00
bcbd7884e3 Don't enable pytorch attention on AMD if triton isn't available. (#9747) 2025-09-07 00:29:38 -04:00
27a0fcccc3 Enable bf16 VAE on RDNA4. (#9746) 2025-09-06 23:25:22 -04:00
0963493a9c Support for Qwen Diffsynth Controlnets canny and depth. (#9465)
These are not real controlnets but actually a patch on the model so they
will be treated as such.

Put them in the models/model_patches/ folder.

Use the new ModelPatchLoader and QwenImageDiffsynthControlnet nodes.
2025-08-20 22:26:37 -04:00
c991a5da65 Fix XPU iGPU regressions (#9322)
* Change bf16 check and switch non-blocking to off default with option to force to regain speed on certain classes of iGPUs and refactor xpu check.

* Turn non_blocking off by default for xpu.

* Update README.md for Intel GPUs.
2025-08-13 19:13:35 -04:00
5828607ccf Not sure if AMD actually support fp16 acc but it doesn't crash. (#9258) 2025-08-09 12:49:25 -04:00
735bb4bdb1 Users report gfx1201 is buggy on flux with pytorch attention. (#9244) 2025-08-08 04:21:00 -04:00
7d593baf91 Extra reserved vram on large cards on windows. (#9093) 2025-07-29 04:07:45 -04:00
69cb57b342 Print xpu device name. (#9035) 2025-07-24 15:06:25 -04:00
0ccc88b03f Support Iluvatar CoreX (#8585)
* Support Iluvatar CoreX
Co-authored-by: mingjiang.li <mingjiang.li@iluvatar.com>
2025-07-24 13:57:36 -04:00
d3504e1778 Enable pytorch attention by default for gfx1201 on torch 2.8 (#9029) 2025-07-23 19:21:29 -04:00
a86a58c308 Fix xpu function not implemented p2. (#9027) 2025-07-23 18:18:20 -04:00
39dda1d40d Fix xpu function not implemented. (#9026) 2025-07-23 18:10:59 -04:00
5ad33787de Add default device argument. (#9023) 2025-07-23 14:20:49 -04:00