Commit Graph

77 Commits

Author SHA1 Message Date
3b3ef9a77a Quantized Ops fixes (#10715)
* offload support, bug fixes, remove mixins

* add readme
2025-11-12 18:26:52 -05:00
c350009236 ops: Put weight cast on the offload stream (#10697)
This needs to be on the offload stream. The bug reproduced as a black screen
with low-resolution images on a slow bus when using FP8.
2025-11-09 22:52:11 -05:00
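
In stream terms the fix looks roughly like the sketch below: both the host-to-device copy and the dtype cast are issued on the offload stream, and the compute stream waits for both. Names are illustrative (this is not the actual comfy.ops code) and a CUDA device is assumed.

    import torch

    compute_stream = torch.cuda.current_stream()
    offload_stream = torch.cuda.Stream()

    def cast_on_offload_stream(weight_cpu, dtype=torch.float16):
        with torch.cuda.stream(offload_stream):
            w = weight_cpu.to("cuda", non_blocking=True)  # host-to-device transfer
            w = w.to(dtype)                               # cast also on the offload stream
        compute_stream.wait_stream(offload_stream)        # compute sees a fully cast weight
        return w
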
0f4ef3afa0 This seems to slow things down slightly on Linux. (#10624) 2025-11-03 21:47:14 -05:00
0652cb8e2d Speed up torch.compile (#10620) 2025-11-03 17:37:12 -05:00
135fa49ec2 Small speed improvements to --async-offload (#10593)
* ops: don't take an offload stream if you don't need one

* ops: prioritize mem transfer

The async offload stream's reason for existence is to transfer from
RAM to GPU. The post-processing compute steps are a bonus on the side
stream, but if the compute stream is running a long kernel, it can
stall the side stream as it waits to type-cast the bias before
transferring the weight. So do a pure transfer of the weight straight up,
then do everything bias, then go back to fix the weight type and apply
weight patches.
2025-11-01 18:48:53 -04:00
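
The reordering described above, reduced to a sketch (assumed names, not the real cast_bias_weight internals):

    import torch

    def load_params(weight_cpu, bias_cpu, dtype, offload_stream):
        with torch.cuda.stream(offload_stream):
            w = weight_cpu.to("cuda", non_blocking=True)  # 1. pure transfer of the weight
            b = bias_cpu.to("cuda", non_blocking=True)    # 2. everything bias:
            b = b.to(dtype)                               #    transfer and type-cast
            w = w.to(dtype)                               # 3. fix the weight type last
        return w, b
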
c58c13b2ba Fix torch compile regression on fp8 ops. (#10580) 2025-11-01 00:25:17 -04:00
906c089957 Fix small performance regression with fp8 fast and scaled fp8. (#10537) 2025-10-29 19:29:01 -04:00
ab7ab5be23 Fix Race condition in --async-offload that can cause corruption (#10501)
* mm: factor out the current stream getter

Make this a reusable function.

* ops: sync the offload stream with the consumption of w&b

This sync is necessary because PyTorch queues CUDA async frees on the
same stream that created the tensor. In the case of async offload, this
will be the offload stream.

Weights and biases can go out of scope in Python, which then
triggers the PyTorch garbage collector to queue the free operation on
the offload stream, possibly before the compute stream has used the
weight. This causes a use-after-free on weight data, leading to total
corruption of some workflows.

So sync the offload stream with the compute stream after the weight
has been used, forcing the free to wait until the weight has been consumed.

cast_bias_weight is extended in a backwards-compatible way, with the
new behaviour opt-in via a defaulted parameter. This handles
custom node packs calling cast_bias_weight by disabling
async-offload for them (as they do not handle the race).

The pattern is now:

cast_bias_weight(..., offloadable=True)  # this might be offloaded
thing(weight, bias, ...)
uncast_bias_weight(...)

* controlnet: adopt new cast_bias_weight synchronization scheme

This is necessary for safe async weight offloading.

* mm: sync the last stream in the queue, not the next

Currently this peeks ahead to sync the next stream in the queue of
streams with the compute stream. This doesn't allow much
parallelization, as the end result is that you can only get one weight
load ahead regardless of how many streams you have.

Rotate the loop logic here to synchronize the end of the queue before
returning the next stream. This allows weights to be loaded ahead of the
compute stream's position.
2025-10-29 17:17:46 -04:00
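
The resulting pattern, reduced to its essentials (an illustrative sketch; only the cast_bias_weight/uncast_bias_weight names come from the message above):

    import torch

    compute = torch.cuda.current_stream()
    offload = torch.cuda.Stream()

    def linear_with_offloaded_params(x, weight_cpu, bias_cpu):
        with torch.cuda.stream(offload):                  # the cast_bias_weight half
            w = weight_cpu.to("cuda", non_blocking=True)
            b = bias_cpu.to("cuda", non_blocking=True)
        compute.wait_stream(offload)                      # params ready before use
        out = torch.nn.functional.linear(x, w, b)
        offload.wait_stream(compute)                      # the uncast_bias_weight half: an async
        return out                                        # free queued on offload now waits
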
8817f8fc14 Mixed Precision Quantization System (#10498)
* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint.

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Fix missing keys

* Rename quant dtype parameter

* Fix unit tests for CPU build
2025-10-28 16:20:53 -04:00
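
A hedged sketch of what a registry keyed by quant dtype could look like (the names QUANT_OPS, register_quant_op, and the dtype string are assumptions, not the shipped API):

    QUANT_OPS = {}

    def register_quant_op(quant_dtype):
        def deco(cls):
            QUANT_OPS[quant_dtype] = cls
            return cls
        return deco

    @register_quant_op("fp8_e4m3fn_scaled")
    class ScaledFP8Linear:
        """Placeholder; a real entry would implement the quantized matmul."""

    def op_for(quant_metadata):
        # Checkpoint metadata names the quant spec; the registry resolves the op.
        return QUANT_OPS[quant_metadata["quant_dtype"]]
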
b4f30bd408 Pytorch is stupid. (#10398) 2025-10-19 01:25:35 -04:00
5b80addafd Turn off cuda malloc by default when --fast autotune is turned on. (#10393) 2025-10-18 22:35:46 -04:00
9da397ea2f Disable torch compiler for cast_bias_weight function (#10384)
* Disable torch compiler for cast_bias_weight function

* Fix torch compile.
2025-10-17 20:03:28 -04:00
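
torch.compiler.disable is the standard way to exclude one function from tracing; a minimal sketch (the body is a placeholder, not the real cast_bias_weight):

    import torch

    @torch.compiler.disable
    def cast_bias_weight(layer, dtype=torch.float16):
        # torch.compile calls this eagerly instead of tracing it.
        w = layer.weight.to(dtype)
        b = layer.bias.to(dtype) if layer.bias is not None else None
        return w, b
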
b1293d50ef workaround also works on cudnn 91200 (#10375) 2025-10-16 19:59:56 -04:00
19b466160c Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (#10373) 2025-10-16 18:16:03 -04:00
3374e900d0 Faster workflow cancelling. (#10301) 2025-10-13 23:43:53 -04:00
139addd53c More surgical fix for #10267 (#10276) 2025-10-09 16:37:35 -04:00
7be2b49b6b Fix LoRA Trainer bugs with FP8 models. (#9854)
* Fix adapter weight init

* Fix fp8 model training

* Avoid inference tensor
2025-09-20 21:24:48 -04:00
e2d1e5dad9 Enable Convolution AutoTuning (#9301) 2025-09-01 20:33:50 -04:00
4e5c230f6a Fix last commit not working on older pytorch. (#9346) 2025-08-14 23:44:02 -04:00
f0d5d0111f Avoid torch compile graphbreak for older pytorch versions (#9344)
Turns out torch.compile has some gaps in context-manager-as-decorator
syntax support. I've sent patches to fix that in PyTorch, but they won't
be available to folks running older versions of PyTorch, hence
this trivial patch.
2025-08-14 23:41:37 -04:00
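
The workaround amounts to trading decorator syntax for an explicit with block; torch.no_grad below is a stand-in, since the message does not name the manager involved:

    import torch

    @torch.no_grad()          # decorator form: can graph-break on older torch.compile
    def f_decorated(x):
        return x * 2

    def f_explicit(x):
        with torch.no_grad():  # equivalent explicit form: compiles cleanly
            return x * 2
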
9df8792d4b Make last PR not crash comfy on old pytorch. (#9324) 2025-08-13 15:12:41 -04:00
3da5a07510 SDPA backend priority (#9299) 2025-08-13 14:53:27 -04:00
111f583e00 Fallback to regular op when fp8 op throws exception. (#8761) 2025-07-02 00:57:13 -04:00
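
The fallback pattern, sketched with illustrative names:

    import torch

    def forward(x, w, b, fp8_linear=None):
        if fp8_linear is not None:
            try:
                return fp8_linear(x, w, b)  # fast path
            except Exception:
                pass                        # e.g. unsupported shape or hardware
        return torch.nn.functional.linear(x, w.to(x.dtype), b)  # regular op
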
d42613686f Fix issue with fp8 ops on some models. (#8045)
_scaled_mm errors when an input is non-contiguous.
2025-05-10 07:52:56 -04:00
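
A sketch of the guard (keyword scale arguments as in recent PyTorch; the exact _scaled_mm signature has shifted across versions, and older ones returned an (out, amax) tuple):

    import torch

    def fp8_linear(x, w_fp8, scale_x, scale_w):
        a = x.reshape(-1, x.shape[-1]).contiguous()  # _scaled_mm rejects non-contiguous input
        b = w_fp8.t()  # transposed view of a contiguous [out, in] weight, i.e. the
                       # column-major layout _scaled_mm expects for its second operand
        return torch._scaled_mm(a, b, scale_a=scale_x, scale_b=scale_w,
                                out_dtype=torch.bfloat16)
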
ac10a0d69e Make loras work with --async-offload (#7824) 2025-04-26 19:56:22 -04:00
0dcc75ca54 Add experimental --async-offload lowvram weight offloading. (#7820)
This should speed up lowvram mode a bit. It is currently only enabled when --async-offload is used, but it will be enabled by default in the future if there are no problems.
2025-04-26 16:11:21 -04:00
9ad792f927 Basic support for hidream i1 model. 2025-04-15 17:35:05 -04:00
8a438115fb add RMSNorm to comfy.ops 2025-04-14 18:00:33 -04:00
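
RMSNorm as commonly defined, normalizing by the root mean square over the last dimension (a sketch that may differ in detail from the comfy.ops version):

    import torch

    def rms_norm(x, weight, eps=1e-6):
        inv_rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
        return (x.float() * inv_rms).to(x.dtype) * weight
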
1714a4c158 Add CublasOps support (#7574)
* CublasOps support

* Guard CublasOps behind --fast arg
2025-04-12 18:29:15 -04:00
70e15fd743 No need for scale_input when fp8 matrix mult is disabled. 2025-03-07 04:49:20 -05:00
e1474150de Support fp8_scaled diffusion models that don't use fp8 matrix mult. 2025-03-07 04:39:21 -05:00
4dc6709307 Rename argument in last commit and document the options. 2025-03-01 02:43:49 -05:00
4d55f16ae8 Use enum list for --fast options (#7024) 2025-03-01 02:37:35 -05:00
cf0b549d48 --fast now takes a number as an argument to indicate how fast you want it.
The idea is that you can indicate how much quality vs speed you want.

At the moment:

--fast 2 enables fp16 accumulation if your pytorch supports it.
--fast 5 enables fp8 matrix mult on fp8 models and the optimization above.

--fast without a number enables all optimizations.
2025-02-28 02:48:20 -05:00
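
One way to express an optional-valued --fast in argparse (a sketch, not the project's actual cli_args; const=10 is an arbitrary stand-in for "everything"):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--fast", nargs="?", type=int, const=10, default=0,
                        help="quality vs speed; bare --fast enables all optimizations")

    assert parser.parse_args([]).fast == 0               # flag absent
    assert parser.parse_args(["--fast"]).fast == 10      # bare flag
    assert parser.parse_args(["--fast", "5"]).fast == 5  # explicit level
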
ab888e1e0b Add add_weight_wrapper function to model patcher.
Functions can now easily be added to wrap/modify model weights.
2025-02-12 05:55:35 -05:00
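
The shape of a weight wrapper is just a callable from weight tensor to weight tensor; the registration call below is hypothetical, as the message does not show add_weight_wrapper's signature:

    def double_weight(weight):
        return weight * 2.0

    # Hypothetical registration; the real signature may differ:
    # model_patcher.add_weight_wrapper("diffusion_model.some.weight", double_weight)
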
99a1fb6027 Make fast fp8 take a bit less peak memory. 2024-12-24 18:05:19 -05:00
fbf68c4e52 clamp input (#5928) 2024-12-07 14:00:31 -05:00
915fdb5745 Fix lowvram edge case. 2024-10-22 16:34:50 -04:00
8ce2a1052c Optimizations to --fast and scaled fp8. 2024-10-22 02:12:28 -04:00
0075c6d096 Mixed precision diffusion models with scaled fp8.
This change adds support for diffusion models where all the linears are
scaled fp8 while the other weights stay in their original precision.
2024-10-21 18:12:51 -04:00
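
The core idea of scaled fp8, simplified to per-tensor dequantization on use (real code may scale per-row or fuse the scale into the matmul):

    import torch

    def dequantize_scaled_fp8(w_fp8, scale, dtype=torch.float16):
        # fp8 weight plus its scale reconstructs the higher-precision weight
        return w_fp8.to(dtype) * scale.to(dtype)
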
83ca891118 Support scaled fp8 t5xxl model. 2024-10-20 22:27:00 -04:00
f9f9faface Fixed model merging issue with scaled fp8. 2024-10-20 06:24:31 -04:00
a68bbafddb Support diffusion models with scaled fp8 weights. 2024-10-19 23:47:42 -04:00
67158994a4 Use the lowvram cast_to function for everything. 2024-10-17 17:25:56 -04:00
e38c94228b Add a weight_dtype fp8_e4m3fn_fast to the Diffusion Model Loader node.
This is used to load weights in fp8 and use fp8 matrix multiplication.
2024-10-09 19:43:17 -04:00
9c41bc8d10 Remove useless line. 2024-09-23 02:32:29 -04:00
dc96a1ae19 Load controlnet in fp8 if weights are in fp8. 2024-09-21 04:50:12 -04:00
8ae23d8e80 Fix onnx export. 2024-08-23 17:52:47 -04:00
c7ee4b37a1 Try to fix some lora issues. 2024-08-22 15:32:18 -04:00
904bf58e7d Make --fast work on pytorch nightly. 2024-08-21 14:01:41 -04:00