f4b99bc623
Made multigpu deepclone load the model from disk to avoid needing to deepclone the actual model object; fixed issues with merge; turned off the cuda backend since it causes a device mismatch issue with rope (and potentially other ops), will investigate
2026-02-17 04:55:00 -08:00
6165c38cb5
Optimize nvfp4 lora application. ( #11866 )
...
This changes results a bit but it also speeds up things a lot.
2026-01-14 00:49:38 -05:00
b3c0e4de57
Make loras work on nvfp4 models. ( #11837 )
...
The initial application is a bit slow but will probably be sped up in the
future.
2026-01-12 22:33:54 -05:00
21e8425087
Add warning for old pytorch. ( #11718 )
2026-01-07 21:07:26 -05:00
edee33f55e
Disable comfy kitchen cuda if pytorch cuda version is less than 13 ( #11681 )
2026-01-06 22:13:43 -05:00
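The version gate in the commit above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name is hypothetical, but `torch.version.cuda` is the real attribute reporting the CUDA version pytorch was built against.

```python
import torch

# Hypothetical sketch of the gate: only enable the CUDA backend when
# pytorch was built against CUDA 13 or newer.
def cuda_backend_supported() -> bool:
    if torch.version.cuda is None:  # CPU-only pytorch build
        return False
    major = int(torch.version.cuda.split(".")[0])
    return major >= 13
```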
6da00dd899
Initial ops changes to use comfy_kitchen: Initial nvfp4 checkpoint support. ( #11635 )
...
---------
Co-authored-by: Jedrzej Kosinski <kosinkadink1@gmail.com>
2026-01-05 21:48:58 -05:00
791e30ff50
Fix nan issue when quantizing fp16 tensor. ( #11213 )
2025-12-09 17:03:21 -05:00
43071e3de3
Make old scaled fp8 format use the new mixed quant ops system. ( #11000 )
2025-12-05 14:35:42 -05:00
6484ac89dc
fix QuantizedTensor.is_contiguous ( #10956 ) ( #10959 )
2025-11-28 16:33:07 -05:00
3f382a4f98
quant ops: Dequantize weight in-place ( #10935 )
...
In flux2 these weights are huge (200MB). As plain_tensor is a throw-away
deep copy, do this multiplication in-place to save VRAM.
2025-11-27 08:06:30 -08:00
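The in-place trick described in the commit body above can be sketched like this. The function name is hypothetical; the point is that `mul_` reuses the throw-away copy's storage instead of allocating a second full-size tensor (~200MB for the flux2 weights).

```python
import torch

# Hypothetical sketch: instead of `plain = plain * scale` (which allocates a
# second full-size tensor), multiply the throw-away deep copy in place.
def dequantize_inplace(plain_tensor: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # mul_ writes the result into plain_tensor's own storage, so peak VRAM
    # stays at one copy of the weight instead of two.
    return plain_tensor.mul_(scale)
```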
bdb10a583f
Fix loras not working on mixed fp8. ( #10899 )
2025-11-26 00:07:58 -05:00
015a0599d0
I found a case where this is needed ( #10875 )
2025-11-25 03:23:19 -05:00
b6805429b9
Allow pinning quantized tensors. ( #10873 )
2025-11-25 02:48:20 -05:00
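Why pinning matters for the commit above: page-locked (pinned) host memory lets offloaded weights be copied to the GPU asynchronously with `non_blocking=True`. A minimal sketch, with a hypothetical helper name; `pin_memory()` and `is_pinned()` are the real pytorch APIs.

```python
import torch

# Hypothetical sketch: pin a CPU-resident weight so host-to-device copies
# during offloading can run asynchronously.
def pin_if_possible(t: torch.Tensor) -> torch.Tensor:
    # pin_memory() requires a CUDA build; skip it on CPU-only installs.
    if torch.cuda.is_available() and t.device.type == "cpu" and not t.is_pinned():
        return t.pin_memory()
    return t
```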
25022e0b09
Cleanup and fix issues with text encoder quants. ( #10872 )
2025-11-25 01:48:53 -05:00
3b3ef9a77a
Quantized Ops fixes ( #10715 )
...
* offload support, bug fixes, remove mixins
* add readme
2025-11-12 18:26:52 -05:00
af4b7b5edb
More fp8 torch.compile regressions fixed. ( #10625 )
2025-11-03 22:14:20 -05:00
6b88478f9f
Bring back fp8 torch compile performance to what it should be. ( #10622 )
2025-11-03 19:22:10 -05:00
e199c8cc67
Fixes ( #10621 )
2025-11-03 17:58:24 -05:00
958a17199a
People should update their pytorch versions. ( #10618 )
2025-11-03 17:08:30 -05:00
c58c13b2ba
Fix torch compile regression on fp8 ops. ( #10580 )
2025-11-01 00:25:17 -04:00
906c089957
Fix small performance regression with fp8 fast and scaled fp8. ( #10537 )
2025-10-29 19:29:01 -04:00
1a58087ac2
Reduce memory usage for fp8 scaled op. ( #10531 )
2025-10-29 15:43:51 -04:00
8817f8fc14
Mixed Precision Quantization System ( #10498 )
...
* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint.
* Updated design using Tensor Subclasses
* Fix FP8 MM
* An actually functional POC
* Remove CK reference and ensure correct compute dtype
* Update unit tests
* ruff lint
* Fix missing keys
* Rename quant dtype parameter
* Fix unittests for CPU build
2025-10-28 16:20:53 -04:00
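The registry-plus-tensor-subclass design named in the commit above can be sketched as follows. All names here are illustrative, not the project's actual classes; `torch.Tensor._make_subclass` and `as_subclass` are the real pytorch subclassing hooks.

```python
import torch

# Hypothetical sketch: quant formats register by name, and a minimal tensor
# subclass carries the per-tensor scale needed to dequantize.
QUANT_REGISTRY: dict[str, type] = {}

def register_quant(name: str):
    def decorator(cls):
        QUANT_REGISTRY[name] = cls
        return cls
    return decorator

@register_quant("scaled")
class ScaledTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data: torch.Tensor, scale: torch.Tensor):
        t = torch.Tensor._make_subclass(cls, data)
        t.scale = scale  # quant metadata rides along with the tensor
        return t

    def dequantize(self) -> torch.Tensor:
        # Drop the subclass wrapper, then apply the stored per-tensor scale.
        return self.as_subclass(torch.Tensor).float() * self.scale
```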