fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
331a1f5b3f
cutlass 3.9 update ( #2255 )
...
* cutlass 3.9 update
* rebase
* fixes out of shared memory for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-24 15:42:40 -04:00
6f4921858b
v3.9 update ( #2203 )
...
* v3.9 update
* voidD
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-04-02 15:11:18 -04:00
62750a2b75
v3.9 ( #2185 )
...
* v3.8 update x
* fix blackwell gg
* doc change
* doc change
* doc change
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2025-03-21 01:52:23 -04:00
9b3772dfa6
Hopper Grouped GEMM support for FP8 Accum ( #2123 )
...
* Add support for fp8accum, with profiler extension
* Update .gitignore
* contri
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-02-20 21:55:26 -05:00
b84e9802d8
update 3.8 v2 ( #2112 )
...
* update 3.8 v2
* update 3.8
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-02-19 22:03:14 -05:00
833f6990e0
v3.8.0 update ( #2082 )
...
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-02-06 21:33:40 -05:00
389e493055
CUTLASS 3.8 Release ( #2059 )
...
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36fe8 .
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-25 02:44:06 -05:00
b78588d163
CUTLASS 3.7 ( #2045 )
...
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-18 09:53:07 -05:00
3d261a5974
3.6.0 update ( #2005 )
...
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2024-12-25 01:34:40 -05:00
cc3c29a81a
CUTLASS 3.6.0 ( #1850 )
...
* v3.6
* update changelog
* update readme
* fix typo
* fixing typos
* hopper gemm with weight prefetch
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2024-10-09 15:33:27 -04:00
e1976daacc
Add support for mixed 4-bit/8-bit data types GEMM ( #1413 )
...
* Add support for mixed 4-bit/8-bit data types GEMM
* fix ( and )
---------
Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2024-08-29 23:11:06 -04:00
3f084f7f3c
Add couple configs into generator.py for mixed input MM ( #1350 )
...
* Add couple configs into generator.py for mixed input MM
* change one unit test name; reenable 128x32 in the profiler
* Added U8/BF16 tests.
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2024-08-16 00:59:29 -04:00
8b2a0408bd
Profiler docs and argument update for raster order ( #1667 )
2024-07-31 16:40:10 -04:00
be60a0b272
CUTLASS 3.5.1 ( #1623 )
...
* CUTLASS 3.5.1
* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
7d49e6c7e2
Updates for CUTLASS 3.5.0 ( #1468 )
2024-04-11 21:33:40 -04:00
629f4653c3
CUTLASS 3.5.0 ( #1411 )
2024-03-19 17:51:04 -04:00
751eb9a885
Update license year ( #1306 )
2024-01-16 14:37:22 -05:00
2f589ffa76
Updates for 3.4 release. ( #1305 )
2024-01-16 13:42:51 -05:00
8236f30675
CUTLASS 3.4.0 ( #1286 )
...
* CUTLASS 3.4.0
* Update CHANGELOG.md
---------
Co-authored-by: Pradeep Ramani <prramani@nvidia.com >
2023-12-29 15:21:31 -05:00
c008b4aea8
CUTLASS 3.3.0 ( #1167 )
...
* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
2023-11-02 11:09:05 -04:00
757275f279
Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library ( #1132 )
...
* Adding more tiles in the cutlass_library for mixed-input support.
* fix rebase issue
* more tiles to upcast a
2023-10-13 11:33:15 -04:00
ff02da2667
Fx parallel split-k ( #1116 )
2023-10-06 12:02:40 -04:00
7d8317a63e
Support for Mixed Input TensorOp ( #1084 )
...
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
2023-09-27 11:18:30 -04:00
90d3b0fb18
CUTLASS 3.2.1 ( #1113 )
...
* Updates for 3.2.1 release.
* Minor fix in gemm op profiler for raster order.
* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
34bbadd3ff
standarize fp8 generator ( #1078 )
2023-09-07 14:36:33 -04:00
e01b9b5029
Shard gemm reference templates into multiple TUs for parallel compilation ( #1043 )
...
* Split apart gemm reference templates into multiple TUs for parallel compilation
* remove old files
* better balancing of ref kernels across TUs
* remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size
* remove auto fp8 kernels
* remove some redundant kernels
2023-08-30 16:46:30 -04:00
3a8f57a3c8
Add simple hash and eq methods for gemm_operations. ( #1053 )
2023-08-27 20:41:57 -04:00
a88c41cf8d
Updates for 3.2 release ( #1065 )
2023-08-25 23:05:46 -04:00
4575443d44
CUTLASS 3.2 ( #1024 )
...
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
473a67073e
Fix Int8 and TF32 generator ( #976 )
2023-06-12 12:32:52 -04:00
f079619f5e
More updates for 3.1 ( #958 )
...
* Updates for 3.1
* Minor change
* doc link fix
* Minor updates
2023-05-24 10:17:16 -04:00
b97404837e
Adding 128x256 tile for 16b input datatype WGMMA gemm ( #950 )
2023-05-17 17:13:23 -04:00
7c04f95415
Updates for 3.1 ( #932 )
2023-04-29 09:34:27 -04:00
df02482f1d
Add missing schedules argument in SM90 fp16 op generation ( #920 )
2023-04-26 16:44:49 -04:00
d572cc1aab
CUTLASS 3.1 ( #915 )
...
Co-authored-by: Aniket Shivam <ashivam@nvidia.com >
2023-04-14 23:19:34 -04:00
9b8166e3f0
fMHA: Add backward pass ( #844 )
...
* fMHA: Add backward pass
* Better checks for strides/alignments
* Remove fb-internal URL
* torch.Tensor.untyped_storage requires pytorch 2.0+
* minor changes
* make test
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2023-04-06 20:44:58 -04:00
e2d439ee7e
Add tile_n=32 and tile_k=32 kernels in generator.py ( #858 )
2023-04-06 10:00:52 -04:00
660a05f581
fix split_k_mode and add reduction kernel for f16 input/accum/output ( #896 )
2023-03-30 15:31:08 -04:00
7e370c9637
Fix typos 2 ( #842 )
...
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2023-03-09 23:22:56 -05:00
c4f6b8c6bc
Updates for 3.0 ( #857 )
...
Co-authored-by: Aniket Shivam <ashivam@nvidia.com >
2023-03-09 15:27:40 -05:00
a68e2f95f0
Reduce versbosity in manifest.py ( #845 )
2023-03-07 11:53:01 -05:00
65688c2a87
streamk fix ( #836 )
...
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2023-02-23 16:35:08 -05:00
9cdbe33570
Add fixed_channel and few_channel mode to int8 in generator ( #829 )
2023-02-21 21:15:39 -05:00
277bd6e537
CUTLASS 3.0.0 ( #786 )
...
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
66d9cddc83
New updates for 2.11 ( #775 )
...
* New updates.
* Minor profiler updates
Co-authored-by: Aniket Shivam <ashivam@nvidia.com >
2023-01-20 16:32:57 -05:00
764b840d6f
streamk example and performance tuning ( #760 )
...
* streamk example and performance tuning
* one missing file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2023-01-10 16:10:02 -05:00
df81d847d7
Make Python interface work for non-SM80 targets ( #726 )
...
* Make Python interface work for non-SM80 targets
* Remove line in README
2022-12-07 21:53:33 -05:00