726 Commits

Author SHA1 Message Date
81a43e6d92 Set EpiTile correctly when TileN is not divisible by 32 (#2220)
If TileN is not divisible by 32 (e.g, 208), by default EpiTile would be set
to 128 x 32, which does not compile as TileN is required to divide EpiTileN
2025-04-21 00:02:51 -04:00
ade6376fa0 [SM90] Change register allocation for TileN=208 to avoid spills (#2219)
With the usual register allocation (producer 40, consumer 232) compiling Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) show lots of
register spilling (e.g. ~3000 bytes spill). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
bb4dd682dd Fix broken links and alt text in cluster launch control docs (#2234)
* Fix broken links in cluster launch control docs

* Improve titles and alt text
2025-04-21 00:01:12 -04:00
5e497243f7 fix: fig link in cute docs (#2216) 2025-04-10 14:51:41 -04:00
b3f3c7758c Update tile_iterator.cu (#2204)
Some typos in comments
2025-04-10 14:49:58 -04:00
9e1b649827 fix-left-inverse-for-nvcc114 (#2196) 2025-04-10 14:48:46 -04:00
5120b21cc3 suppress compilation warnings (#2195) 2025-04-10 14:48:01 -04:00
dd76dec4ef [Doc] Make C++ code more plausible (#2156)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:35:46 -04:00
19cc2a5feb add support for sm89 in cute and the unit tests (#2177)
* add support for sm89 in cute and the unit tests

* rebase v3.9 and format code

* minor fix

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:16:36 -04:00
09df6ac464 [Doc]fix typo (#2174)
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 12:46:53 -04:00
df8a550d39 Update mma_atom.hpp (#2159)
remove useless code
2025-04-03 11:42:10 -04:00
79fc51f4b8 v3.9 update (#2213)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-03 02:10:16 -04:00
6f4921858b v3.9 update (#2203)
* v3.9 update

* voidD

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
62750a2b75 v3.9 (#2185)
* v3.8 update x

* fix blackwell gg

* doc change

* doc change

* doc change

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
8c4d1dc47d Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp (#2110)
* Treat negative zero as zero in the sparse gemm compressor

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* format

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Apply patch

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* sm90_sparse_gemm_compressor.hpp

* test/unit/transform/CMakeLists.txt

* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp

* include/cutlass/numeric_types.h

---------

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:44:17 -04:00
3fe62887d8 adding blackwell (#2143) 2025-03-17 22:20:40 -04:00
bd03b22f64 fix typo (#2136)
Co-authored-by: XiaoDong <xiaod@nvidia.com>
2025-03-17 22:19:43 -04:00
6c6b78550e Fix SM90 beta=1 hang and stream-K launch errors (#2172)
* Fix stream-K occupancy calculation

* Fix beta=1 hang
2025-03-13 14:07:37 -04:00
06e560d98a Blockwise/Groupwise kernel improvement and programatic dependent launch enablement (#2161)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-03-10 14:36:11 -04:00
df18f5e4f5 Improvements for: Groupwise scaling along M for FP8 gemm (#2095)
* fix blockwise fp8 kernels

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* wip, < 128 not working

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix < 128

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* reduce diff

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* review comments

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* support partial n blocks

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix build errors

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-27 22:39:29 -05:00
ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139)
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.

Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00
eefa171318 [EVT] Fix Row/Col broadcast with array arguments (#2120)
* Use constexpr in if to prevent invalid comparison.

* Move constexpr check into else scope.
2025-02-21 17:47:30 -05:00
afa1772203 truncate name for cutlass profiler (#2124)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
v3.8.0
2025-02-21 00:16:56 -05:00
9b3772dfa6 Hopper Grouped GEMM support for FP8 Accum (#2123)
* Add support for fp8accum, with profiler extension

* Update .gitignore

* contri

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-20 21:55:26 -05:00
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
e9627ce55b Always use cudaGetDriverEntryPoint with CUDA 12 (#2086)
`cudaGetDriverEntryPointByVersion` has been added to drivers in 12.5, but we don't know at compile time the driver version.
In particular, we can build with nvcc 12.8 for a 12.2 driver for instance, and this was causing the following error:

```
undefined symbol: cudaGetDriverEntryPointByVersion,
```
2025-02-11 13:04:25 -05:00
ad6e1ec19c Add ParetoQ to PUBLICATIONS.md (#2089) 2025-02-10 16:47:02 -05:00
0642d46dd4 Update 0x_gemm_tutorial.md (#2090) 2025-02-10 16:46:43 -05:00
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
affd1b693d [EVT] Add support for Row/Col broadcast PtrArray (#2033)
* Add group support to EVT row/col broadcast.

* small modifications

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-02 12:10:07 -05:00
6f55278121 bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
bdd641790a Update README.md 2025-01-28 18:08:13 -05:00
cc19d4d22b fix a readme broken link (#2069) 2025-01-28 18:03:34 -05:00
47daa33c61 fix cuda 12.6 issues (#2066) 2025-01-28 17:28:29 -05:00
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
9eb01fa0b0 update 3.7 docs (#2051)
* update docs

* update docs

* update docs

* update docs

* update docs

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-23 15:13:50 -05:00
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
v3.7.0
2025-01-18 09:53:07 -05:00
902dff3663 fix assertion in integer_subbytes.h (#1961) 2025-01-09 22:47:58 -05:00
ef5620dd1d Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
375e284e6a Add Line Break (#2020) 2025-01-08 23:46:59 -05:00
52b35e90ce Fix Typos (#2021)
* Fix Typo

* Fix Typo
2025-01-08 23:46:28 -05:00
24f991e879 Fix typo in library_defaults.py (#2024) 2025-01-08 15:44:11 -05:00
51b25e7b58 Add vector-types back to platform.h (#2026) 2025-01-08 15:31:59 -05:00
ZZK
7de6a59784 Add half->int8 saturate conversion to promise valid range (#1983)
* Add half->int8 saturate conversion to promise valid range

* add gpu only macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-08 09:01:07 -05:00
c506e16788 fix mem fence (#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-07 19:02:26 -05:00
7494a180a4 fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (#1989) 2025-01-06 22:05:12 -05:00
cffd5d32b7 Update 0x_gemm_tutorial.md (#1982)
Shouldn't this be BLK_M, BLK_**K**, k
2025-01-06 22:04:35 -05:00
bf9da7b76c Update CHANGELOG.md v3.6.0 2024-12-25 17:11:15 -05:00
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00