fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
ad7b2f5e84
3.9.2 doc/version ( #2279 )
...
* 3.9.2 doc/version
* whitespace
2025-05-04 00:00:15 -04:00
89f6bf2739
Fix group scale gemm when K==128 ( #2275 )
...
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com >
2025-05-02 15:41:18 -04:00
f535c33634
3.9.1 doc/version change ( #2273 )
2025-05-01 00:27:00 -04:00
e5b810bed1
Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. ( #2256 )
...
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com >
2025-04-30 15:28:05 -04:00
2b78c2fe31
cherry-pick feature/hopper-blockwise-generalization-optimization ( #2270 )
2025-04-29 16:47:22 -04:00
697126019e
fix blackwell grouped groupwise hang ( #2267 )
2025-04-29 11:54:20 -04:00
331a1f5b3f
cutlass 3.9 update ( #2255 )
...
* cutlass 3.9 update
* rebase
* fixes out of shared memory for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-24 15:42:40 -04:00
8e345c5c5b
fix_missing_stdint ( #2199 )
...
* Update config.hpp
* Update config.hpp
* Update config.hpp
2025-04-23 22:21:22 -04:00
81a43e6d92
Set EpiTile correctly when TileN is not divisible by 32 ( #2220 )
...
If TileN is not divisible by 32 (e.g., 208), EpiTile would by default be set
to 128 x 32, which does not compile because EpiTileN is required to divide TileN.
2025-04-21 00:02:51 -04:00
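The divisibility constraint described in the commit above can be checked at compile time. The following is a minimal sketch only: `pick_epi_tile_n` and its candidate list are hypothetical names invented for illustration, and the commit message does not say which EpiTile the actual fix selects.

```cpp
// Hypothetical helper, not part of CUTLASS: return the largest candidate
// epilogue tile width that evenly divides tile_n.
constexpr int pick_epi_tile_n(int tile_n) {
  constexpr int candidates[] = {64, 32, 16, 8};  // assumed candidate widths
  for (int c : candidates) {
    if (tile_n % c == 0) {
      return c;
    }
  }
  return tile_n;  // fall back to a single epilogue tile along N
}

// 208 is not a multiple of 32, so an EpiTileN of 32 cannot tile it evenly.
static_assert(208 % 32 != 0, "208 is not divisible by 32");
static_assert(pick_epi_tile_n(208) == 16, "208 = 13 * 16");
static_assert(pick_epi_tile_n(256) == 64, "256 = 4 * 64");
```

With this rule a TileN of 208 falls back to a width of 16, since 208 = 13 x 16.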
ade6376fa0
[SM90] Change register allocation for TileN=208 to avoid spills ( #2219 )
...
With the usual register allocation (producer 40, consumer 232), compiling a Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows heavy
register spilling (e.g. ~3000 bytes of spills). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
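The arithmetic behind the two register splits in the commit above can be sketched as follows. The constants are assumptions not stated in the commit message: 64K 32-bit registers per SM, 128 threads per warpgroup, and one producer plus two consumer warpgroups per CTA. The 24..256, multiple-of-8 rule reflects the PTX `setmaxnreg` constraint as I understand it.

```cpp
// Assumed hardware and kernel layout; none of these values come from the commit.
constexpr int kRegistersPerSM      = 64 * 1024;  // 32-bit registers per SM
constexpr int kThreadsPerWarpGroup = 128;
constexpr int kConsumerWarpGroups  = 2;          // plus one producer warpgroup

// Total registers used by one CTA for a given per-thread split.
constexpr int registers_used(int producer_regs, int consumer_regs) {
  return kThreadsPerWarpGroup *
         (producer_regs + kConsumerWarpGroups * consumer_regs);
}

// Per-thread register counts are assumed to be multiples of 8 in [24, 256].
constexpr bool is_valid_reg_count(int regs) {
  return regs >= 24 && regs <= 256 && regs % 8 == 0;
}

// Both splits use the same total, so the change moves registers from the
// producer to the consumers without raising the overall budget.
static_assert(registers_used(40, 232) == 64512, "usual split");
static_assert(registers_used(24, 240) == 64512, "TileN=208 split");
static_assert(registers_used(24, 240) <= kRegistersPerSM, "fits in one SM");
static_assert(is_valid_reg_count(24) && is_valid_reg_count(240), "valid counts");
```

Under these assumptions each consumer thread gains 8 registers and the producer gives up 16, which matches the two consumer warpgroups sharing what one producer warpgroup releases.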
9e1b649827
fix-left-inverse-for-nvcc114 ( #2196 )
2025-04-10 14:48:46 -04:00
5120b21cc3
suppress compilation warnings ( #2195 )
2025-04-10 14:48:01 -04:00
19cc2a5feb
add support for sm89 in cute and the unit tests ( #2177 )
...
* add support for sm89 in cute and the unit tests
* rebase v3.9 and format code
* minor fix
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-10 14:16:36 -04:00
df8a550d39
Update mma_atom.hpp ( #2159 )
...
remove useless code
2025-04-03 11:42:10 -04:00
79fc51f4b8
v3.9 update ( #2213 )
...
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-04-03 02:10:16 -04:00
6f4921858b
v3.9 update ( #2203 )
...
* v3.9 update
* voidD
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-04-02 15:11:18 -04:00
62750a2b75
v3.9 ( #2185 )
...
* v3.8 update x
* fix blackwell gg
* doc change
* doc change
* doc change
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2025-03-21 01:52:23 -04:00
8c4d1dc47d
Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp ( #2110 )
...
* Treat negative zero as zero in the sparse gemm compressor
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* format
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* Apply patch
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* sm90_sparse_gemm_compressor.hpp
* test/unit/transform/CMakeLists.txt
* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp
* include/cutlass/numeric_types.h
---------
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2025-03-21 01:44:17 -04:00
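Why negative zero needs special handling can be illustrated with a short sketch. It assumes IEEE 754 `float` and uses hypothetical helper names. The commit does not say how the compressor originally tested for zero, so this only shows how a bit-pattern test and a value test can disagree.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float without violating aliasing rules.
inline std::uint32_t float_bits(float x) {
  std::uint32_t bits = 0;
  std::memcpy(&bits, &x, sizeof(bits));
  return bits;
}

// A bit-pattern test: only +0.0f has an all-zero encoding.
inline bool is_zero_by_bits(float x) { return float_bits(x) == 0u; }

// A value test: IEEE 754 comparison treats -0.0f and +0.0f as equal.
inline bool is_zero_by_value(float x) { return x == 0.0f; }
```

For `-0.0f`, `is_zero_by_bits` returns false because the sign bit is set, while `is_zero_by_value` returns true. A sparsity compressor that wants to drop every zero element needs the second behavior.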
6c6b78550e
Fix SM90 beta=1 hang and stream-K launch errors ( #2172 )
...
* Fix stream-K occupancy calculation
* Fix beta=1 hang
2025-03-13 14:07:37 -04:00
06e560d98a
Blockwise/Groupwise kernel improvement and programmatic dependent launch enablement ( #2161 )
...
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com >
2025-03-10 14:36:11 -04:00
df18f5e4f5
Improvements for: Groupwise scaling along M for FP8 gemm ( #2095 )
...
* fix blockwise fp8 kernels
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* wip, < 128 not working
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* fix < 128
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* reduce diff
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* review comments
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* support partial n blocks
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* fix build errors
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-27 22:39:29 -05:00
ca4fdbea70
Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper ( #2139 )
...
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com >
2025-02-26 12:44:58 -05:00
eefa171318
[EVT] Fix Row/Col broadcast with array arguments ( #2120 )
...
* Use constexpr in if to prevent invalid comparison.
* Move constexpr check into else scope.
2025-02-21 17:47:30 -05:00
9b3772dfa6
Hopper Grouped GEMM support for FP8 Accum ( #2123 )
...
* Add support for fp8accum, with profiler extension
* Update .gitignore
* contri
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-02-20 21:55:26 -05:00
b84e9802d8
update 3.8 v2 ( #2112 )
...
* update 3.8 v2
* update 3.8
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-02-19 22:03:14 -05:00
e9627ce55b
Always use cudaGetDriverEntryPoint with CUDA 12 ( #2086 )
...
`cudaGetDriverEntryPointByVersion` was added to drivers in 12.5, but the driver version is not known at compile time.
For instance, we can build with nvcc 12.8 and run against a 12.2 driver, which caused the following error:
```
undefined symbol: cudaGetDriverEntryPointByVersion,
```
2025-02-11 13:04:25 -05:00
833f6990e0
v3.8.0 update ( #2082 )
...
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-02-06 21:33:40 -05:00
affd1b693d
[EVT] Add support for Row/Col broadcast PtrArray ( #2033 )
...
* Add group support to EVT row/col broadcast.
* small modifications
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-02-02 12:10:07 -05:00
6f55278121
bugfix generic-k code in top-k with softmax ( #1993 )
...
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com >
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com >
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com >
2025-01-31 19:05:35 -05:00
3c28697b9f
Groupwise scaling along M for FP8 gemm ( #2037 )
...
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <zl@deepseek.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-31 13:51:28 -05:00
47daa33c61
fix cuda 12.6 issues ( #2066 )
2025-01-28 17:28:29 -05:00
389e493055
CUTLASS 3.8 Release ( #2059 )
...
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36fe8 .
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-25 02:44:06 -05:00
b78588d163
CUTLASS 3.7 ( #2045 )
...
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-18 09:53:07 -05:00
ef5620dd1d
Blockwise Scaling for FP8 ( #1932 )
...
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-09 11:22:09 -05:00
375e284e6a
Add Line Break ( #2020 )
2025-01-08 23:46:59 -05:00
51b25e7b58
Add vector-types back to platform.h ( #2026 )
2025-01-08 15:31:59 -05:00
7de6a59784
Add half->int8 saturate conversion to promise valid range ( #1983 )
...
* Add half->int8 saturate conversion to promise valid range
* add gpu only macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-01-08 09:01:07 -05:00
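A saturating conversion of the kind named in the commit above might look like the sketch below. It uses `float` as a stand-in for `half`, and `saturate_to_int8` is a hypothetical name, not the CUTLASS converter. Mapping NaN to 0 is also an assumption made only for this example.

```cpp
#include <cmath>
#include <cstdint>

// Round to nearest (current rounding mode, ties-to-even by default) and
// clamp into the int8 range so the result is always representable.
inline std::int8_t saturate_to_int8(float x) {
  if (std::isnan(x)) {
    return 0;  // assumed policy for NaN inputs
  }
  const float rounded = std::nearbyint(x);
  if (rounded > 127.0f) {
    return 127;
  }
  if (rounded < -128.0f) {
    return -128;
  }
  return static_cast<std::int8_t>(rounded);
}
```

Without the clamp, converting an out-of-range value such as 300.0f to `int8_t` is undefined behavior in C++, which is the situation a saturating conversion avoids.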
c506e16788
fix mem fence ( #2030 )
...
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-01-07 19:02:26 -05:00
7494a180a4
fix bug: arch/mma_sm60.h Mma<2,2,1> computes a wrong result ( #1989 )
2025-01-06 22:05:12 -05:00
3d261a5974
3.6.0 update ( #2005 )
...
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2024-12-25 01:34:40 -05:00
e1cd8c7866
Fix Typo ( #1962 )
2024-12-10 22:07:37 -05:00
2b6cfd34d1
fix a typo that breaks compilation when ElementScale is not the same as MmaType ( #1977 )
2024-12-10 15:54:44 -05:00
4c42f73fda
Improve mixed dtype GEMM ( #1972 )
...
* update
* fix a typo
2024-12-06 13:33:22 -05:00
80243e0b8c
add {uint4, uint2, int2} => {fp16, bf16} conversion ( #1966 )
2024-12-03 14:03:43 -05:00
8aa95dbb88
Fix the race condition of mixed-input gemm when writing the registers ( #1931 )
...
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
2024-11-08 13:15:54 -05:00
d656afbd2a
fix undefined in device code error ( #1880 )
2024-11-06 14:56:54 -05:00