b2ca083d2b
Fixed compilation error when using StreamK scheduler + PDL. ( #2686 )
2025-10-21 23:11:14 -04:00
b1d6e2c9b3
v4.3 update. ( #2709 )
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com >
2025-10-21 14:26:30 -04:00
e6e2cc29f5
fix ( #2684 )
2025-10-15 14:46:38 -04:00
f874df19ac
4.2.1 update
2025-09-23 13:45:13 -07:00
7a6d4ee099
v4.2.1 update. ( #2666 )
2025-09-23 13:25:43 -04:00
2b8dff1f90
Fix bfloat16 epsilon ( #2607 )
...
* Fix bfloat16 epsilon
* just use constants
---------
Co-authored-by: Konstantin <konstantin@MacBook-Air.local >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-09-21 23:43:59 -04:00
fd0312ddf6
Remove duplicate function calls ( #1584 )
2025-09-21 23:16:59 -04:00
57e3cfb47a
doc change for 4.2 ( #2639 )
...
* doc change
* fix broken links
* ragged gemm doc update
* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
e7e0adddac
Update version.h
...
change version number to 4.2
2025-09-15 12:40:58 -04:00
6a35b4d22f
v4.2 tag release. ( #2638 )
2025-09-15 12:21:53 -04:00
76c96b0be3
Fix incorrect shapes in copy_atom doc comments. ( #2575 )
2025-09-04 16:57:24 -07:00
d98e7bf7ce
Fix comment in mma_atom.hpp ( #2579 )
2025-09-04 16:56:39 -07:00
2288c0c901
Fix bugs in matrix.h ( #2598 )
2025-09-04 16:55:11 -07:00
496654bf2c
Fix sm100 gemm wrong static constexpr that breaks compilation on Windows ( #2167 )
...
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows
* More Windows fixes
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com >
* Revert "More Windows fixes"
This reverts commit 2e8cfc1382.
---------
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com >
2025-08-28 22:13:00 -04:00
a49a78ffef
v4.2 release. ( #2587 )
...
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
931359cec1
Fix typo in functional.h ( #2571 )
2025-08-19 22:22:31 -04:00
42e7c546c4
Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) ( #2562 )
2025-08-19 22:22:02 -04:00
052afcd314
fix typo ( #2529 )
2025-08-10 22:44:02 -04:00
9e6ab77d27
Fix a copy error in the SM70 main loop when loading data from smem to rmem ( #2540 )
2025-08-10 22:42:01 -04:00
6dd13d4278
Facebook: This commit makes its files safe for use with -Wimplicit-fallthrough. ( #2324 )
2025-07-31 20:55:19 -04:00
6c891db9f6
Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. ( #2333 )
2025-07-30 22:12:53 -04:00
26b7450023
support fp16 accumulator for sm89 fp8 mma ( #2378 )
...
* add support for sm89 in cute and the unit tests
* support fp16 accumulator for sm89 fp8 mma
* format code
2025-07-30 22:12:08 -04:00
f09045d660
Corrected minor nit in mma_traits.hpp ( #2447 )
...
* Corrected minor nit in mma_traits.hpp
The entry and descriptions were jumbled up.
* Update mma_traits.hpp
* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
664c4f7b3e
Update CUTLASS version to 4.1
...
Update CUTLASS version to 4.1.
2025-07-26 20:11:04 -04:00
fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
ad7b2f5e84
3.9.2 doc/version ( #2279 )
...
* 3.9.2 doc/version
* whitespace
2025-05-04 00:00:15 -04:00
89f6bf2739
Fix group scale gemm when K==128 ( #2275 )
...
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com >
2025-05-02 15:41:18 -04:00
f535c33634
3.9.1 doc/version change ( #2273 )
2025-05-01 00:27:00 -04:00
e5b810bed1
Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. ( #2256 )
...
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com >
2025-04-30 15:28:05 -04:00
2b78c2fe31
cherry-pick feature/hopper-blockwise-generalization-optimization ( #2270 )
2025-04-29 16:47:22 -04:00
697126019e
fix blackwell grouped groupwise hang ( #2267 )
2025-04-29 11:54:20 -04:00
331a1f5b3f
cutlass 3.9 update ( #2255 )
...
* cutlass 3.9 update
* rebase
* fixes out of shared memory for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-24 15:42:40 -04:00
8e345c5c5b
fix_missing_stdint ( #2199 )
...
* Update config.hpp
* Update config.hpp
* Update config.hpp
2025-04-23 22:21:22 -04:00
81a43e6d92
Set EpiTile correctly when TileN is not divisible by 32 ( #2220 )
...
If TileN is not divisible by 32 (e.g., 208), EpiTile would be set to 128 x 32
by default, which does not compile because EpiTileN is required to divide TileN.
2025-04-21 00:02:51 -04:00
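The divisibility constraint in the commit above can be illustrated with a small compile-time check. This is a sketch only, not the CUTLASS implementation: `pick_epi_tile_n` and `gcd_int` are hypothetical helpers, and the fallback rule (take the largest power of two up to the default 32 that divides TileN) is an assumption made for illustration.

```cpp
// Hypothetical helpers, not part of CUTLASS.
// Euclid's algorithm, written recursively so it is valid in a C++11 constexpr.
constexpr int gcd_int(int a, int b) { return b == 0 ? a : gcd_int(b, a % b); }

// Default epilogue tile width named in the commit message.
constexpr int kDefaultEpiTileN = 32;

// Assumed fallback rule: the largest divisor of 32 that also divides tile_n.
constexpr int pick_epi_tile_n(int tile_n) {
  return gcd_int(tile_n, kDefaultEpiTileN);
}

// 208 is not a multiple of 32, so the default width cannot tile it evenly.
static_assert(208 % kDefaultEpiTileN != 0, "32 does not divide 208");
// Under the assumed rule, the fallback width for 208 is 16 (208 = 16 * 13).
static_assert(pick_epi_tile_n(208) == 16, "fallback width for TileN = 208");
static_assert(208 % pick_epi_tile_n(208) == 0, "fallback width divides TileN");
// Tile widths that are multiples of 32 keep the default.
static_assert(pick_epi_tile_n(256) == 32, "default kept for TileN = 256");
```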
ade6376fa0
[SM90] Change register allocation for TileN=208 to avoid spills ( #2219 )
...
With the usual register allocation (producer 40, consumer 232), compiling a GEMM
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows heavy
register spilling (e.g. ~3000 bytes of spills). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
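The two register splits in the commit above can be checked with a short worked calculation. The sketch assumes details the commit message does not state: 65536 32-bit registers per SM, and a kernel with one producer warpgroup and two consumer warpgroups of 128 threads each. `total_regs` is a hypothetical helper, not a CUTLASS function.

```cpp
// Assumptions for illustration (not taken from the commit message):
constexpr int kRegsPerSM = 65536;          // 32-bit registers per SM
constexpr int kThreadsPerWarpgroup = 128;  // 4 warps of 32 threads
constexpr int kProducerWarpgroups = 1;
constexpr int kConsumerWarpgroups = 2;

// Total registers used when each producer thread holds producer_regs
// registers and each consumer thread holds consumer_regs registers.
constexpr int total_regs(int producer_regs, int consumer_regs) {
  return kThreadsPerWarpgroup * (kProducerWarpgroups * producer_regs +
                                 kConsumerWarpgroups * consumer_regs);
}

// Both splits consume the same total: 128 * (40 + 2 * 232) = 128 * 504.
static_assert(total_regs(40, 232) == 64512, "usual split");
static_assert(total_regs(24, 240) == 64512, "split used for TileN = 208");
static_assert(total_regs(24, 240) <= kRegsPerSM, "fits in the register file");
// Raising the consumers by another 8 registers would no longer fit.
static_assert(total_regs(24, 248) > kRegsPerSM, "248 per consumer is too many");
```

Under these assumptions, the change moves 16 registers per thread from the producer warpgroup to give 8 more to each consumer thread, leaving the total unchanged.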
9e1b649827
fix-left-inverse-for-nvcc114 ( #2196 )
2025-04-10 14:48:46 -04:00
5120b21cc3
suppress compilation warnings ( #2195 )
2025-04-10 14:48:01 -04:00
19cc2a5feb
add support for sm89 in cute and the unit tests ( #2177 )
...
* add support for sm89 in cute and the unit tests
* rebase v3.9 and format code
* minor fix
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-10 14:16:36 -04:00
df8a550d39
Update mma_atom.hpp ( #2159 )
...
remove useless code
2025-04-03 11:42:10 -04:00
79fc51f4b8
v3.9 update ( #2213 )
...
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-04-03 02:10:16 -04:00
6f4921858b
v3.9 update ( #2203 )
...
* v3.9 update
* voidD
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
2025-04-02 15:11:18 -04:00
62750a2b75
v3.9 ( #2185 )
...
* v3.8 update x
* fix blackwell gg
* doc change
* doc change
* doc change
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2025-03-21 01:52:23 -04:00
8c4d1dc47d
Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp ( #2110 )
...
* Treat negative zero as zero in the sparse gemm compressor
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* format
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* Apply patch
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
* sm90_sparse_gemm_compressor.hpp
* test/unit/transform/CMakeLists.txt
* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp
* include/cutlass/numeric_types.h
---------
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com >
2025-03-21 01:44:17 -04:00
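The sketch below shows, for plain `float`, why a zero test based on raw bit patterns has to ignore the sign bit if negative zero is to count as zero. It is an illustration only: the helper names are hypothetical, and how the CUTLASS compressor performs its own check is not shown in this log.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not CUTLASS code.
// Copy the IEEE 754 bit pattern of a float into an integer.
inline std::uint32_t float_bits(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return bits;
}

// Naive test: only the all-zero pattern counts as zero, so -0.0f
// (bit pattern 0x80000000) is reported as nonzero.
inline bool is_zero_bitwise(float x) { return float_bits(x) == 0u; }

// Sign-agnostic test: clear the sign bit before comparing, so both
// +0.0f and -0.0f are reported as zero.
inline bool is_zero_ignoring_sign(float x) {
  return (float_bits(x) & 0x7FFFFFFFu) == 0u;
}
```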
6c6b78550e
Fix SM90 beta=1 hang and stream-K launch errors ( #2172 )
...
* Fix stream-K occupancy calculation
* Fix beta=1 hang
2025-03-13 14:07:37 -04:00
06e560d98a
Blockwise/Groupwise kernel improvement and programmatic dependent launch enablement ( #2161 )
...
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com >
2025-03-10 14:36:11 -04:00
df18f5e4f5
Improvements for: Groupwise scaling along M for FP8 gemm ( #2095 )
...
* fix blockwise fp8 kernels
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* wip, < 128 not working
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* fix < 128
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* reduce diff
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* review comments
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* support partial n blocks
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
* fix build errors
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-27 22:39:29 -05:00
ca4fdbea70
Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper ( #2139 )
...
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com >
2025-02-26 12:44:58 -05:00