Commit Graph

655 Commits

Author SHA1 Message Date
2e2af190bd Revert "[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)" (#2370)
This reverts commit f12b1d75c9.
2025-06-05 23:14:57 -04:00
f12b1d75c9 [ex77] fix mla split; add fwd lse; add bwd varlen (#2366) 2025-06-05 18:39:46 -04:00
b244379d9b Merge pull request #2359 from NVIDIA/oss_ci
Initial Workflow Definition for blossom-ci support on CUTLASS GitHub
2025-06-03 14:04:35 -07:00
9d165a3b8e Handle get_masked_trip_count for small length in fmha example (#2292)
* handle get_masked_trip_count for small length

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

---------

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-05-30 22:51:18 -04:00
b9b110a9ea Correct divmod order in example 77 (blackwell fmha) (#2291)
* correct divmod naming

* order bidh/bidb
2025-05-30 22:50:40 -04:00
8206e7a0f5 Pre-compile in CuteDsl/ampere/elementwise_apply.py (#2340) 2025-05-28 10:24:39 -04:00
6316b6f867 Fix typos (#2311)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-23 08:30:10 -04:00
9354bfd7c1 Keep the documentation consistent with the sgemm_1.cu code. (#2285)
* Keep the documentation consistent with the sgemm_1.cu code.

* fix typo

---------

Co-authored-by: zky <zky@126.com>
2025-05-19 22:53:15 -04:00
5e9b8e2a25 fix docx (#2290)
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn>
2025-05-19 22:52:37 -04:00
1ec230c4bf Fix typo (#2299)
Needs == for pip to parse the file
2025-05-15 09:38:42 -04:00
f89cd95b16 Update elementwise_add.ipynb (#2298) 2025-05-15 09:38:27 -04:00
f115c3f854 Release v4.0.0 (#2294) 2025-05-13 15:55:29 -04:00
ad7b2f5e84 3.9.2 doc/version (#2279)
* 3.9.2 doc/version

* whitespace
v3.9.2
2025-05-04 00:00:15 -04:00
40f124ef27 [CUTLASS] Add GNA to PUBLICATIONS.md (#2276)
Adds "Generalized Neighborhood Attention" to list of publications using
CUTLASS.

https://arxiv.org/abs/2504.16922

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2025-05-02 16:57:19 -04:00
89f6bf2739 Fix group scale gemm when K==128 (#2275)
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>
2025-05-02 15:41:18 -04:00
f535c33634 3.9.1 doc/version change (#2273) v3.9.1 2025-05-01 00:27:00 -04:00
e3cb8a773a Import cuda, cudart, nvrtc lazily (#2251)
* Lazy cuda import

* More lazy cuda import

* More lazy cuda imports

* minor fixes

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-30 23:10:33 -04:00
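
The lazy-import commits in this part of the log defer loading optional modules until they are first used. Below is a minimal sketch of that general pattern, not the actual CUTLASS change; the helper name `lazy_import` is hypothetical, and `json` stands in for a heavy dependency such as the CUDA bindings.

```python
import importlib


def lazy_import(module_name):
    """Return a zero-argument loader that imports `module_name` on first call.

    Hypothetical helper for illustration; not part of CUTLASS.
    """
    cache = {}

    def load():
        # Import only on first use, then reuse the cached module object.
        if "module" not in cache:
            cache["module"] = importlib.import_module(module_name)
        return cache["module"]

    return load


# `json` is a stand-in for a heavy optional dependency.
get_json = lazy_import("json")


def dumps_sorted(obj):
    # The import cost is paid here, at first call, not at module load time.
    return get_json().dumps(obj, sort_keys=True)
```
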
c4bdfe821c Lazy scipy import (#2250) 2025-04-30 16:10:00 -04:00
b3ce7e12b7 Make cc a positional argument (#2249) 2025-04-30 16:09:25 -04:00
fe75ead92e Import pydot lazily (#2248) 2025-04-30 16:08:17 -04:00
35136f5564 Fix wrong detection of python version for use_rmm. (#2224) 2025-04-30 15:29:33 -04:00
e5b810bed1 Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256)
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>
2025-04-30 15:28:05 -04:00
2b78c2fe31 cherry-pick feature/hopper-blockwise-generalization-optimization (#2270) 2025-04-29 16:47:22 -04:00
697126019e fix blackwell grouped groupwise hang (#2267) 2025-04-29 11:54:20 -04:00
e94e888df3 Update CHANGELOG.md v3.9.0 2025-04-24 21:51:34 -04:00
be73ad20a5 Update CHANGELOG.md for 3.9 2025-04-24 16:54:06 -04:00
f02a7c2976 Update README.md for 3.9 2025-04-24 16:51:45 -04:00
331a1f5b3f cutlass 3.9 update (#2255)
* cutlass 3.9 update

* rebase

* fixes out of shared memory for blockwise Blackwell

* doc format

* fix issue 2253

* disable host ref by default

* fix sm120 smem capacity

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
8e345c5c5b fix_missing_stdint (#2199)
* Update config.hpp

* Update config.hpp

* Update config.hpp
2025-04-23 22:21:22 -04:00
81a43e6d92 Set EpiTile correctly when TileN is not divisible by 32 (#2220)
If TileN is not divisible by 32 (e.g., 208), by default EpiTile would be set
to 128 x 32, which does not compile because EpiTileN is required to divide TileN
2025-04-21 00:02:51 -04:00
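
The divisibility constraint in the message above can be checked with plain arithmetic. This is a sketch only, reading the constraint as "the epilogue tile width must evenly divide the tile width"; `epi_tile_n_divides` is a hypothetical helper, and the candidate widths other than 32 are illustrative, not the values the commit selects.

```python
def epi_tile_n_divides(tile_n, epi_tile_n):
    """True when the epilogue tile width evenly divides the tile width."""
    return tile_n % epi_tile_n == 0


tile_n = 208

# 208 = 6 * 32 + 16, so the default epilogue width of 32 leaves a remainder.
default_ok = epi_tile_n_divides(tile_n, 32)

# Illustrative candidates only: keep the widths that do divide 208.
candidates = [n for n in (8, 16, 32, 64) if epi_tile_n_divides(tile_n, n)]
```
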
ade6376fa0 [SM90] Change register allocation for TileN=208 to avoid spills (#2219)
With the usual register allocation (producer 40, consumer 232), compiling Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows lots of
register spilling (e.g. ~3000 bytes spilled). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
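
A quick arithmetic check of the two allocations in the message above. The sketch assumes one producer warpgroup and two consumer warpgroups of 128 threads each, and a register file of 64K 32-bit entries per SM; none of these figures is stated in the commit message, and `registers_used` is a hypothetical helper.

```python
THREADS_PER_WARPGROUP = 128        # assumption: 4 warps of 32 threads
REGISTER_FILE_ENTRIES = 64 * 1024  # assumption: 32-bit registers per SM


def registers_used(producer_regs, consumer_regs, producers=1, consumers=2):
    """Total registers claimed by all warpgroups at the given per-thread counts."""
    per_thread_sum = producers * producer_regs + consumers * consumer_regs
    return per_thread_sum * THREADS_PER_WARPGROUP


usual = registers_used(40, 232)     # usual allocation from the message
adjusted = registers_used(24, 240)  # allocation proposed for TileN=208
```

Under these assumptions both allocations claim the same total, so the change gives up 16 registers per producer thread and hands 8 to each consumer thread rather than raising the overall budget.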
bb4dd682dd Fix broken links and alt text in cluster launch control docs (#2234)
* Fix broken links in cluster launch control docs

* Improve titles and alt text
2025-04-21 00:01:12 -04:00
5e497243f7 fix: fig link in cute docs (#2216) 2025-04-10 14:51:41 -04:00
b3f3c7758c Update tile_iterator.cu (#2204)
Some typos in comments
2025-04-10 14:49:58 -04:00
9e1b649827 fix-left-inverse-for-nvcc114 (#2196) 2025-04-10 14:48:46 -04:00
5120b21cc3 suppress compilation warnings (#2195) 2025-04-10 14:48:01 -04:00
dd76dec4ef [Doc] Make C++ code more plausible (#2156)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:35:46 -04:00
19cc2a5feb add support for sm89 in cute and the unit tests (#2177)
* add support for sm89 in cute and the unit tests

* rebase v3.9 and format code

* minor fix

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:16:36 -04:00
09df6ac464 [Doc]fix typo (#2174)
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 12:46:53 -04:00
df8a550d39 Update mma_atom.hpp (#2159)
remove useless code
2025-04-03 11:42:10 -04:00
79fc51f4b8 v3.9 update (#2213)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-03 02:10:16 -04:00
6f4921858b v3.9 update (#2203)
* v3.9 update

* voidD

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
62750a2b75 v3.9 (#2185)
* v3.8 update x

* fix blackwell gg

* doc change

* doc change

* doc change

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
8c4d1dc47d Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp (#2110)
* Treat negative zero as zero in the sparse gemm compressor

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* format

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Apply patch

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* sm90_sparse_gemm_compressor.hpp

* test/unit/transform/CMakeLists.txt

* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp

* include/cutlass/numeric_types.h

---------

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:44:17 -04:00
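
Why negative zero needs separate handling: -0.0 compares equal to 0.0 but its bit pattern is not all zeros, so a test on the raw bits would count it as a non-zero element. The sketch below shows the distinction for 32-bit floats; the function names are hypothetical and this is not the compressor's code.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_zero_bitwise(x):
    # Naive check: only +0.0 has an all-zero bit pattern.
    return float32_bits(x) == 0


def is_zero_ignoring_sign(x):
    # Mask off the sign bit so that -0.0 and +0.0 are both treated as zero.
    return (float32_bits(x) & 0x7FFFFFFF) == 0
```
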
3fe62887d8 adding blackwell (#2143) 2025-03-17 22:20:40 -04:00
bd03b22f64 fix typo (#2136)
Co-authored-by: XiaoDong <xiaod@nvidia.com>
2025-03-17 22:19:43 -04:00
6c6b78550e Fix SM90 beta=1 hang and stream-K launch errors (#2172)
* Fix stream-K occupancy calculation

* Fix beta=1 hang
2025-03-13 14:07:37 -04:00
06e560d98a Blockwise/Groupwise kernel improvement and programmatic dependent launch enablement (#2161)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-03-10 14:36:11 -04:00
df18f5e4f5 Improvements for: Groupwise scaling along M for FP8 gemm (#2095)
* fix blockwise fp8 kernels

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* wip, < 128 not working

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix < 128

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* reduce diff

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* review comments

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* support partial n blocks

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix build errors

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-27 22:39:29 -05:00
ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139)
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.

Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00