fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
9baa06dd57
Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 ( #2472 )
2025-07-18 01:27:48 -04:00
ebe98c549a
cache procedural_name in GemmOperation ( #2317 )
2025-07-16 22:25:02 -04:00
9892624b66
Fix typos in the text ( #2417 )
2025-07-16 21:51:12 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
b995f93317
4.0 doc change ( #2425 )
v4.0.0
2025-06-27 09:35:06 -04:00
889ff20648
v4.0 update v2. ( #2420 )
...
* Ex77 forward kernel fix.
2025-06-25 12:56:25 -04:00
dc4817921e
v4.0 update. ( #2398 )
...
* Ex77 fix.
2025-06-12 09:10:29 -04:00
5c6bca0441
Update requirements.txt ( #2390 )
...
Remove the dev suffix in the wheel version
2025-06-10 02:31:49 -04:00
c2ad7c5b20
fix link in readme ( #2379 )
2025-06-07 07:38:38 -04:00
cc23f6d1e9
fix link ( #2377 )
2025-06-07 06:00:39 -04:00
5a287538c2
"Update CHANGELOG for 4.0 tagging" ( #2374 )
2025-06-06 10:07:36 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
2e2af190bd
Revert "[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )" ( #2370 )
...
This reverts commit f12b1d75c9 .
2025-06-05 23:14:57 -04:00
f12b1d75c9
[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )
2025-06-05 18:39:46 -04:00
b244379d9b
Merge pull request #2359 from NVIDIA/oss_ci
...
Initial Workflow Definition for blossom-ci support on CUTLASS GitHub
2025-06-03 14:04:35 -07:00
9d165a3b8e
Handle get_masked_trip_count for small length in fmha example ( #2292 )
...
* handle get_masked_trip_count for small length
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
---------
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
2025-05-30 22:51:18 -04:00
b9b110a9ea
Correct divmod order in example 77 (blackwell fmha) ( #2291 )
...
* correct divmod naming
* order bidh/bidb
2025-05-30 22:50:40 -04:00
8206e7a0f5
Pre-compile in CuteDsl/ampere/elementwise_apply.py ( #2340 )
2025-05-28 10:24:39 -04:00
6316b6f867
Fix typos ( #2311 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-05-23 08:30:10 -04:00
9354bfd7c1
Keep the documentation consistent with the sgemm_1.cu code. ( #2285 )
...
* Keep the documentation consistent with the sgemm_1.cu code.
* fix typo
---------
Co-authored-by: zky <zky@126.com >
2025-05-19 22:53:15 -04:00
5e9b8e2a25
fix docx ( #2290 )
...
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn >
2025-05-19 22:52:37 -04:00
1ec230c4bf
Fix typo ( #2299 )
...
Needs == for pip to parse the file
2025-05-15 09:38:42 -04:00
f89cd95b16
Update elementwise_add.ipynb ( #2298 )
2025-05-15 09:38:27 -04:00
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
ad7b2f5e84
3.9.2 doc/version ( #2279 )
...
* 3.9.2 doc/version
* whitespace
v3.9.2
2025-05-04 00:00:15 -04:00
40f124ef27
[CUTLASS] Add GNA to PUBLICATIONS.md ( #2276 )
...
Adds "Generalized Neighborhood Attention" to list of publications using
CUTLASS.
https://arxiv.org/abs/2504.16922
Co-authored-by: Ali Hassani <ahassani@nvidia.com >
2025-05-02 16:57:19 -04:00
89f6bf2739
Fix group scale gemm when K==128 ( #2275 )
...
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com >
2025-05-02 15:41:18 -04:00
f535c33634
3.9.1 doc/version change ( #2273 )
v3.9.1
2025-05-01 00:27:00 -04:00
e3cb8a773a
Import cuda, cudart, nvrtc lazily ( #2251 )
...
* Lazy cuda import
* More lazy cuda import
* More lazy cuda imports
* minor fixes
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-30 23:10:33 -04:00
c4bdfe821c
Lazy scipy import ( #2250 )
2025-04-30 16:10:00 -04:00
b3ce7e12b7
Make cc a positional argument ( #2249 )
2025-04-30 16:09:25 -04:00
fe75ead92e
Import pydot lazily ( #2248 )
2025-04-30 16:08:17 -04:00
35136f5564
Fix wrong detection of python version for use_rmm. ( #2224 )
2025-04-30 15:29:33 -04:00
e5b810bed1
Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. ( #2256 )
...
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com >
2025-04-30 15:28:05 -04:00
2b78c2fe31
cherry-pick feature/hopper-blockwise-generalization-optimization ( #2270 )
2025-04-29 16:47:22 -04:00
697126019e
fix blackwell grouped groupwise hang ( #2267 )
2025-04-29 11:54:20 -04:00
e94e888df3
Update CHANGELOG.md
v3.9.0
2025-04-24 21:51:34 -04:00
be73ad20a5
Update CHANGELOG.md for 3.9
2025-04-24 16:54:06 -04:00
f02a7c2976
Update README.md for 3.9
2025-04-24 16:51:45 -04:00
331a1f5b3f
cutlass 3.9 update ( #2255 )
...
* cutlass 3.9 update
* rebase
* fixes out of shared memory for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com >
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-24 15:42:40 -04:00
8e345c5c5b
fix_missing_stdint ( #2199 )
...
* Update config.hpp
* Update config.hpp
* Update config.hpp
2025-04-23 22:21:22 -04:00
81a43e6d92
Set EpiTile correctly when TileN is not divisible by 32 ( #2220 )
...
If TileN is not divisible by 32 (e.g., 208), by default EpiTile would be set
to 128 x 32, which does not compile as EpiTileN is required to divide TileN
2025-04-21 00:02:51 -04:00
ade6376fa0
[SM90] Change register allocation for TileN=208 to avoid spills ( #2219 )
...
With the usual register allocation (producer 40, consumer 232), compiling Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows lots of
register spilling (e.g., ~3000 bytes of spill). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
bb4dd682dd
Fix broken links and alt text in cluster launch control docs ( #2234 )
...
* Fix broken links in cluster launch control docs
* Improve titles and alt text
2025-04-21 00:01:12 -04:00
5e497243f7
fix: fig link in cute docs ( #2216 )
2025-04-10 14:51:41 -04:00
b3f3c7758c
Update tile_iterator.cu ( #2204 )
...
Some typos in comments
2025-04-10 14:49:58 -04:00
9e1b649827
fix-left-inverse-for-nvcc114 ( #2196 )
2025-04-10 14:48:46 -04:00
5120b21cc3
suppress compilation warnings ( #2195 )
2025-04-10 14:48:01 -04:00
dd76dec4ef
[Doc] Make C++ code more plausible ( #2156 )
...
Co-authored-by: Haicheng Wu <haichengw@nvidia.com >
2025-04-10 14:35:46 -04:00