19772cd63e
Fix typo in smem_allocator.py ( #2517 )
2025-08-10 22:44:22 -04:00
052afcd314
fix typo ( #2529 )
2025-08-10 22:44:02 -04:00
86cf63e2d4
NIT: Grammar ( #2537 )
2025-08-10 22:42:45 -04:00
a267d47f9b
Update batched_gemm.cu ( #2538 )
2025-08-10 22:42:21 -04:00
9e6ab77d27
Fix a copy error in the SM70 main loop when loading data from smem to rmem ( #2540 )
2025-08-10 22:42:01 -04:00
d0eada85a3
Support both CUDA 12 and 13 cccl header locations ( #2543 )
2025-08-10 22:41:25 -04:00
23139309e9
Fix incorrect K dim in CuTe MMA Atom doc. ( #2544 )
2025-08-10 22:40:56 -04:00
6dd13d4278
Facebook: This commit makes its files safe for use with -Wimplicit-fallthrough. ( #2324 )
2025-07-31 20:55:19 -04:00
3b054767b3
Fix typo ( #2514 )
2025-07-30 22:14:54 -04:00
6fb5e667c1
[Doc fix] incorrect compute cap. for Blackwell RTX ( #2511 )
...
Blackwell RTX is compute capability 12.0 (SM120) but incorrectly listed
as SM100 in the README.
2025-07-30 22:14:13 -04:00
6c891db9f6
Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. ( #2333 )
2025-07-30 22:12:53 -04:00
da47886e34
Fix example bug ( #2351 )
2025-07-30 22:12:33 -04:00
26b7450023
support fp16 accumulator for sm89 fp8 mma ( #2378 )
...
* add support for sm89 in cute and the unit tests
* support fp16 accumulator for sm89 fp8 mma
* format code
2025-07-30 22:12:08 -04:00
a39cf6b511
Fix example in CuTe tutorials ( #2416 )
2025-07-30 22:11:47 -04:00
f09045d660
Corrected minor nit in mma_traits.hpp ( #2447 )
...
* Corrected minor nit in mma_traits.hpp
The entry and descriptions were jumbled up.
* Update mma_traits.hpp
* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
84a27b3926
fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu GridDim miscalculated ( #2492 )
...
* fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu Launch dimGrid error
* feat: add cta tiler
* Update examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu
use cluster_layout_vmnk instead of cta_tiler
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com >
* feat: remove cta_tiler
---------
Co-authored-by: qinghongzeng <qinghongzeng@deeproute.ai >
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com >
2025-07-30 22:11:04 -04:00
e093b4f691
Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation ( #2448 )
2025-07-30 22:09:55 -04:00
664c4f7b3e
Update CUTLASS version to 4.1
2025-07-26 20:11:04 -04:00
0e026982ce
Example 77 add blackwell fmha bwd for MLA shape ( #2466 )
...
* Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
* bug fix & use existing value rather than passing one more argument to support different dim in bwd_convert
* Fix causal mask cnt when IsQBegin==false
* bug fix in causal mask backward
* code sync
---------
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
2025-07-24 18:41:11 -04:00
9a9a579714
Merge pull request #2489 from NVIDIA/update_workflow_script
...
Support "CuTe DSL" auto-labeling in workflow
2025-07-23 15:33:43 +08:00
51d730b8be
Support "CuTe DSL" auto-labeling in workflow
2025-07-23 00:28:01 -07:00
6c0c8b7484
1. Update bug/feature report template to add component selection. ( #2485 )
...
2. Add workflow to apply component label automatically
2025-07-22 12:38:03 -04:00
e51efbfe18
Update CHANGELOG.md
v4.1.0
2025-07-21 22:09:56 -04:00
fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
9baa06dd57
Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 ( #2472 )
2025-07-18 01:27:48 -04:00
ebe98c549a
cache procedural_name in GemmOperation ( #2317 )
2025-07-16 22:25:02 -04:00
9892624b66
Fix typos in the text ( #2417 )
2025-07-16 21:51:12 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
b995f93317
4.0 doc change ( #2425 )
v4.0.0
2025-06-27 09:35:06 -04:00
889ff20648
v4.0 update v2. ( #2420 )
...
* Ex77 forward kernel fix.
2025-06-25 12:56:25 -04:00
dc4817921e
v4.0 update. ( #2398 )
...
* Ex77 fix.
2025-06-12 09:10:29 -04:00
5c6bca0441
Update requirements.txt ( #2390 )
...
Remove the dev suffix in the wheel version
2025-06-10 02:31:49 -04:00
c2ad7c5b20
fix link in readme ( #2379 )
2025-06-07 07:38:38 -04:00
cc23f6d1e9
fix link ( #2377 )
2025-06-07 06:00:39 -04:00
5a287538c2
"Update CHANGELOG for 4.0 tagging" ( #2374 )
2025-06-06 10:07:36 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
2e2af190bd
Revert "[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )" ( #2370 )
...
This reverts commit f12b1d75c9 .
2025-06-05 23:14:57 -04:00
f12b1d75c9
[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )
2025-06-05 18:39:46 -04:00
b244379d9b
Merge pull request #2359 from NVIDIA/oss_ci
...
Initial Workflow Definition for blossom-ci support on CUTLASS GitHub
2025-06-03 14:04:35 -07:00
9d165a3b8e
Handle get_masked_trip_count for small length in fmha example ( #2292 )
...
* handle get_masked_trip_count for small length
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
---------
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com >
2025-05-30 22:51:18 -04:00
b9b110a9ea
Correct divmod order in example 77 (blackwell fmha) ( #2291 )
...
* correct divmod naming
* order bidh/bidb
2025-05-30 22:50:40 -04:00
8206e7a0f5
Pre-compile in CuteDsl/ampere/elementwise_apply.py ( #2340 )
2025-05-28 10:24:39 -04:00
6316b6f867
Fix typos ( #2311 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-05-23 08:30:10 -04:00
9354bfd7c1
Keep the documentation consistent with the sgemm_1.cu code. ( #2285 )
...
* Keep the documentation consistent with the sgemm_1.cu code.
* fix typo
---------
Co-authored-by: zky <zky@126.com >
2025-05-19 22:53:15 -04:00
5e9b8e2a25
fix docx ( #2290 )
...
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn >
2025-05-19 22:52:37 -04:00
1ec230c4bf
Fix typo ( #2299 )
...
Needs == for pip to parse the file
2025-05-15 09:38:42 -04:00
f89cd95b16
Update elementwise_add.ipynb ( #2298 )
2025-05-15 09:38:27 -04:00
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
ad7b2f5e84
3.9.2 doc/version ( #2279 )
...
* 3.9.2 doc/version
* whitespace
v3.9.2
2025-05-04 00:00:15 -04:00
40f124ef27
[CUTLASS] Add GNA to PUBLICATIONS.md ( #2276 )
...
Adds "Generalized Neighborhood Attention" to list of publications using
CUTLASS.
https://arxiv.org/abs/2504.16922
Co-authored-by: Ali Hassani <ahassani@nvidia.com >
2025-05-02 16:57:19 -04:00