64579189ec
Feature/add bottom causal mask ( #2480 )
...
* Rebase to latest
* update
* upd
* Update fmha_fusion.hpp
* Update fmha_fusion.hpp
fixed flipped logic for isQBegin
* Update fmha_fusion.hpp
* Avoid use of booleans
The current expression is confusing
* fmt
* Update fmha_fusion.hpp
Reproduce error/fix with:
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend
* add test, format
---------
Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
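The entry above fixes flipped isQBegin logic for the bottom ("qend") causal mask, and its repro command uses --q=1013 --k=1024, i.e. q_len != k_len, which is exactly where the two alignments diverge. A minimal sketch of the intended semantics (plain Python, illustrative only, not the CUTLASS kernel code; the is_q_begin flag mirrors the commit's isQBegin):

```python
def causal_allowed(q_idx, k_idx, q_len, k_len, is_q_begin):
    """Return True if query q_idx may attend to key k_idx under a causal mask."""
    if is_q_begin:
        # Top-left alignment: the diagonal starts at (0, 0).
        return k_idx <= q_idx
    # Bottom ("qend") alignment: the last query attends to the last key,
    # so the diagonal is shifted right by (k_len - q_len).
    return k_idx <= q_idx + (k_len - q_len)
```

With q=1013, k=1024 the offset is 11, so flipping the flag changes which 11 keys each query can see, which is why the repro needs unequal sequence lengths to surface the bug.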
74825181f2
Remove old-version dsl examples. ( #2644 )
2025-09-17 22:23:30 -04:00
6a35b4d22f
v4.2 tag release. ( #2638 )
2025-09-15 12:21:53 -04:00
56f0718a97
ex77 backwards GQA ( #2556 )
...
* bwd GQA init
* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu
* ref kernel type conversion fix
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00
b6ccf34aef
Fix Copy_Atom type mismatch in sgemm_sm80.cu ( #2582 )
2025-09-04 16:56:17 -07:00
9ca7e877b2
fix gqa issue for blackwell fmha.py ( #2599 )
2025-08-28 11:15:20 -04:00
a49a78ffef
v4.2 release. ( #2587 )
...
* Fix default cluster callback values to 1 to avoid profiler failures when these values are not set on the command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
11cad1f67b
fix a typo. ( #2561 )
2025-08-19 22:23:09 -04:00
19772cd63e
Fix typo in smem_allocator.py ( #2517 )
2025-08-10 22:44:22 -04:00
a267d47f9b
Update batched_gemm.cu ( #2538 )
2025-08-10 22:42:21 -04:00
da47886e34
Fix example bug ( #2351 )
2025-07-30 22:12:33 -04:00
84a27b3926
fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu GridDim miscalculated ( #2492 )
...
* fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu Launch dimGrid error
* feat: add cta tiler
* Update examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu
use cluster_layout_vmnk instead of cta_tiler
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
* feat: remove cta_tiler
---------
Co-authored-by: qinghongzeng <qinghongzeng@deeproute.ai>
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-07-30 22:11:04 -04:00
e093b4f691
Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation ( #2448 )
2025-07-30 22:09:55 -04:00
0e026982ce
Example 77 add blackwell fmha bwd for MLA shape ( #2466 )
...
* Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
* bug fix & use an existing value rather than passing one more argument to support a different dim in bwd_convert
* Fix causal mask count when IsQBegin==false
* bug fix in causal mask backward
* code sync
---------
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-07-24 18:41:11 -04:00
fd6cfe1ed0
v4.1 release update v2. ( #2481 )
2025-07-21 22:03:55 -04:00
9baa06dd57
Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 ( #2472 )
2025-07-18 01:27:48 -04:00
a1aaf2300a
v4.1 release
2025-07-03 08:07:53 -04:00
889ff20648
v4.0 update v2. ( #2420 )
...
* Ex77 forward kernel fix.
2025-06-25 12:56:25 -04:00
dc4817921e
v4.0 update. ( #2398 )
...
* Ex77 fix.
2025-06-12 09:10:29 -04:00
8bdbfca682
v4.0 update. ( #2371 )
2025-06-06 02:39:20 -04:00
2e2af190bd
Revert "[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )" ( #2370 )
...
This reverts commit f12b1d75c9.
2025-06-05 23:14:57 -04:00
f12b1d75c9
[ex77] fix mla split; add fwd lse; add bwd varlen ( #2366 )
2025-06-05 18:39:46 -04:00
9d165a3b8e
Handle get_masked_trip_count for small length in fmha example ( #2292 )
...
* handle get_masked_trip_count for small length
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
---------
Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-05-30 22:51:18 -04:00
b9b110a9ea
Correct divmod order in example 77 (blackwell fmha) ( #2291 )
...
* correct divmod naming
* order bidh/bidb
2025-05-30 22:50:40 -04:00
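The divmod-order fix above ("correct divmod naming", "order bidh/bidb") concerns decomposing a flat block index into batch and head coordinates; swapping quotient and remainder silently scrambles the mapping whenever batch size and head count differ. A hedged sketch of the idea (the names bidb/bidh follow the commit message; the actual kernel's index layout may differ):

```python
def split_block_index(bid, num_heads):
    # divmod returns (quotient, remainder). With batch varying slowest and
    # head fastest, the quotient is the batch index (bidb) and the
    # remainder is the head index (bidh); reversing them mis-assigns both.
    bidb, bidh = divmod(bid, num_heads)
    return bidb, bidh
```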
8206e7a0f5
Pre-compile in CuteDsl/ampere/elementwise_apply.py ( #2340 )
2025-05-28 10:24:39 -04:00
f89cd95b16
Update elementwise_add.ipynb ( #2298 )
2025-05-15 09:38:27 -04:00
f115c3f854
Release v4.0.0 ( #2294 )
2025-05-13 15:55:29 -04:00
331a1f5b3f
cutlass 3.9 update ( #2255 )
...
* cutlass 3.9 update
* rebase
* fix out-of-shared-memory error for blockwise Blackwell
* doc format
* fix issue 2253
* disable host ref by default
* fix sm120 smem capacity
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
b3f3c7758c
Update tile_iterator.cu ( #2204 )
...
Some typos in comments
2025-04-10 14:49:58 -04:00
79fc51f4b8
v3.9 update ( #2213 )
...
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-03 02:10:16 -04:00
6f4921858b
v3.9 update ( #2203 )
...
* v3.9 update
* voidD
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
62750a2b75
v3.9 ( #2185 )
...
* v3.8 update x
* fix blackwell gg
* doc change
* doc change
* doc change
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
df18f5e4f5
Improvements for: Groupwise scaling along M for FP8 gemm ( #2095 )
...
* fix blockwise fp8 kernels
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* wip, < 128 not working
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* fix < 128
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* reduce diff
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* review comments
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* support partial n blocks
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
* fix build errors
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-27 22:39:29 -05:00
ca4fdbea70
Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper ( #2139 )
...
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00
b84e9802d8
update 3.8 v2 ( #2112 )
...
* update 3.8 v2
* update 3.8
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
833f6990e0
v3.8.0 update ( #2082 )
...
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
6f55278121
bugfix generic-k code in top-k with softmax ( #1993 )
...
* bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f
Groupwise scaling along M for FP8 gemm ( #2037 )
...
* FP8 groupwise scaling along M
* small updates
---------
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
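Groupwise scaling along M (the #2037 entry above and its follow-up #2095 further up) quantizes an FP8 operand with one scale per group of rows instead of one scale per tensor; #2095's "fix < 128" and "support partial n blocks" items deal with groups that do not divide the dimension evenly. A hedged NumPy sketch of the scaling scheme, assuming a group size of 128 and the E4M3 max of 448 (illustrative only, not CUTLASS API code):

```python
import numpy as np

def groupwise_scales_along_m(A, group_m=128, fp8_max=448.0):
    """One scale per group of group_m rows; the last group may be partial."""
    M = A.shape[0]
    num_groups = (M + group_m - 1) // group_m  # ceil-div covers a partial tail group
    scales = np.empty(num_groups)
    for g in range(num_groups):
        blk = A[g * group_m:(g + 1) * group_m]  # slicing clamps the partial group
        amax = np.abs(blk).max()
        scales[g] = amax / fp8_max if amax > 0 else 1.0
    return scales
```

Dividing each row group by its scale before the FP8 cast, and multiplying it back in the epilogue, keeps per-group dynamic range without paying for per-element scales.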
389e493055
CUTLASS 3.8 Release ( #2059 )
...
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
This reverts commit b353e36fe8.
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
b78588d163
CUTLASS 3.7 ( #2045 )
...
* CUTLASS 3.7
* clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
ef5620dd1d
Blockwise Scaling for FP8 ( #1932 )
...
* F8 Blockwise Scaling
* two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
52b35e90ce
Fix Typos ( #2021 )
...
* Fix Typo
* Fix Typo
2025-01-08 23:46:28 -05:00
3d261a5974
3.6.0 update ( #2005 )
...
* 3.6.0 update
* doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
4c42f73fda
Improve mixed dtype GEMM ( #1972 )
...
* update
* fix a typo
2024-12-06 13:33:22 -05:00
8aa95dbb88
Fix the race condition in mixed-input GEMM when writing the registers ( #1931 )
...
* move two warpgroup_wait
* merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
08101d9d0c
Improve sm90 mixed dtype kernel ( #1883 )
2024-10-17 20:06:38 -04:00
cc3c29a81a
CUTLASS 3.6.0 ( #1850 )
...
* v3.6
* update changelog
* update readme
* fix typo
* fixing typos
* hopper gemm with weight prefetch
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
dbdae514e0
Add TMA epilogue support for Group GEMM and add pingpong ptr-array & Group GEMM ( #1795 )
2024-09-11 00:07:31 -04:00
be60a0b272
CUTLASS 3.5.1 ( #1623 )
...
* CUTLASS 3.5.1
* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
843adf0408
Fix SMEM index for C in CuTe examples ( #1477 )
2024-07-10 11:14:15 -04:00