Commit Graph

151 Commits

Author SHA1 Message Date
64579189ec Feature/add bottom causal mask (#2480)
* Rebase to latest

* update

* upd

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Update fmha_fusion.hpp

* Update fmha_fusion.hpp

fixed flipped logic for isQBegin

* Update fmha_fusion.hpp

* Avoid use of booleans

The current expression is confusing

* fmt

* Update fmha_fusion.hpp

Reproduce error/fix with: 
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend

* add test, format

---------

Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
74825181f2 Remove old-version dsl examples. (#2644) 2025-09-17 22:23:30 -04:00
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
56f0718a97 ex77 backwards GQA (#2556)
* bwd GQA init

* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu

* ref kernel type conversion fix

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00
b6ccf34aef Fix Copy_Atom type mismatch in sgemm_sm80.cu (#2582) 2025-09-04 16:56:17 -07:00
9ca7e877b2 fix gqa issue for blackwell fmha.py (#2599) 2025-08-28 11:15:20 -04:00
a49a78ffef v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.

* v4.2 release.
2025-08-22 18:11:24 -04:00
11cad1f67b fix a typo. (#2561) 2025-08-19 22:23:09 -04:00
19772cd63e Fix typo in smem_allocator.py (#2517) 2025-08-10 22:44:22 -04:00
a267d47f9b Update batched_gemm.cu (#2538) 2025-08-10 22:42:21 -04:00
da47886e34 Fix example bug (#2351) 2025-07-30 22:12:33 -04:00
84a27b3926 fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu GridDim miscalculated (#2492)
* fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu Launch dimGrid error

* feat: add cta tiler

* Update examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu

use cluster_layout_vmnk instead of cta_tiler

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>

* feat: remove cta_tiler

---------

Co-authored-by: qinghongzeng <qinghongzeng@deeproute.ai>
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-07-30 22:11:04 -04:00
e093b4f691 Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation (#2448) 2025-07-30 22:09:55 -04:00
0e026982ce Example 77 add blackwell fmha bwd for MLA shape (#2466)
* Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

* bug fix & use existing value rather than pass one more argument to support different dim in bwd_convert

* Fix casual mask cnt when IsQBegin==false

* bug fix in casual mask backward

* code sync

---------

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-07-24 18:41:11 -04:00
fd6cfe1ed0 v4.1 release update v2. (#2481) 2025-07-21 22:03:55 -04:00
9baa06dd57 Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 (#2472) 2025-07-18 01:27:48 -04:00
a1aaf2300a v4.1 release 2025-07-03 08:07:53 -04:00
889ff20648 v4.0 update v2. (#2420)
* Ex77 forward kernel fix.
2025-06-25 12:56:25 -04:00
dc4817921e v4.0 update. (#2398)
* Ex77 fix.
2025-06-12 09:10:29 -04:00
8bdbfca682 v4.0 update. (#2371) 2025-06-06 02:39:20 -04:00
2e2af190bd Revert "[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)" (#2370)
This reverts commit f12b1d75c9.
2025-06-05 23:14:57 -04:00
f12b1d75c9 [ex77] fix mla split; add fwd lse; add bwd varlen (#2366) 2025-06-05 18:39:46 -04:00
9d165a3b8e Handle get_masked_trip_count for small length in fmha example (#2292)
* handle get_masked_trip_count for small length

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

---------

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-05-30 22:51:18 -04:00
b9b110a9ea Correct divmod order in example 77 (blackwell fmha) (#2291)
* correct divmod naming

* order bidh/bidb
2025-05-30 22:50:40 -04:00
8206e7a0f5 Pre-compile in CuteDsl/ampere/elementwise_apply.py (#2340) 2025-05-28 10:24:39 -04:00
f89cd95b16 Update elementwise_add.ipynb (#2298) 2025-05-15 09:38:27 -04:00
f115c3f854 Release v4.0.0 (#2294) 2025-05-13 15:55:29 -04:00
331a1f5b3f cutlass 3.9 update (#2255)
* cutlass 3.9 update

* rebase

* fixes out of shared memory for blockwise Blackwell

* doc format

* fix issue 2253

* disable host ref by default

* fix sm120 smem capacity

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
b3f3c7758c Update tile_iterator.cu (#2204)
Some typos in comments
2025-04-10 14:49:58 -04:00
79fc51f4b8 v3.9 update (#2213)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-03 02:10:16 -04:00
6f4921858b v3.9 update (#2203)
* v3.9 update

* voidD

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
62750a2b75 v3.9 (#2185)
* v3.8 update x

* fix blackwell gg

* doc change

* doc change

* doc change

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
df18f5e4f5 Improvements for: Groupwise scaling along M for FP8 gemm (#2095)
* fix blockwise fp8 kernels

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* wip, < 128 not working

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix < 128

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* reduce diff

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* review comments

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* support partial n blocks

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix build errors

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-27 22:39:29 -05:00
ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139)
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.

Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
6f55278121 bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
ef5620dd1d Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
52b35e90ce Fix Typos (#2021)
* Fix Typo

* Fix Typo
2025-01-08 23:46:28 -05:00
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
4c42f73fda Improve mixed dtype GEMM (#1972)
* update

* fix a typo
2024-12-06 13:33:22 -05:00
8aa95dbb88 Fix the racing condition of mixed-input gemm when writing the registers (#1931)
* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
08101d9d0c Improve sm90 mixed dtype kernel (#1883) 2024-10-17 20:06:38 -04:00
cc3c29a81a CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
dbdae514e0 Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795) 2024-09-11 00:07:31 -04:00
be60a0b272 CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1

* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
843adf0408 Fix SMEM index for C in CuTe examples (#1477) 2024-07-10 11:14:15 -04:00