cutlass

Author	SHA1	Message	Date
kernyan	e093b4f691	Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation (#2448)	2025-07-30 22:09:55 -04:00
Haicheng Wu	664c4f7b3e	Update CUTLASS version to 4.1	2025-07-26 20:11:04 -04:00
Zeyu WANG	0e026982ce	Example 77 add blackwell fmha bwd for MLA shape (#2466) * Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp Co-authored-by: Vijay Thakkar <vijaythakkar@me.com> * bug fix & use existing value rather than pass one more argument to support different dim in bwd_convert * Fix casual mask cnt when IsQBegin==false * bug fix in casual mask backward * code sync --------- Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>	2025-07-24 18:41:11 -04:00
Larry Wu	9a9a579714	Merge pull request #2489 from NVIDIA/update_workflow_script Support "CuTe DSL" auto-labeling in workflow	2025-07-23 15:33:43 +08:00
Larry Wu	51d730b8be	Support "CuTe DSL" auto-labeling in workflow	2025-07-23 00:28:01 -07:00
Larry Wu	6c0c8b7484	1. Update bug/feature report template to add component selection. (#2485) 2. Add workflow to apply component label automatically	2025-07-22 12:38:03 -04:00
Haicheng Wu	e51efbfe18	Update CHANGELOG.md v4.1.0	2025-07-21 22:09:56 -04:00
Junkai-Wu	fd6cfe1ed0	v4.1 release update v2. (#2481)	2025-07-21 22:03:55 -04:00
zhang	9baa06dd57	Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 (#2472)	2025-07-18 01:27:48 -04:00
Colin Peppler	ebe98c549a	cache procedural_name in GemmOperation (#2317)	2025-07-16 22:25:02 -04:00
Oleksandr Pavlyk	9892624b66	Fix typos in the text (#2417)	2025-07-16 21:51:12 -04:00
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
Haicheng Wu	b995f93317	4.0 doc change (#2425) v4.0.0	2025-06-27 09:35:06 -04:00
Junkai-Wu	889ff20648	v4.0 update v2. (#2420) * Ex77 forward kernel fix.	2025-06-25 12:56:25 -04:00
Junkai-Wu	dc4817921e	v4.0 update. (#2398) * Ex77 fix.	2025-06-12 09:10:29 -04:00
brandonsun	5c6bca0441	Update requirements.txt (#2390) Remove the dev suffix in the wheel version	2025-06-10 02:31:49 -04:00
drazi	c2ad7c5b20	fix link in readme (#2379)	2025-06-07 07:38:38 -04:00
drazi	cc23f6d1e9	fix link (#2377)	2025-06-07 06:00:39 -04:00
Vijay Thakkar	5a287538c2	"Update CHANGELOG for 4.0 tagging" (#2374)	2025-06-06 10:07:36 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371)	2025-06-06 02:39:20 -04:00
Manish Gupta	2e2af190bd	Revert "[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)" (#2370) This reverts commit `f12b1d75c9`.	2025-06-05 23:14:57 -04:00
Markus Hoehnerbach	f12b1d75c9	[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)	2025-06-05 18:39:46 -04:00
zekunf-nv	b244379d9b	Merge pull request #2359 from NVIDIA/oss_ci Initial Workflow Definition for blossom-ci support on CUTLASS GitHub	2025-06-03 14:04:35 -07:00
Taebum Kim	9d165a3b8e	Handle get_masked_trip_count for small length in fmha example (#2292) * handle get_masked_trip_count for small length * Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp Co-authored-by: Vijay Thakkar <vijaythakkar@me.com> * Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp Co-authored-by: Vijay Thakkar <vijaythakkar@me.com> --------- Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>	2025-05-30 22:51:18 -04:00
Taebum Kim	b9b110a9ea	Correct divmod order in example 77 (blackwell fmha) (#2291) * correct divmod naming * order bidh/bidb	2025-05-30 22:50:40 -04:00
Gabriel Wu	8206e7a0f5	Pre-compile in CuteDsl/ampere/elementwise_apply.py (#2340)	2025-05-28 10:24:39 -04:00
co63oc	6316b6f867	Fix typos (#2311) Signed-off-by: co63oc <co63oc@users.noreply.github.com>	2025-05-23 08:30:10 -04:00
zkyue	9354bfd7c1	Keep the documentation consistent with the sgemm_1.cu code. (#2285) * Keep the documentation consistent with the sgemm_1.cu code. * fix typo --------- Co-authored-by: zky <zky@126.com>	2025-05-19 22:53:15 -04:00
1096125073	5e9b8e2a25	fix docx (#2290) Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn>	2025-05-19 22:52:37 -04:00
Ruyman	1ec230c4bf	Fix typo (#2299) Needs == for pip to parse the file	2025-05-15 09:38:42 -04:00
Driss Guessous	f89cd95b16	Update elementwise_add.ipynb (#2298)	2025-05-15 09:38:27 -04:00
Kihiro Bando	f115c3f854	Release v4.0.0 (#2294)	2025-05-13 15:55:29 -04:00
Haicheng Wu	ad7b2f5e84	3.9.2 doc/version (#2279) * 3.9.2 doc/version * whitespace v3.9.2	2025-05-04 00:00:15 -04:00
Ali Hassani	40f124ef27	[CUTLASS] Add GNA to PUBLICATIONS.md (#2276) Adds "Generalized Neighborhood Attention" to list of publications using CUTLASS. https://arxiv.org/abs/2504.16922 Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2025-05-02 16:57:19 -04:00
Jiazhen Han	89f6bf2739	Fix group scale gemm when K==128 (#2275) Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>	2025-05-02 15:41:18 -04:00
Haicheng Wu	f535c33634	3.9.1 doc/version change (#2273) v3.9.1	2025-05-01 00:27:00 -04:00
Michael Lazos	e3cb8a773a	Import cuda, cudart, nvrtc lazily (#2251) * Lazy cuda import * More lazy cuda import * More lazy cuda imports * minor fixes --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-04-30 23:10:33 -04:00
Michael Lazos	c4bdfe821c	Lazy scipy import (#2250)	2025-04-30 16:10:00 -04:00
Michael Lazos	b3ce7e12b7	Make cc a positional argument (#2249)	2025-04-30 16:09:25 -04:00
Michael Lazos	fe75ead92e	Import pydot lazily (#2248)	2025-04-30 16:08:17 -04:00
Ruoxi	35136f5564	Fix wrong detection of python version for `use_rmm`. (#2224)	2025-04-30 15:29:33 -04:00
Qi Yuhang	e5b810bed1	Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256) Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>	2025-04-30 15:28:05 -04:00
Lain	2b78c2fe31	cherry-pick feature/hopper-blockwise-generalization-optimization (#2270)	2025-04-29 16:47:22 -04:00
Haicheng Wu	697126019e	fix blackwell grouped groupwise hang (#2267)	2025-04-29 11:54:20 -04:00
Haicheng Wu	e94e888df3	Update CHANGELOG.md v3.9.0	2025-04-24 21:51:34 -04:00
Haicheng Wu	be73ad20a5	Update CHANGELOG.md for 3.9	2025-04-24 16:54:06 -04:00
Haicheng Wu	f02a7c2976	Update README.md for 3.9	2025-04-24 16:51:45 -04:00
Yujia Zhai	331a1f5b3f	cutlass 3.9 update (#2255) * cutlass 3.9 update * rebase * fixes out of shared memory for blockwise Blackwell * doc format * fix issue 2253 * disable host ref by default * fix sm120 smem capacity --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-04-24 15:42:40 -04:00
吴坎	8e345c5c5b	fix_missing_stdint (#2199) * Update config.hpp * Update config.hpp * Update config.hpp	2025-04-23 22:21:22 -04:00
Tri Dao	81a43e6d92	Set EpiTile correctly when TileN is not divisible by 32 (#2220) If TileN is not divisible by 32 (e.g, 208), by default EpiTile would be set to 128 x 32, which does not compile as TileN is required to divide EpiTileN	2025-04-21 00:02:51 -04:00


675 Commits