cutlass

Author	SHA1	Message	Date
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371 )	2025-06-06 02:39:20 -04:00
Kihiro Bando	f115c3f854	Release v4.0.0 (#2294 )	2025-05-13 15:55:29 -04:00
Yujia Zhai	331a1f5b3f	cutlass 3.9 update (#2255 ) * cutlass 3.9 update * rebase * fixes out of shared memory for blockwise Blackwell * doc format * fix issue 2253 * disable host ref by default * fix sm120 smem capacity --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-04-24 15:42:40 -04:00
kf-zhang	19cc2a5feb	add support for sm89 in cute and the unit tests (#2177 ) * add support for sm89 in cute and the unit tests * rebase v3.9 and format code * minor fix --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-04-10 14:16:36 -04:00
Yujia Zhai	79fc51f4b8	v3.9 update (#2213 ) Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-04-03 02:10:16 -04:00
Yujia Zhai	6f4921858b	v3.9 update (#2203 ) * v3.9 update * voidD --------- Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-04-02 15:11:18 -04:00
Yujia Zhai	62750a2b75	v3.9 (#2185 ) * v3.8 update x * fix blackwell gg * doc change * doc change * doc change --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2025-03-21 01:52:23 -04:00
Tyler Michael Smith	8c4d1dc47d	Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp (#2110 ) * Treat negative zero as zero in the sparse gemm compressor Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * format Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Apply patch Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * sm90_sparse_gemm_compressor.hpp * test/unit/transform/CMakeLists.txt * test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp * include/cutlass/numeric_types.h --------- Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2025-03-21 01:44:17 -04:00
Yujia Zhai	b84e9802d8	update 3.8 v2 (#2112 ) * update 3.8 v2 * update 3.8 --------- Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-02-19 22:03:14 -05:00
Yujia Zhai	833f6990e0	v3.8.0 update (#2082 ) * 3.8 update * fix Markus' name --------- Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-02-06 21:33:40 -05:00
mihir-awatramani	389e493055	CUTLASS 3.8 Release (#2059 ) * CUTLASS 3.8 Release * update * Update README.md * Revert "Update README.md" This reverts commit `b353e36fe8`. * update * update --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-25 02:44:06 -05:00
Yujia Zhai	b78588d163	CUTLASS 3.7 (#2045 ) * CUTLASS 3.7 * clean up changelog --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-18 09:53:07 -05:00
Yujia Zhai	3d261a5974	3.6.0 update (#2005 ) * 3.6.0 update * doc and swap stuff --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-12-25 01:34:40 -05:00
Lain	80243e0b8c	add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966 )	2024-12-03 14:03:43 -05:00
侯奇	12626bcfe4	Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569 ) fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`	2024-10-23 12:56:36 -04:00
Xinyu Yang	f3a3bfcbf2	add maximum support (#1833 )	2024-10-23 12:44:56 -04:00
Yujia Zhai	cc3c29a81a	CUTLASS 3.6.0 (#1850 ) * v3.6 * update changelog * update readme * fix typo * fixing typos * hopper gemm with weight prefetch --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-10-09 15:33:27 -04:00
Alexander Zinoviev	477a677317	Fix typos in test/unit/conv/cache_testbed_output.h (#1652 ) Co-authored-by: Alexander Zinoviev <azinoviev@tesla.com>	2024-10-07 12:39:11 -04:00
Junkai-Wu	dbdae514e0	Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795 )	2024-09-11 00:07:31 -04:00
Aleksandar Samardžić	e1976daacc	Add support for mixed 4-bit/8-bit data types GEMM (#1413 ) * Add support for mixed 4-bit/8-bit data types GEMM * fix ( and ) --------- Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-08-29 23:11:06 -04:00
Aleksandar Samardžić	3f084f7f3c	Add couple configs into generator.py for mixed input MM (#1350 ) * Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-08-16 00:59:29 -04:00
Haicheng Wu	8d8cfdf375	update 3.5.1 readme/changelog	2024-08-14 21:12:44 -07:00
Mark Hoemmen	19b4c5e065	Fix isnan namespace qualification in cutlass/functional.h (#1679 ) * Fix unrelated MSVC build warnings * Fix use of isnan in functional.h Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).	2024-08-05 14:28:13 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623 ) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Daniel Richard G	d6580c3dc0	Support use of external/system GTest installation (#1469 ) * Support use of system/external GTest installation * Create working directory for tests explicitly	2024-07-10 11:07:57 -04:00
Alexander Zinoviev	dbfced05e7	Fix typos in convolution tests (#1433 )	2024-07-10 11:00:52 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468 )	2024-04-11 21:33:40 -04:00
reed	19f3cc33f1	Fix uint128 operator add (#1400 ) * fix uint128 operator add for 64-bit hilo implemenation * add uint128 test for operator add * make clang happy --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-02 13:32:18 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411 )	2024-03-19 17:51:04 -04:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346 ) * Updates for CUTLASS 3.4.1 * minor epi change	2024-02-15 15:48:34 -05:00
Aleksandar Samardžić	ca37d632c9	Remove sparse GEMM with row broadcasted bias vector (#1302 ) This reverts commit `d3e72719b4`. Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>	2024-01-17 14:06:27 -05:00
Chengquan Jiang	362abbf274	Support ElementD to be void for tma (#1153 ) * Support void D with AuxStore * refine get_element_aux	2024-01-16 18:15:42 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306 )	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305 )	2024-01-16 13:42:51 -05:00
Ali Hassani	d4be5ab5d7	Allow per-column bias in EpilogueTensorBroadcast (#1275 ) * Allow per-column bias in EpilogueTensorBroadcast EpilogueTensorBroadcast only supports per-row vector broadcast, because the bias stride is hardcoded. It can easily support both if the bias stride is made conditional, and the original behavior is maintained by defaulting to per-row. * Add unit test for EpilogueTensorBroadcast with per-col bias --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Ali Hassani <ali@hippoml.com>	2024-01-04 12:48:31 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286 ) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Christian Sigg	e1483d5fa0	Collection of changes to fix clang build. (#1200 ) * Remove unused variables * Qualify calls to make_fragment_? from templated base class. Fixes clang build error. * Add missing `#include <cstdio>` * Various changes to fix clang compile errors. * More changes to fix clang build. Remaining issues: - `params` initializer of `CollectiveEpilogue`. - `ops` initializer of `Sm90VisitorImplBase`. - `__usAtomicCAS` needs to be added to clang upstream. * Fix remaining clang build issues. * Qualify `cute::rank()` calls. * Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace. * Double-escape special registers in inline asm. * small change --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-08 14:42:12 -05:00
Pradeep Ramani	e9e30c2304	Updates and Bug fixes to CUTLASS 3.3 (#1232 )	2023-12-05 09:50:49 -05:00
Christian Sigg	60c8251b72	Remove unused variables (#1195 )	2023-12-01 09:52:19 -05:00
wang-y-z	557be3ab0e	Fix several typos (#1169 ) Co-authored-by: isaacw <isaacw@nvidia.com>	2023-11-02 23:54:46 -04:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167 ) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00
Manish Gupta	757275f279	Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132 ) * Adding more tiles in the cutlass_library for mixed-input support. * fix rebase issue * more tiles to upcast a	2023-10-13 11:33:15 -04:00
masahi	ff61a49dd1	Allow changing epsilon parameter in RMS norm kernel (#1112 )	2023-10-02 20:40:28 -04:00
Manish Gupta	7d8317a63e	Support for Mixed Input TensorOp (#1084 ) * Passing warp-level mixed input F16(S8/U8) tests passing device-level mixed input F16(S8/U8) tests add to profiler - I8 (111 TFLOPs), U (123 TFLOPs) * fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs) * Speedup reference compilation (REVERT THIS COMMIT) * wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s) * Improve s8->f16 cvt and support bf16u8 @158 TFLOPs BF16 * S8 (142 TFLOPs) * Handle mixed-input upcast on OperandA (Support [S8|U8][F16|BF16] rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast * Add device-level test and profiler support for upcast on operand A * Move shfl before the cvt and reduce #shfls by 1/2 * fix smem_usage calculation for mixed_input types * uncomment the stuff (getting ready for merge) * profiler changes and mixed-input reference * mixed input reference are in a new file * use platform instead of std * comments and typo only * Use CreateGemmOperator and delete CreateMixedInputGemmOperator * copyright for new files * rebase follow-up	2023-09-27 11:18:30 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113 ) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065 )	2023-08-25 23:05:46 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024 ) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
masahi	f679663224	Add RMS norm (#979 )	2023-07-10 21:31:27 -04:00
Jack Kosaian	7dbf423763	Add conversion from ElementBias to ElementCompute (#961 )	2023-05-26 23:08:36 -04:00

95 Commits