cutlass

Author	SHA1	Message	Date
reed	092f14db05	fix tile_size_mnk compilation warning (#1294 )	2024-01-29 21:21:15 -05:00
Haicheng Wu	9385141f19	Update PUBLICATIONS.md ptq paper from goog	2024-01-19 14:17:55 -05:00
Haicheng Wu	b4b5b11070	Update PUBLICATIONS.md add odyssey llm paper from metuan	2024-01-18 10:30:21 -05:00
jayhshah	139b93db61	update publications (#1308 )	2024-01-17 14:06:46 -05:00
Aleksandar Samardžić	ca37d632c9	Remove sparse GEMM with row broadcasted bias vector (#1302 ) This reverts commit `d3e72719b4`. Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>	2024-01-17 14:06:27 -05:00
Chengquan Jiang	362abbf274	Support ElementD to be void for tma (#1153 ) * Support void D with AuxStore * refine get_element_aux	2024-01-16 18:15:42 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306) (tag: v3.4.0)	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305 )	2024-01-16 13:42:51 -05:00
Tianao Ge	acba5beee5	Fix flops calculation and tensor b stride calculation in the example 36 (#1278 ) * Fix flops calculation and tensor b stride calculation in the example 36 * Fix datatype * Update gather_scatter_fusion.cu	2024-01-08 17:27:30 -05:00
Eugene Zhulenev	74d1f3e63a	Fix cute::array<T, 0> iterator (#1273 )	2024-01-08 17:10:09 -05:00
Kun Wu	8ac2edc810	expose stream API in python kernel call interfaces (#1287 ) * expose stream API in python kernel call interfaces * add stream to ReductionArguments; document stream arg * add stream argument to GemmGroupedArguments	2024-01-05 08:27:45 -05:00
Ali Hassani	d4be5ab5d7	Allow per-column bias in EpilogueTensorBroadcast (#1275 ) * Allow per-column bias in EpilogueTensorBroadcast EpilogueTensorBroadcast only supports per-row vector broadcast, because the bias stride is hardcoded. It can easily support both if the bias stride is made conditional, and the original behavior is maintained by defaulting to per-row. * Add unit test for EpilogueTensorBroadcast with per-col bias --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Ali Hassani <ali@hippoml.com>	2024-01-04 12:48:31 -05:00
Jee Li	c9591a694d	fix typo (#1279 )	2024-01-04 12:41:39 -05:00
Aleksandar Samardžić	5c756eb774	Add support for sparse GEMM with visitor epilogue (#1189 ) * Add support for sparse GEMM with visitor epilogue * Refactor changes at the kernel level	2024-01-04 12:38:11 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286 ) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Christian Sigg	b7508e3379	Fix inline ptx escaping for predicates. (#1264 ) * Fix inline ptx escaping for predicates. Prevents `error: invalid % escape in inline assembly string` when compiling with clang. * More double-quoting.	2023-12-14 11:16:15 -05:00
Gregory Meyer (gregjm)	f60786b536	Remove undefined behavior from default constructor of PredicatedTileAccessIteratorParams. (#1258 ) Currently, the default constructor of `PredicatedTileAccessIteratorParams` will invoke undefined behavior in its invocation of the `initialize` function. Specifically, it will attempt to read from the uninitialized variables `desc.element_size_bits` and `desc.advance_rank`. This commit changes the default constructors of both `Params` and `Desc` to zero-initialize all uninitialized members.	2023-12-11 23:01:53 -05:00
Andrey Portnoy	30ec1a4649	Use size_t index to iterate up to std::vector::size() (#1251 ) Fixes a different signedness compare warning.	2023-12-09 08:44:31 -05:00
Christian Sigg	e1483d5fa0	Collection of changes to fix clang build. (#1200 ) * Remove unused variables * Qualify calls to make_fragment_? from templated base class. Fixes clang build error. * Add missing `#include <cstdio>` * Various changes to fix clang compile errors. * More changes to fix clang build. Remaining issues: - `params` initializer of `CollectiveEpilogue`. - `ops` initializer of `Sm90VisitorImplBase`. - `__usAtomicCAS` needs to be added to clang upstream. * Fix remaining clang build issues. * Qualify `cute::rank()` calls. * Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace. * Double-escape special registers in inline asm. * small change --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-08 14:42:12 -05:00
Ali Hassani	f4a0216601	Fix bug in single source GEMM with residual + streamk (#1249 ) Followup to #1224. A change in the stream-k threadblock swizzle ctor since 3.3 breaks single source GEMM with fused epilogue and stream-k. Multi-source was already corrected. Co-authored-by: Ali Hassani <ahassanijr@gmail.com>	2023-12-07 11:12:02 -05:00
Valeriy Fedyunin	f188f9b709	Fix typo in quickstart.md (#1257 )	2023-12-07 09:49:52 -05:00
Haicheng Wu	9c9b51d35c	Update PUBLICATIONS.md	2023-12-07 00:02:36 -05:00
Ali Hassani	a75b4ac483	Fix Stream-K reduce bug in epilogue with broadcast (#1224) Co-authored-by: Ali Hassani <ahassanijr@gmail.com> (tag: v3.3.0)	2023-12-05 15:35:41 -05:00
Pradeep Ramani	e9e30c2304	Updates and Bug fixes to CUTLASS 3.3 (#1232 )	2023-12-05 09:50:49 -05:00
Haicheng Wu	4a1709e17e	Fixed illegal PTX syntax (#1225 )	2023-12-01 12:29:48 -05:00
Christian Sigg	bef1fbcbe6	Add missing `#include <cstdio>` (#1197 ) * Add missing `#include <cstdio>` * move to non nvrtc part --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-01 11:58:53 -05:00
Christian Sigg	2375a07d01	Qualify calls to make_fragment_? from templated base class. (#1196 ) Fixes clang build error.	2023-12-01 09:52:57 -05:00
Christian Sigg	60c8251b72	Remove unused variables (#1195 )	2023-12-01 09:52:19 -05:00
cyyever	10b850f9c7	Fix some sign conversion warnings (#1172 ) * Fix sign conversion warnings * Fix type conversion warnings * Fix sign conversion warnings * Change smem_size_ to constexpr * clang warnings * undo cast change * one miss change * missing part --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-11-30 00:28:40 -05:00
Christian Sigg	99c4eebe3b	Explicitly cast `blockIdx` to `uint3` (#1192 ) This works around a clang issue where blockIdx is of a different type.	2023-11-30 00:26:23 -05:00
Christian Sigg	a759e85f5f	Add subclass declarations to generated files. (#1193 )	2023-11-30 00:25:40 -05:00
Christian Sigg	56fc3df03b	Adding missing `typename` (#1191 ) Fixes clang build failures.	2023-11-29 00:20:20 -05:00
reed	eb01d5449d	fix cp.async L2 prefetch typo (#1187 )	2023-11-28 16:58:04 -05:00
Jack Kosaian	8098336d51	Updates to Python interface for PyPI packaging (#1209 ) * Updates * Updates to notebooks	2023-11-28 13:52:12 -05:00
Sergey Klevtsov	b5d8a5d9cc	Allow SM90 pingpong kernel to use custom tile schedulers (#1194 ) Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>	2023-11-15 13:45:17 -05:00
reed	6e60b9b17c	enable L2::128B prefetch for cp.async by default (#1177 )	2023-11-13 13:30:13 -05:00
Changho Hwang	1ab6cc7b68	Fix `std::abs` overloading for `bfloat16_t` (#1179 )	2023-11-13 13:29:45 -05:00
Manish Gupta	5ae8133cfa	Doc only change changelog 3.3 (#1180 )	2023-11-13 13:29:22 -05:00
reed	39c6a83f23	fix missing return warning (#1173 )	2023-11-03 22:42:59 -04:00
wang-y-z	1d7f2a207e	Fix several broken links (#1168 ) Co-authored-by: isaacw <isaacw@nvidia.com>	2023-11-03 00:01:25 -04:00
wang-y-z	557be3ab0e	Fix several typos (#1169 ) Co-authored-by: isaacw <isaacw@nvidia.com>	2023-11-02 23:54:46 -04:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167 ) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00
reed	922fb5108b	clean the format (#1140 )	2023-10-24 22:59:06 -04:00
cyyever	7a7796afae	Fix is_zero (#1147 ) * Fix is_zero * Use constexpr * Add CUTLASS_PRAGMA_UNROLL to loops * Avoid if branches in is_zero	2023-10-23 12:09:37 -04:00
milesvant	fb10fa5308	Fix broken pipeline link in docs (#1143 )	2023-10-18 12:55:46 -04:00
Haicheng Wu	5e1a0a5adb	fix alignmentC for h16816_s8xf16 (#1146 ) * fix alignmentC for h16816_s8xf16 * manish's change	2023-10-17 15:15:39 -04:00
Manish Gupta	757275f279	Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132 ) * Adding more tiles in the cutlass_library for mixed-input support. * fix rebase issue * more tiles to upcast a	2023-10-13 11:33:15 -04:00
reed	fa8dfe631f	fix missing return warning for repeat and axpby (#1124 )	2023-10-12 00:05:45 -04:00
Jake Hemstad	112590114d	Add config.yml issue template with Discord link. (#1135 )	2023-10-10 12:13:04 -04:00
Manish Gupta	ff02da2667	Fx parallel split-k (#1116 )	2023-10-06 12:02:40 -04:00


480 Commits