cutlass

Author	SHA1	Message	Date
Adnan Akhundov	3c995c7606	Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790 ) * Fix MHA kernel Summary: ATT Test Plan: Reviewers: Subscribers: Tasks: Tags: * Extend DualGemm to support batched mode (#5) Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set. * Decouple LayoutB0 and LayoutB1 in DualGemm The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously. In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above. * Remove comment as no longer relevant * Revert Fix MHA kernel --------- Co-authored-by: mikeiovine <mikeiovine@fb.com>	2023-02-13 15:27:13 -05:00
Shuai Shao	ce8597dc14	Fix type bug in conv2d/gemm with broadcast (#796 ) add ElementVector --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-09 20:53:25 -05:00
dan_the_3rd	2e10404d26	xFormer updates to fMHA FW (#773 ) * xFormer updates to fMHA FW * Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen' * Add missing files * Remove xFormers specific code * Update fused_multihead_attention_fixed_seqlen.cu * rebase and solve conflicts * remove white space --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-08 23:00:10 -05:00
Jack Kosaian	5ff5209ed5	Add acc2smem in epilogue/threadblock/epilogue.h (#806 )	2023-02-06 22:04:16 -05:00
Jack Kosaian	5921043981	Re-enable all alignments for int accumulators (#807 )	2023-02-06 22:01:15 -05:00
Mark Hoemmen	add4ba622f	Fix 8.4 + CUDA 11.4 build (#789 ) Work around a likely GCC 8.x issue with fold expressions and generic lambdas. Only use the work-around when the host compiler is GCC 8.x. This avoids any concerns about the work-around possibly hindering inlining for a critical CuTe function (product). Users can experiment with the work-around for other compilers or compiler versions by defining the following macro. CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND Fixes https://github.com/NVIDIA/cutlass/issues/788 Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>	2023-01-27 09:18:59 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com> v2.11.0	2023-01-20 16:32:57 -05:00
psaab	d49bef88f9	Enable aarch64 support (#779 )	2023-01-20 15:51:58 -05:00
Haicheng Wu	8b42e751c6	streamk paper link (#765 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 22:10:43 -05:00
Muhammad Osama	eb7f99d3dd	@hwu36 Adding the individual arXiv link for Stream-K paper. (#764 ) * Stream-K individual paper entry. * arXiv links updated.	2023-01-10 20:39:06 -05:00
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Ali Hassani	a1046d49c1	Adds missing semicolon (#759 )	2023-01-09 21:50:46 -05:00
Haicheng Wu	1cd994b4cf	Update PUBLICATIONS.md @neoblizz @dumerrill thesis covering streamk	2023-01-08 00:42:19 -05:00
Gregory Meyer (gregjm)	7bdba07310	Add definitions for tag structs. (#752 ) This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).	2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)	c54ede3a9e	Add const overloads for iterator functions. (#753 ) This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.	2023-01-06 09:46:34 -05:00
Haicheng Wu	ff6e733fe1	restore the old epilogue for everything except streamk (#749 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-04 11:02:55 -05:00
Haicheng Wu	5989b7e1d7	Update PUBLICATIONS.md Add coconet paper to the publication list. @abhijangda	2023-01-04 09:18:38 -05:00
Haicheng Wu	1e64f153b3	improve streamk load balance (#743 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-25 13:56:33 -05:00
Matthew Nicely	78b30d3191	Update README.md	2022-12-21 11:58:19 -05:00
Matthew Nicely	59de82688b	Update README.md	2022-12-21 11:57:55 -05:00
Gregory Meyer (gregjm)	b85865d1ad	Add missing #include directives (#741 ) This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone: ``` In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'? static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ^ ./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here namespace warp { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'? OutputTileIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv< ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments SharedLoadIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here class SharedLoadIterator { ^ ```	2022-12-21 11:40:20 -05:00
Haicheng Wu	3f2bb17722	minor chagnes (#730 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-10 14:44:53 -05:00
ANIKET SHIVAM	38193d76e3	Updates for stream-k (#728 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-12-08 23:48:10 -05:00
Gregory Meyer (gregjm)	1d7772f218	Add missing #include directive (#727 )	2022-12-08 18:58:31 -05:00
Jack Kosaian	df81d847d7	Make Python interface work for non-SM80 targets (#726 ) * Make Python interface work for non-SM80 targets * Remove line in README	2022-12-07 21:53:33 -05:00
Mike Iovine	d6117ca362	Relax stream K gemm alignment constraints (#717 ) * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. * Revert "Relax stream K gemm alignment constraints" This reverts commit `31e80a250e`. * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-07 11:17:49 -05:00
Ali Hassani	9c0518608e	Fix typos in conv problem sizes (#720 ) * Fix typos in conv problem sizes * Typos	2022-12-05 15:54:58 -05:00
Haicheng Wu	9f1f37aa21	misc (#719 ) * misc * minor Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-05 12:07:20 -05:00
Wenzhuo Liu	84213b0b8e	fix: make arch.h self contained (#714 )	2022-12-01 19:25:48 -05:00
tpoisonooo	8567b87d65	Update quickstart.md (#704 ) * Update quickstart.md * Update doxygen_mainpage.md * Update doxygen_mainpage.md * Update terminology.md	2022-11-29 21:43:03 -05:00
Aditya Atluri	c975e2ccbb	releaase 2.11 (#703 )	2022-11-19 09:02:15 -05:00
Wenzhuo Liu	3c90f6aea6	add `#pragma once` for header file in example 42 (#698 )	2022-11-15 22:50:24 -05:00
seventh	06eb90cc0d	Fix identity sigmoid activation (#659 ) * activation support Identity * fix Sigmoid activation operator() with CUTLASS_HOST_DEVICE	2022-11-09 14:42:23 -05:00
seventh	168ea8b0e1	ensure singleton::get thread safe construct instance (#658 ) * ensure singleton::get thread safe construct instance * fix singleton return reference Co-authored-by: xuweiqi <xuweiqi117@gmail.com>	2022-11-08 21:44:32 -05:00
Haicheng Wu	012c62c748	bug fixes and enharcement to gemm reductionK fusion (#682 ) * add two missing files * fix bunch of bugs of gemm-reducek fusion and add a device interface * small changes Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-11-03 11:07:50 -04:00
FZC	cc85b64cf6	fix typo (#677 )	2022-11-01 14:07:33 -04:00
dan_the_3rd	1b4e24470a	Example 43 - DualGemm (#670 ) * Ex50 wip * IS_PROFILING mode * MultiStage2 - but is slower * Add SwiGLU * Support SplitKSerial reduction Support not storing D0/D1 Cleanup code * Option to disable bias * Renumber example * Fix build * Remove references to pb_size_0 / pb_size_1 * Add support for bf16 inputs with float accum * small changes Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-10-26 14:04:42 -04:00
Jack Kosaian	8c1bf9b784	Bump CUTLASS Python container version (#672 ) * Update example 40 README * Update CUTLASS Python README	2022-10-22 21:09:39 -04:00
Yuriy Chernyshov	7d0dd6706e	Remove excessive includes from examples/41_multi_head_attention (#669 ) The rationale behind this change is explained in #563	2022-10-21 22:23:15 -04:00
hlu1	9b47403b2d	Add missing CUTLASS_HOST_DEVICE (#671 )	2022-10-21 22:20:38 -04:00
dan_the_3rd	4db6a6140e	ex42: Fused MHA imported from xFormers (#662 ) * ex42: Fused MHA imported from xFormers * Remove std:: references * Support K>128 in the example * Support causal option * Support different head size for V, and different seqlength for KV * Update FLOPS counter * Remove bit_cast * fix build: Replace M_LOG2E * Add doc * Revert "Remove bit_cast" This reverts commit `9662fa86bb`. * Explicit casts to int32_t for windows build Co-authored-by: danthe3rd <danthe3rd>	2022-10-17 10:49:33 -04:00
Matthew Nicely	3bf95e90c2	Update labeler.yml	2022-10-13 08:03:28 -04:00
Matthew Nicely	75fed7493e	Update labeler.yml	2022-10-13 08:01:21 -04:00
Matthew Nicely	98b73fc95d	Update labeler.yml	2022-10-13 07:55:33 -04:00
Matthew Nicely	4990e3686d	Update labeler.yml	2022-10-13 07:52:38 -04:00
Matthew Nicely	4b7365388c	Update labeler.yml	2022-10-13 07:32:55 -04:00
Matthew Nicely	0d8405588d	Update labeler.yml	2022-10-12 15:32:38 -04:00
Alexander Freudenberg	cb539dab78	Correct typos in comments (#639 ) * Correct typos in comments Correct comments in code on type of generated distribution. Improve Gaussian RNG to take advantage of Box Muller method * Inline Box Muller Added inline function for the Box Muller algorithm and updated code comments to be more concise * Update tensor_fill.h * Update tensor_fill.h * small changes to pass tests Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-09-30 22:51:30 -04:00
Ying Zhang	dadc881a96	Bug fix for gemm broadcast (#650 ) * gemm_universal_with_broadcast, +2 sources. * Revert "gemm_universal_with_broadcast, +2 sources." This reverts commit `fb063251f2`. * gemm broadcast bug fix	2022-09-30 10:00:38 -04:00


329 Commits