cutlass

Author	SHA1	Message	Date
Josh Fromm	affd1b693d	[EVT] Add support for Row/Col broadcast PtrArray (#2033 ) * Add group support to EVT row/col broadcast. * small modifications --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-02-02 12:10:07 -05:00
Tadej Ciglarič	6f55278121	bugfix generic-k code in top-k with softmax (#1993 ) * bugfix generic-k code in top-k with softmax * Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> * Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com> --------- Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>	2025-01-31 19:05:35 -05:00
Liang	3c28697b9f	Groupwise scaling along M for FP8 gemm (#2037 ) * FP8 groupwise scaling along M * small updates --------- Co-authored-by: zl <zl@deepseek.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-31 13:51:28 -05:00
Haicheng Wu	bdd641790a	Update README.md	2025-01-28 18:08:13 -05:00
Haicheng Wu	cc19d4d22b	fix a readme broken link (#2069 )	2025-01-28 18:03:34 -05:00
Haicheng Wu	47daa33c61	fix cuda 12.6 issues (#2066 )	2025-01-28 17:28:29 -05:00
mihir-awatramani	389e493055	CUTLASS 3.8 Release (#2059 ) * CUTLASS 3.8 Release * update * Update README.md * Revert "Update README.md" This reverts commit `b353e36fe8`. * update * update --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-25 02:44:06 -05:00
Yujia Zhai	9eb01fa0b0	update 3.7 docs (#2051 ) * update docs * update docs * update docs * update docs * update docs --------- Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-01-23 15:13:50 -05:00
Yujia Zhai	b78588d163	CUTLASS 3.7 (#2045 ) * CUTLASS 3.7 * clean up changelog --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com> v3.7.0	2025-01-18 09:53:07 -05:00
bobliao	902dff3663	fix assertion in integer_subbytes.h (#1961 )	2025-01-09 22:47:58 -05:00
Manish Gupta	ef5620dd1d	Blockwise Scaling for FP8 (#1932 ) * F8 Blockwise Scaling * two more NumProducerThreadEvents --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-09 11:22:09 -05:00
Lei Mao	375e284e6a	Add Line Break (#2020 )	2025-01-08 23:46:59 -05:00
Lei Mao	52b35e90ce	Fix Typos (#2021 ) * Fix Typo * Fix Typo	2025-01-08 23:46:28 -05:00
ZincCat	24f991e879	Fix typo in library_defaults.py (#2024 )	2025-01-08 15:44:11 -05:00
Driss Guessous	51b25e7b58	Add vector-types back to platform.h (#2026 )	2025-01-08 15:31:59 -05:00
ZZK	7de6a59784	Add half->int8 saturate conversion to promise valid range (#1983 ) * Add half->int8 saturate conversion to promise valid range * add gpu only macro --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2025-01-08 09:01:07 -05:00
Yujia Zhai	c506e16788	fix mem fence (#2030 ) Co-authored-by: yuzhai <yuzhai@nvidia.com>	2025-01-07 19:02:26 -05:00
Dongxu.Wang	7494a180a4	fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (#1989 )	2025-01-06 22:05:12 -05:00
Andrew O'Neill	cffd5d32b7	Update 0x_gemm_tutorial.md (#1982 ) Shouldn't this be BLK_M, BLK_K, k	2025-01-06 22:04:35 -05:00
Haicheng Wu	bf9da7b76c	Update CHANGELOG.md v3.6.0	2024-12-25 17:11:15 -05:00
Yujia Zhai	3d261a5974	3.6.0 update (#2005 ) * 3.6.0 update * doc and swap stuff --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-12-25 01:34:40 -05:00
Lei Mao	e1cd8c7866	Fix Typo (#1962 )	2024-12-10 22:07:37 -05:00
Lei Mao	33c584364e	Fix CuTe README Typo (#1951 )	2024-12-10 22:05:40 -05:00
Lain	2b6cfd34d1	fix a typo that fails the compiling when ElementScale is not the same as MmaType (#1977 )	2024-12-10 15:54:44 -05:00
Lain	4c42f73fda	Improve mixed dtype GEMM (#1972 ) * update * fix a typo	2024-12-06 13:33:22 -05:00
Lain	80243e0b8c	add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966 )	2024-12-03 14:03:43 -05:00
dan_the_3rd	b0e09d7cd3	Fix `cutlass` python library with cuda `12.6.2.post1` (#1942 ) * Fix `cutlass` python library with cuda `12.6.2.post1` Previously we had this error: ``` File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp> _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")] ^^^^^^ ValueError: invalid literal for int() with base 10: 'post1' ``` * Update sm90_utils.py * Update generator.py * Update python/cutlass_library/generator.py Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> * Update python/cutlass_library/sm90_utils.py Co-authored-by: Jack Kosaian <jackkosaian@gmail.com> --------- Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>	2024-11-18 09:06:32 -05:00
Lain	8aa95dbb88	Fix the racing condition of mixed-input gemm when writing the registers (#1931 ) * move two warpgroup_wait * merge main --------- Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>	2024-11-08 13:15:54 -05:00
LiYu Lu	d656afbd2a	fix undefined in device code error (#1880 )	2024-11-06 14:56:54 -05:00
LiuQiang	32e3c38aef	remove restriction of stride == kernel in nhwc_pooling (#1896 )	2024-11-06 14:54:53 -05:00
Wenlei Bao	9004ed2d1b	Update publications (#1912 )	2024-11-06 14:54:15 -05:00
chenwei	19f51596e8	feat: support kFactor 8 used in mma tensor op tile iterator (#1512 )	2024-10-29 11:56:59 -04:00
azhurkevich	e8a8b69365	Refactor some GroupedGEMM logic (#1899 )	2024-10-25 20:14:01 -04:00
LiYu Lu	08a49953a0	Add a print for the uint{x}b_t type. (#1871 )	2024-10-24 14:39:22 -04:00
Caleb_Du	a424ca6cf9	fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856 ) * fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation * add "print" template for subbyte_reference<T>	2024-10-24 14:38:35 -04:00
Lain	be692b48b0	remove redundant hardcoded packing configs in mixed dtype gemm (#1894 ) Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>	2024-10-23 14:24:09 -04:00
侯奇	12626bcfe4	Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569 ) fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`	2024-10-23 12:56:36 -04:00
MaxAkaAltmer	f02913c34e	Include of regular_tile_iterator.h fixed for NVRTC (#1765 ) * Include of regular_tile_iterator.h fixed for NVRTC * More include fixed for NVRTC	2024-10-23 12:55:59 -04:00
103yiran	03e3bffaec	Adjusting code indentation (#1639 )	2024-10-23 12:55:02 -04:00
Lei Mao	e5f3caf145	Fix README (#1658 ) * Fix README * Improve README --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-10-23 12:52:43 -04:00
Bogumil Sapinski Mobica	83ae20c740	added mapping for bf16 to torch::kBFloat16 (#1843 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-10-23 12:48:31 -04:00
Xinyu Yang	b0c09ed077	fix by adding public (#1753 )	2024-10-23 12:45:58 -04:00
sijialou	ea69cc2849	fix typo (#1853 )	2024-10-23 12:45:28 -04:00
Xinyu Yang	f3a3bfcbf2	add maximum support (#1833 )	2024-10-23 12:44:56 -04:00
Sergey Klevtsov	d65266a868	Add all supported GMMA shapes (#1890 )	2024-10-22 18:13:36 -04:00
Tri Dao	5b50a8faaf	Add GMMA shape m64n40k16 (#1864 )	2024-10-21 20:41:47 -04:00
Sergey Klevtsov	08101d9d0c	Improve sm90 mixed dtype kernel (#1883 )	2024-10-17 20:06:38 -04:00
Haicheng Wu	755194a7bd	add is_last_tile	2024-10-17 12:11:02 -07:00
Saagar Jha	53668799b2	Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803 )	2024-10-14 19:46:20 -04:00
Yujia Zhai	cc3c29a81a	CUTLASS 3.6.0 (#1850 ) * v3.6 * update changelog * update readme * fix typo * fixing typos * hopper gemm with weight prefetch --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-10-09 15:33:27 -04:00

597 Commits