Commit Graph

  • 40f124ef27 [CUTLASS] Add GNA to PUBLICATIONS.md (#2276) Ali Hassani 2025-05-02 16:57:19 -04:00
  • 89f6bf2739 Fix group scale gemm when K==128 (#2275) Jiazhen Han 2025-05-02 12:41:18 -07:00
  • f535c33634 3.9.1 doc/version change (#2273) v3.9.1 Haicheng Wu 2025-05-01 00:27:00 -04:00
  • e3cb8a773a Import cuda, cudart, nvrtc lazily (#2251) Michael Lazos 2025-04-30 20:10:33 -07:00
  • c4bdfe821c Lazy scipy import (#2250) Michael Lazos 2025-04-30 13:10:00 -07:00
  • b3ce7e12b7 Make cc a positional argument (#2249) Michael Lazos 2025-04-30 13:09:25 -07:00
  • fe75ead92e Import pydot lazily (#2248) Michael Lazos 2025-04-30 13:08:17 -07:00
  • 35136f5564 Fix wrong detection of python version for use_rmm. (#2224) Ruoxi 2025-04-30 12:29:33 -07:00
  • e5b810bed1 Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256) Qi Yuhang 2025-05-01 03:28:05 +08:00
  • 2b78c2fe31 cherry-pick feature/hopper-blockwise-generalization-optimization (#2270) Lain 2025-04-29 13:47:22 -07:00
  • 697126019e fix blackwell grouped groupwise hang (#2267) Haicheng Wu 2025-04-29 11:54:20 -04:00
  • e94e888df3 Update CHANGELOG.md v3.9.0 Haicheng Wu 2025-04-24 21:51:34 -04:00
  • be73ad20a5 Update CHANGELOG.md for 3.9 Haicheng Wu 2025-04-24 16:54:06 -04:00
  • f02a7c2976 Update README.md for 3.9 Haicheng Wu 2025-04-24 16:51:45 -04:00
  • 331a1f5b3f cutlass 3.9 update (#2255) Yujia Zhai 2025-04-24 12:42:40 -07:00
  • 8e345c5c5b fix_missing_stdint (#2199) 吴坎 2025-04-24 10:21:22 +08:00
  • 81a43e6d92 Set EpiTile correctly when TileN is not divisible by 32 (#2220) Tri Dao 2025-04-21 00:02:51 -04:00
  • ade6376fa0 [SM90] Change register allocation for TileN=208 to avoid spills (#2219) Tri Dao 2025-04-21 00:02:30 -04:00
  • bb4dd682dd Fix broken links and alt text in cluster launch control docs (#2234) milesvant 2025-04-20 21:01:12 -07:00
  • 5e497243f7 fix: fig link in cute docs (#2216) Zhang_kg 2025-04-11 02:51:41 +08:00
  • b3f3c7758c Update tile_iterator.cu (#2204) Haisheng Chen 2025-04-10 11:49:58 -07:00
  • 9e1b649827 fix-left-inverse-for-nvcc114 (#2196) reed 2025-04-11 02:48:46 +08:00
  • 5120b21cc3 suppress compilation warnings (#2195) reed 2025-04-11 02:48:01 +08:00
  • dd76dec4ef [Doc] Make C++ code more plausible (#2156) Ronan Keryell 2025-04-10 11:35:46 -07:00
  • 19cc2a5feb add support for sm89 in cute and the unit tests (#2177) kf-zhang 2025-04-11 02:16:36 +08:00
  • 09df6ac464 [Doc]fix typo (#2174) liwenju0 2025-04-11 00:46:53 +08:00
  • df8a550d39 Update mma_atom.hpp (#2159) liujshi 2025-04-03 23:42:10 +08:00
  • 79fc51f4b8 v3.9 update (#2213) Yujia Zhai 2025-04-02 23:10:16 -07:00
  • 6f4921858b v3.9 update (#2203) Yujia Zhai 2025-04-02 12:11:18 -07:00
  • 62750a2b75 v3.9 (#2185) Yujia Zhai 2025-03-20 22:52:23 -07:00
  • 8c4d1dc47d Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp (#2110) Tyler Michael Smith 2025-03-20 22:44:17 -07:00
  • 3fe62887d8 adding blackwell (#2143) Mohamed Mekkouri 2025-03-18 03:20:40 +01:00
  • bd03b22f64 fix typo (#2136) dongxiao 2025-03-18 10:19:43 +08:00
  • 6c6b78550e Fix SM90 beta=1 hang and stream-K launch errors (#2172) Jack Kosaian 2025-03-13 13:07:37 -05:00
  • 06e560d98a Blockwise/Groupwise kernel improvement and programatic dependent launch enablement (#2161) dePaul Miller 2025-03-10 11:36:11 -07:00
  • e9a75581fe DeepGemm Support - Step 2 (#2142) Deepseek Yuxi Chi 2025-02-28 23:11:59 +08:00
  • df18f5e4f5 Improvements for: Groupwise scaling along M for FP8 gemm (#2095) Lucas Wilkinson 2025-02-27 22:39:29 -05:00
  • ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139) dePaul Miller 2025-02-26 09:44:58 -08:00
  • ac210faef8 DeepGemm Support (#2137) Yuxi Chi 2025-02-26 20:01:12 +08:00
  • 15f5468872 Migrate FlashMLA codes to example. (#2135) Junkai-Wu 2025-02-26 14:29:07 +08:00
  • af5519d938 Flash MLA Support - Step 2 (#2134) myu-guo 2025-02-26 12:18:03 +08:00
  • 415d587ebf Flash MLA support (#2130) myu-guo 2025-02-24 21:31:56 +08:00
  • eefa171318 [EVT] Fix Row/Col broadcast with array arguments (#2120) Josh Fromm 2025-02-21 14:47:30 -08:00
  • afa1772203 truncate name for cutlass profiler (#2124) v3.8.0 Yujia Zhai 2025-02-20 21:16:56 -08:00
  • 9b3772dfa6 Hopper Grouped GEMM support for FP8 Accum (#2123) ANIKET SHIVAM 2025-02-20 18:55:26 -08:00
  • b84e9802d8 update 3.8 v2 (#2112) Yujia Zhai 2025-02-19 19:03:14 -08:00
  • e9627ce55b Always use cudaGetDriverEntryPoint with CUDA 12 (#2086) dan_the_3rd 2025-02-11 19:04:25 +01:00
  • ad6e1ec19c Add ParetoQ to PUBLICATIONS.md (#2089) Sijia(Jackson) Chen 2025-02-10 13:47:02 -08:00
  • 0642d46dd4 Update 0x_gemm_tutorial.md (#2090) botbw 2025-02-11 05:46:43 +08:00
  • 833f6990e0 v3.8.0 update (#2082) Yujia Zhai 2025-02-06 18:33:40 -08:00
  • affd1b693d [EVT] Add support for Row/Col broadcast PtrArray (#2033) Josh Fromm 2025-02-02 09:10:07 -08:00
  • 6f55278121 bugfix generic-k code in top-k with softmax (#1993) Tadej Ciglarič 2025-02-01 01:05:35 +01:00
  • 3c28697b9f Groupwise scaling along M for FP8 gemm (#2037) Liang 2025-02-01 02:51:28 +08:00
  • bdd641790a Update README.md Haicheng Wu 2025-01-28 18:08:13 -05:00
  • cc19d4d22b fix a readme broken link (#2069) Haicheng Wu 2025-01-28 18:03:34 -05:00
  • 47daa33c61 fix cuda 12.6 issues (#2066) Haicheng Wu 2025-01-28 17:28:29 -05:00
  • 389e493055 CUTLASS 3.8 Release (#2059) mihir-awatramani 2025-01-24 23:44:06 -08:00
  • 9eb01fa0b0 update 3.7 docs (#2051) Yujia Zhai 2025-01-23 12:13:50 -08:00
  • b78588d163 CUTLASS 3.7 (#2045) v3.7.0 Yujia Zhai 2025-01-18 06:53:07 -08:00
  • 902dff3663 fix assertion in integer_subbytes.h (#1961) bobliao 2025-01-10 11:47:58 +08:00
  • ef5620dd1d Blockwise Scaling for FP8 (#1932) Manish Gupta 2025-01-09 08:22:09 -08:00
  • 375e284e6a Add Line Break (#2020) Lei Mao 2025-01-08 20:46:59 -08:00
  • 52b35e90ce Fix Typos (#2021) Lei Mao 2025-01-08 20:46:28 -08:00
  • 24f991e879 Fix typo in library_defaults.py (#2024) ZincCat 2025-01-08 12:44:11 -08:00
  • 51b25e7b58 Add vector-types back to platform.h (#2026) Driss Guessous 2025-01-08 12:31:59 -08:00
  • 7de6a59784 Add half->int8 saturate conversion to promise valid range (#1983) ZZK 2025-01-08 22:01:07 +08:00
  • c506e16788 fix mem fence (#2030) Yujia Zhai 2025-01-07 16:02:26 -08:00
  • 7494a180a4 fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (#1989) Dongxu.Wang 2025-01-07 11:05:12 +08:00
  • cffd5d32b7 Update 0x_gemm_tutorial.md (#1982) Andrew O'Neill 2025-01-06 19:04:35 -08:00
  • bf9da7b76c Update CHANGELOG.md v3.6.0 Haicheng Wu 2024-12-25 17:11:15 -05:00
  • 3d261a5974 3.6.0 update (#2005) Yujia Zhai 2024-12-24 22:34:40 -08:00
  • e1cd8c7866 Fix Typo (#1962) Lei Mao 2024-12-10 19:07:37 -08:00
  • 33c584364e Fix CuTe README Typo (#1951) Lei Mao 2024-12-10 19:05:40 -08:00
  • 2b6cfd34d1 fix a typo that fails the compiling when ElementScale is not the same as MmaType (#1977) Lain 2024-12-10 12:54:44 -08:00
  • 4c42f73fda Improve mixed dtype GEMM (#1972) Lain 2024-12-06 10:33:22 -08:00
  • 80243e0b8c add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966) Lain 2024-12-03 11:03:43 -08:00
  • b0e09d7cd3 Fix cutlass python library with cuda 12.6.2.post1 (#1942) dan_the_3rd 2024-11-18 15:06:32 +01:00
  • 8aa95dbb88 Fix the racing condition of mixed-input gemm when writing the registers (#1931) Lain 2024-11-08 10:15:54 -08:00
  • d656afbd2a fix undefined in device code error (#1880) LiYu Lu 2024-11-07 03:56:54 +08:00
  • 32e3c38aef remove restriction of stride == kernel in nhwc_pooling (#1896) LiuQiang 2024-11-07 03:54:53 +08:00
  • 9004ed2d1b Update publications (#1912) Wenlei Bao 2024-11-06 11:54:15 -08:00
  • 19f51596e8 feat: support kFactor 8 used in mma tensor op tile iterator (#1512) chenwei 2024-10-29 23:56:59 +08:00
  • e8a8b69365 Refactor some GroupedGEMM logic (#1899) azhurkevich 2024-10-25 17:14:01 -07:00
  • 08a49953a0 Add a print for the uint{x}b_t type. (#1871) LiYu Lu 2024-10-25 02:39:22 +08:00
  • a424ca6cf9 fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856) Caleb_Du 2024-10-25 02:38:35 +08:00
  • be692b48b0 remove redundant hardcoded packing configs in mixed dtype gemm (#1894) Lain 2024-10-23 11:24:09 -07:00
  • 12626bcfe4 Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569) 侯奇 2024-10-24 00:56:36 +08:00
  • f02913c34e Include of regular_tile_iterator.h fixed for NVRTC (#1765) MaxAkaAltmer 2024-10-23 19:55:59 +03:00
  • 03e3bffaec Adjusting code indentation (#1639) 103yiran 2024-10-24 00:55:02 +08:00
  • e5f3caf145 Fix README (#1658) Lei Mao 2024-10-23 09:52:43 -07:00
  • 83ae20c740 added mapping for bf16 to torch::kBFloat16 (#1843) Bogumil Sapinski Mobica 2024-10-23 18:48:31 +02:00
  • b0c09ed077 fix by adding public (#1753) Xinyu Yang 2024-10-24 00:45:58 +08:00
  • ea69cc2849 fix typo (#1853) sijialou 2024-10-24 00:45:28 +08:00
  • f3a3bfcbf2 add maximum support (#1833) Xinyu Yang 2024-10-24 00:44:56 +08:00
  • d65266a868 Add all supported GMMA shapes (#1890) Sergey Klevtsov 2024-10-22 15:13:36 -07:00
  • 5b50a8faaf Add GMMA shape m64n40k16 (#1864) Tri Dao 2024-10-21 17:41:47 -07:00
  • 08101d9d0c Improve sm90 mixed dtype kernel (#1883) Sergey Klevtsov 2024-10-17 17:06:38 -07:00
  • 755194a7bd add is_last_tile Haicheng Wu 2024-10-17 12:11:02 -07:00
  • 53668799b2 Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803) Saagar Jha 2024-10-14 16:46:20 -07:00
  • cc3c29a81a CUTLASS 3.6.0 (#1850) Yujia Zhai 2024-10-09 12:33:27 -07:00