Commit Graph

597 Commits

Author SHA1 Message Date
affd1b693d [EVT] Add support for Row/Col broadcast PtrArray (#2033)
* Add group support to EVT row/col broadcast.

* small modifications

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-02 12:10:07 -05:00
6f55278121 bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
bdd641790a Update README.md 2025-01-28 18:08:13 -05:00
cc19d4d22b fix a readme broken link (#2069) 2025-01-28 18:03:34 -05:00
47daa33c61 fix cuda 12.6 issues (#2066) 2025-01-28 17:28:29 -05:00
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
9eb01fa0b0 update 3.7 docs (#2051)
* update docs

* update docs

* update docs

* update docs

* update docs

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-23 15:13:50 -05:00
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
v3.7.0
2025-01-18 09:53:07 -05:00
902dff3663 fix assertion in integer_subbytes.h (#1961) 2025-01-09 22:47:58 -05:00
ef5620dd1d Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
375e284e6a Add Line Break (#2020) 2025-01-08 23:46:59 -05:00
52b35e90ce Fix Typos (#2021)
* Fix Typo

* Fix Typo
2025-01-08 23:46:28 -05:00
24f991e879 Fix typo in library_defaults.py (#2024) 2025-01-08 15:44:11 -05:00
51b25e7b58 Add vector-types back to platform.h (#2026) 2025-01-08 15:31:59 -05:00
ZZK
7de6a59784 Add half->int8 saturate conversion to promise valid range (#1983)
* Add half->int8 saturate conversion to promise valid range

* add gpu only macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-08 09:01:07 -05:00
c506e16788 fix mem fence (#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-07 19:02:26 -05:00
7494a180a4 fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (#1989) 2025-01-06 22:05:12 -05:00
cffd5d32b7 Update 0x_gemm_tutorial.md (#1982)
Shouldn't this be BLK_M, BLK_**K**, k
2025-01-06 22:04:35 -05:00
bf9da7b76c Update CHANGELOG.md v3.6.0 2024-12-25 17:11:15 -05:00
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
e1cd8c7866 Fix Typo (#1962) 2024-12-10 22:07:37 -05:00
33c584364e Fix CuTe README Typo (#1951) 2024-12-10 22:05:40 -05:00
2b6cfd34d1 fix a typo that fails the compiling when ElementScale is not the same as MmaType (#1977) 2024-12-10 15:54:44 -05:00
4c42f73fda Improve mixed dtype GEMM (#1972)
* update

* fix a typo
2024-12-06 13:33:22 -05:00
80243e0b8c add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966) 2024-12-03 14:03:43 -05:00
b0e09d7cd3 Fix cutlass python library with cuda 12.6.2.post1 (#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`

Previously we had this error:
```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                       ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```

* Update sm90_utils.py

* Update generator.py

* Update python/cutlass_library/generator.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

* Update python/cutlass_library/sm90_utils.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

---------

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
2024-11-18 09:06:32 -05:00
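
Note on b0e09d7cd3: the traceback quoted in that commit comes from calling `int()` on every dot-separated piece of the version string, which fails on a non-numeric suffix such as `post1`. The sketch below reproduces the failure and shows one possible way to tolerate such suffixes. It is an illustration only, not the change made in the commit; `naive_parse` and `parse_version_prefix` are hypothetical names introduced here.

```python
import re
from typing import List


def naive_parse(version: str) -> List[int]:
    # Same expression as in the quoted traceback: every dot-separated
    # piece must be a plain integer, so "post1" raises ValueError.
    return [int(x) for x in version.split("rc")[0].split(".")]


def parse_version_prefix(version: str) -> List[int]:
    # Hypothetical tolerant variant (an assumption, not the actual fix):
    # keep the leading numeric components and stop at the first piece
    # that does not start with a digit.
    parts: List[int] = []
    for piece in version.split("rc")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return parts


if __name__ == "__main__":
    try:
        naive_parse("12.6.2.post1")
    except ValueError as err:
        print("naive parse failed:", err)
    print(parse_version_prefix("12.6.2.post1"))
```

Stopping at the first non-numeric piece keeps the major, minor, and patch numbers available for comparisons while ignoring post-release tags. Whether that is the right policy depends on how the version is used downstream.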
8aa95dbb88 Fix the racing condition of mixed-input gemm when writing the registers (#1931)
* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
d656afbd2a fix undefined in device code error (#1880) 2024-11-06 14:56:54 -05:00
32e3c38aef remove restriction of stride == kernel in nhwc_pooling (#1896) 2024-11-06 14:54:53 -05:00
9004ed2d1b Update publications (#1912) 2024-11-06 14:54:15 -05:00
19f51596e8 feat: support kFactor 8 used in mma tensor op tile iterator (#1512) 2024-10-29 11:56:59 -04:00
e8a8b69365 Refactor some GroupedGEMM logic (#1899) 2024-10-25 20:14:01 -04:00
08a49953a0 Add a print for the uint{x}b_t type. (#1871) 2024-10-24 14:39:22 -04:00
a424ca6cf9 fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation

* add "print" template for subbyte_reference<T>
2024-10-24 14:38:35 -04:00
be692b48b0 remove redundant hardcoded packing configs in mixed dtype gemm (#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-10-23 14:24:09 -04:00
12626bcfe4 Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
2024-10-23 12:56:36 -04:00
f02913c34e Include of regular_tile_iterator.h fixed for NVRTC (#1765)
* Include of regular_tile_iterator.h fixed for NVRTC

* More include fixed for NVRTC
2024-10-23 12:55:59 -04:00
03e3bffaec Adjusting code indentation (#1639) 2024-10-23 12:55:02 -04:00
e5f3caf145 Fix README (#1658)
* Fix README

* Improve README

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:52:43 -04:00
83ae20c740 added mapping for bf16 to torch::kBFloat16 (#1843)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:48:31 -04:00
b0c09ed077 fix by adding public (#1753) 2024-10-23 12:45:58 -04:00
ea69cc2849 fix typo (#1853) 2024-10-23 12:45:28 -04:00
f3a3bfcbf2 add maximum support (#1833) 2024-10-23 12:44:56 -04:00
d65266a868 Add all supported GMMA shapes (#1890) 2024-10-22 18:13:36 -04:00
5b50a8faaf Add GMMA shape m64n40k16 (#1864) 2024-10-21 20:41:47 -04:00
08101d9d0c Improve sm90 mixed dtype kernel (#1883) 2024-10-17 20:06:38 -04:00
755194a7bd add is_last_tile 2024-10-17 12:11:02 -07:00
53668799b2 Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803) 2024-10-14 19:46:20 -04:00
cc3c29a81a CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00