eefa171318
[EVT] Fix Row/Col broadcast with array arguments (#2120)
* Use `if constexpr` to prevent an invalid comparison.
* Move the constexpr check into the else scope.
2025-02-21 17:47:30 -05:00
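The fix above names a standard C++17 pattern; as a minimal, hypothetical sketch (not the actual CUTLASS code), `if constexpr` discards the untaken branch at compile time, so a comparison that would be ill-formed for array arguments is never instantiated:

```cpp
#include <type_traits>

// Hypothetical illustration of the commit's pattern: the comparison in the
// `if constexpr` branch is only instantiated for integral T, so passing an
// array/pointer argument no longer triggers an invalid comparison.
template <class T>
int positive_stride_or_zero(T const& s) {
  if constexpr (std::is_integral_v<T>) {
    // Only compiled for integral strides; for an array argument this
    // comparison would be the "invalid comparison" the fix avoids.
    return s > 0 ? static_cast<int>(s) : 0;
  } else {
    return 0;  // array/pointer arguments fall through safely
  }
}
```

With a plain `if`, both branches would be instantiated for every `T` and the array case would fail to compile.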
9b3772dfa6
Hopper Grouped GEMM support for FP8 Accum (#2123)
* Add support for FP8 accum, with profiler extension
* Update .gitignore
* contri
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-20 21:55:26 -05:00
b84e9802d8
update 3.8 v2 (#2112)
* update 3.8 v2
* update 3.8
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
e9627ce55b
Always use cudaGetDriverEntryPoint with CUDA 12 (#2086)
`cudaGetDriverEntryPointByVersion` was only added to the driver in 12.5, and the driver version is not known at compile time.
In particular, a build with nvcc 12.8 may run against a 12.2 driver, and this was causing the following error:
```
undefined symbol: cudaGetDriverEntryPointByVersion,
```
2025-02-11 13:04:25 -05:00
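The underlying pattern is resolving a symbol by name at runtime instead of taking a link-time dependency that older drivers cannot satisfy. As a hedged illustration (using POSIX `dlsym` as a stand-in for the CUDA runtime's entry-point query, not the actual CUTLASS code):

```cpp
#include <dlfcn.h>  // POSIX runtime symbol lookup

// Sketch of the commit's idea: linking directly against a symbol that a
// 12.2 driver does not export fails at load time with "undefined symbol".
// Looking the symbol up by name at runtime instead lets the code fall
// back gracefully when the symbol is absent.
void* resolve_or_null(const char* name) {
  void* handle = dlopen(nullptr, RTLD_LAZY);  // the running process
  if (!handle) return nullptr;
  void* fn = dlsym(handle, name);  // nullptr if the symbol is missing
  dlclose(handle);
  return fn;
}
```

The CUDA-side equivalent is to query the driver entry point through `cudaGetDriverEntryPoint`, which is available throughout CUDA 12, rather than the `...ByVersion` variant that first shipped with the 12.5 driver.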
833f6990e0
v3.8.0 update (#2082)
* 3.8 update
* fix Markus' name
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
affd1b693d
[EVT] Add support for Row/Col broadcast PtrArray (#2033)
* Add group support to EVT row/col broadcast.
* Small modifications
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-02 12:10:07 -05:00
6f55278121
Bugfix: generic-k code in top-k with softmax (#1993)
* Bugfix generic-k code in top-k with softmax
* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu
---------
Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f
Groupwise scaling along M for FP8 GEMM (#2037)
* FP8 groupwise scaling along M
* Small updates
---------
Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
47daa33c61
Fix CUDA 12.6 issues (#2066)
2025-01-28 17:28:29 -05:00
389e493055
CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release
* update
* Update README.md
* Revert "Update README.md"
  This reverts commit b353e36fe8.
* update
* update
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
b78588d163
CUTLASS 3.7 (#2045)
* CUTLASS 3.7
* Clean up changelog
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
ef5620dd1d
Blockwise Scaling for FP8 (#1932)
* FP8 blockwise scaling
* Two more NumProducerThreadEvents
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
375e284e6a
Add Line Break (#2020)
2025-01-08 23:46:59 -05:00
51b25e7b58
Add vector-types back to platform.h (#2026)
2025-01-08 15:31:59 -05:00
7de6a59784
Add half->int8 saturate conversion to ensure a valid range (#1983)
* Add half->int8 saturate conversion to ensure a valid range
* Add GPU-only macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-08 09:01:07 -05:00
c506e16788
Fix mem fence (#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-07 19:02:26 -05:00
7494a180a4
Fix bug: arch/mma_sm60.h Mma<2,2,1> computes the wrong result (#1989)
2025-01-06 22:05:12 -05:00
3d261a5974
3.6.0 update (#2005)
* 3.6.0 update
* Doc and swap stuff
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
e1cd8c7866
Fix typo (#1962)
2024-12-10 22:07:37 -05:00
2b6cfd34d1
Fix a typo that breaks compilation when ElementScale is not the same as MmaType (#1977)
2024-12-10 15:54:44 -05:00
4c42f73fda
Improve mixed dtype GEMM (#1972)
* update
* Fix a typo
2024-12-06 13:33:22 -05:00
80243e0b8c
Add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966)
2024-12-03 14:03:43 -05:00
8aa95dbb88
Fix the race condition in mixed-input GEMM when writing the registers (#1931)
* Move two warpgroup_wait calls
* Merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
d656afbd2a
Fix "undefined in device code" error (#1880)
2024-11-06 14:56:54 -05:00
19f51596e8
feat: support kFactor 8 used in mma tensor op tile iterator (#1512)
2024-10-29 11:56:59 -04:00
e8a8b69365
Refactor some GroupedGEMM logic (#1899)
2024-10-25 20:14:01 -04:00
08a49953a0
Add a print for the uint{x}b_t type (#1871)
2024-10-24 14:39:22 -04:00
a424ca6cf9
Fix wrong A/BLayout in MMA_Traits for binary MMA and add other MMA_Traits support (#1856)
* Fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and add support for m8n8k128, m16n8k128 mma.and.popc in the MMA_Traits instantiation
* Add a "print" template for subbyte_reference<T>
2024-10-24 14:38:35 -04:00
be692b48b0
Remove redundant hardcoded packing configs in mixed dtype GEMM (#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-10-23 14:24:09 -04:00
f02913c34e
Include of regular_tile_iterator.h fixed for NVRTC (#1765)
* Include of regular_tile_iterator.h fixed for NVRTC
* More includes fixed for NVRTC
2024-10-23 12:55:59 -04:00
03e3bffaec
Adjusting code indentation (#1639)
2024-10-23 12:55:02 -04:00
b0c09ed077
Fix by adding public (#1753)
2024-10-23 12:45:58 -04:00
d65266a868
Add all supported GMMA shapes (#1890)
2024-10-22 18:13:36 -04:00
5b50a8faaf
Add GMMA shape m64n40k16 (#1864)
2024-10-21 20:41:47 -04:00
08101d9d0c
Improve SM90 mixed dtype kernel (#1883)
2024-10-17 20:06:38 -04:00
755194a7bd
Add is_last_tile
2024-10-17 12:11:02 -07:00
53668799b2
Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803)
2024-10-14 19:46:20 -04:00
cc3c29a81a
CUTLASS 3.6.0 (#1850)
* v3.6
* Update changelog
* Update readme
* Fix typo
* Fix typos
* Hopper GEMM with weight prefetch
---------
Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
0837a2a00a
Fix typo in comment (#1787)
2024-10-07 12:39:59 -04:00
e2b0789927
Add some can-implement rules for Hopper convolution (#1835)
2024-09-25 11:28:10 -04:00
2991ce18d3
Add print_svg for MMA (#1733)
* Add print_svg for mma
* Correct the code indentation
2024-09-18 10:37:24 -04:00
1ebda1ccef
Fix MMA promotion interval assertions (#1641)
2024-09-16 12:38:42 -04:00
3a8c01a18b
Prefix a member template name with the template keyword (#1796)
Fixes an LLVM build error.
2024-09-11 13:33:56 -04:00
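The commit title names a standard C++ rule: a dependent member template must be disambiguated with the `template` keyword, which Clang/LLVM enforces strictly. A minimal illustration with hypothetical names (not the CUTLASS code):

```cpp
// Without the `template` keyword below, the compiler parses `t.get<2>(...)`
// as a comparison (`t.get < 2`), producing the build error the commit fixes.
struct Widget {
  template <int N>
  int get() const { return N; }
};

template <class T>
int call_get(T const& t) {
  // `get` is a member template of a dependent type, so the `template`
  // keyword is required here.
  return t.template get<2>();
}
```

GCC historically accepted some of these without the keyword; Clang rejects them, which is why the fix surfaced as an LLVM build error.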
dbdae514e0
Support TMA epilogue for Grouped GEMM and add pingpong ptr-array & Grouped GEMM (#1795)
2024-09-11 00:07:31 -04:00
21d0534167
Fix assertion (#1790)
2024-09-09 14:05:27 -04:00
323c8170bf
Support ComputeFn where the output type differs from the input type (#1771)
This is useful, e.g., for a function that takes two float inputs and turns them into a complex number.
2024-09-05 23:25:03 -04:00
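As a hedged illustration of that idea (hypothetical names, not the actual CUTLASS EVT API), a compute functor's result type can be deduced rather than assumed to match its inputs:

```cpp
#include <complex>
#include <type_traits>
#include <utility>

// Stand-in for a compute node whose result type differs from its input
// type: two floats in, one std::complex<float> out.
struct MakeComplex {
  std::complex<float> operator()(float re, float im) const {
    return std::complex<float>(re, im);
  }
};

// A generic caller deduces the callable's result type instead of
// assuming it matches the inputs.
template <class Fn, class... Args>
using fn_result_t = decltype(std::declval<Fn>()(std::declval<Args>()...));

static_assert(std::is_same_v<fn_result_t<MakeComplex, float, float>,
                             std::complex<float>>,
              "output type may differ from input type");
```

Deducing the result type this way is what lets a fusion graph chain such a node into downstream operations that expect the wider output type.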
82f5075946
set_slice3x3 -> set_slice_3x3 (#1784)
2024-09-05 23:24:10 -04:00
06e337758d
Remove extraneous comma in declaration (#1776)
2024-09-05 17:14:15 -04:00
7369adcaca
Add Sm90LinCombPerColBias (#1774)
Co-authored-by: Jiayu Sun <jiayus@s4124-0071.nvidia.com>
2024-09-04 15:11:24 -04:00
6c3044136b
Update barrier.h (#1782)
2024-09-04 14:52:11 -04:00