726 Commits

Author SHA1 Message Date
b0296bf682 fix uint128 2024-08-15 21:06:01 -07:00
865be73a97 Merge pull request #1713 from NVIDIA/351_sparse_update
update 3.5.1 readme/changelog
2024-08-15 11:44:49 -05:00
8d8cfdf375 update 3.5.1 readme/changelog 2024-08-14 21:12:44 -07:00
eqy
fb170439e8 Update half.h (#1709) 2024-08-14 14:59:59 -04:00
4e5a8f6853 3.5.1 plots and updated readme (#1708)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-12 18:55:55 -04:00
7192f4ab23 Add CLayout_64x208 (#1680)
Without this I get compilation error when the extended shapes are enabled
2024-08-08 14:00:24 -04:00
2049c6c5a2 5476 cutlass 3x gemm kernels (#1695)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-08 13:56:23 -04:00
e22ba590cd support data type w2 used in cutlass_library (#1517) 2024-08-06 11:15:18 -04:00
19b4c5e065 Fix isnan namespace qualification in cutlass/functional.h (#1679)
* Fix unrelated MSVC build warnings

* Fix use of isnan in functional.h

Correct namespace qualification of isnan in functional.h
so that it invokes cutlass::isnan for half_t, instead of
converting half_t to float and invoking std::isnan (on host,
or ::isnan on device).
2024-08-05 14:28:13 -04:00
06b21349bc 1x1x1 cluster launch (#1673) 2024-08-01 12:20:28 -04:00
eee0cab26c Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665)
Adds 128x256 tile shapes to FP16/BF16 and FP8 generators.
Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators.

NOTE: it is important to set kernel filter (--kernels /
CUTLASS_LIBRARY_KERNELS) to a non empty string and skip pruning to get
all of the new configurations.

If profiling exhaustively, they can be set to `*`.

Number of CUTLASS 3.X GEMMs before this commit: 2868
Number of CUTLASS 3.X GEMMs after this commit: 4016

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2024-07-31 20:22:29 -04:00
36cbfcf483 Add extended wgmma shapes for all data types (#1666) 2024-07-31 18:33:14 -04:00
1f2b590da6 Skip void-C kernels in the profiler when beta is non zero (#1661)
* Skip void-C kernels in the profiler when beta is non zero

CUTLASS profiler will only skip disposition for void-C kernels when beta
is non zero, when it makes more sense to skip running it in the first
place.

Not all users are aware of void-C kernels (as far as I know it wasn't a
thing in 2.X), and not everyone remembers to filter out voidC kernels
when running the profiler with a non zero beta.

The easiest solution (and as far as I can tell correct way of handling this)
is that `can_implement` return `false` when beta is non zero (or
whatever argument indicates an epilogue source) but we have a void-C
kernel.

Profiler already includes functionality to skip running kernels that
fail `can_implement`.

* Move checks to collectives instead

---------

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2024-07-31 18:11:58 -04:00
8b2a0408bd Profiler docs and argument update for raster order (#1667) 2024-07-31 16:40:10 -04:00
eqy
fbd116c0e5 fix build on SM 5.2 (#1664) 2024-07-31 09:54:57 -04:00
5b283c872c Add more GMMA shapes (#1630)
* Add more GMMA shapes

* Add more shapes for BF16
2024-07-29 19:09:51 -04:00
be60a0b272 CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1

* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
56b46e2d13 Fix grouped gemm invalid memory access to problem shapes (#1543) 2024-07-10 11:55:22 -04:00
52fb43f30f fix mbarrier invalidate (#1494) 2024-07-10 11:35:26 -04:00
843adf0408 Fix SMEM index for C in CuTe examples (#1477) 2024-07-10 11:14:15 -04:00
e48c7618e4 [bug] fix device thread gemm.h constructor (#1473) 2024-07-10 11:12:36 -04:00
c5239d8312 Add Faster Neighborhood Attention to pubs (#1471) 2024-07-10 11:09:13 -04:00
d6580c3dc0 Support use of external/system GTest installation (#1469)
* Support use of system/external GTest installation

* Create working directory for tests explicitly
2024-07-10 11:07:57 -04:00
81b06ee0e0 Fix B operand variable name and comments (#1458) 2024-07-10 11:06:29 -04:00
dbfced05e7 Fix typos in convolution tests (#1433) 2024-07-10 11:00:52 -04:00
2448bb56e6 Update gemm_api_3x.md (#1386)
Fixed what it seems to be an obvious typo.
2024-07-10 10:59:02 -04:00
637b159063 Fix C++17 version detection in helper_macros.hpp (#1479)
* It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus

* Fixed typo.

* Oops, another typo.

* Changed incorrect logic, ifndef to ifdef

* Define CUTLAS_CPLUSPLUS for language version testing

Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>

---------

Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>
2024-05-28 11:00:51 -04:00
033d9efd2d [Documentation] Fixes the confusion between concatenated vs. composed layout in CuTe documentation (#1498)
* Update 02_layout_algebra.md

* Update 02_layout_algebra.md
2024-05-02 15:35:12 -04:00
Sin
acc3ee18a1 Fix typos in cute docs (#1486)
* fix typos in 02_layout_algebra.md

* fix typos in 03_tensor.md
2024-05-02 15:34:36 -04:00
5c447dd84f Update packed_stride.hpp to add CUTLASS_HOST_DEVICE decorator to new functions (#1495) 2024-04-19 12:07:57 -04:00
7d49e6c7e2 Updates for CUTLASS 3.5.0 (#1468) v3.5.0 2024-04-11 21:33:40 -04:00
a40e08e9d5 Update 02_layout_algebra.md (#1451)
change line 348 to reflect correct layout.
2024-04-10 10:57:57 -04:00
lzw
8e7d9f483d add missing header for size_t in numeric_types.h (#1420)
* add missing header for size_t in `numeric_types.h`

* make nvrtc happy

* add missing header for int types in `cutlass/arch/memory.h`

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-04-09 14:15:48 -04:00
19f3cc33f1 Fix uint128 operator add (#1400)
* fix uint128 operator add for 64-bit hilo implemenation

* add uint128 test for operator add

* make clang happy

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-04-02 13:32:18 -04:00
f9ece1b42c Python Gemm tile_descriptions fix (#1439)
* fix python gemm tile descriptions

* fix formatting

* fix math_operation filtering

* fix formatting
2024-03-30 09:00:46 -04:00
28cbacbf64 fix stride compilation warning (#1415) 2024-03-29 23:50:33 -04:00
8f7d2789b8 [NFC] improve doc: fix typo in mma doc (#1417) 2024-03-27 14:07:20 -04:00
c4e3e122e2 group gemm set stride L = cute::Int<0> (#1416) 2024-03-20 17:31:14 -04:00
629f4653c3 CUTLASS 3.5.0 (#1411) 2024-03-19 17:51:04 -04:00
ffa34e7075 (NFC) improve doc: Add missing verb to sentence (#1377)
Co-authored-by: lorenzo chelini <lchelini@nvidia.com>
2024-03-04 15:30:10 -05:00
a8f2c80db0 fix tile_size(TiledCopy<Args...> const&) error (#1357) 2024-02-24 00:33:01 -05:00
bbe579a9e3 Updates for CUTLASS 3.4.1 (#1346)
* Updates for CUTLASS 3.4.1

* minor epi change
v3.4.1
2024-02-15 15:48:34 -05:00
47a3ebbea9 Add a missing platform include (#1328) 2024-02-03 01:30:32 -05:00
57e01e1a6b Fix missing include file (#1318) 2024-02-03 01:29:32 -05:00
6e3df975a2 Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325) 2024-01-31 21:41:30 -05:00
8825fbf1ef fix unrecognized print format specifier for int8/uint8 (#1303)
* fix unrecognized print format specifier for int8/uint8

* use c++ static_cast instead of c cast style
2024-01-29 21:22:40 -05:00
092f14db05 fix tile_size_mnk compilation warning (#1294) 2024-01-29 21:21:15 -05:00
9385141f19 Update PUBLICATIONS.md
ptq paper from goog
2024-01-19 14:17:55 -05:00
b4b5b11070 Update PUBLICATIONS.md
add odyssey llm paper from metuan
2024-01-18 10:30:21 -05:00
139b93db61 update publications (#1308) 2024-01-17 14:06:46 -05:00