* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage users to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
* add support for sm89 in cute and the unit tests
* rebase v3.9 and format code
* minor fix
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Treat negative zero as zero in the sparse gemm compressor
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* format
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Apply patch
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* sm90_sparse_gemm_compressor.hpp
* test/unit/transform/CMakeLists.txt
* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp
* include/cutlass/numeric_types.h
---------
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
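A minimal sketch of why negative zero needs special handling, assuming the zero check inspects raw bit patterns (the compressor's actual check is not shown in this log). The helper names are hypothetical and use `float` for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit-pattern check: -0.0f has the sign bit set, so it is reported
// as nonzero even though it compares equal to 0.0f.
inline bool is_zero_bitwise(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return bits == 0u;
}

// Sign-insensitive check: clear the sign bit before comparing, so both
// +0.0f and -0.0f count as zero.
inline bool is_zero_ignoring_sign(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return (bits & 0x7FFFFFFFu) == 0u;
}

// Usage: the two checks disagree only on negative zero.
inline void check_negative_zero() {
  assert(is_zero_bitwise(0.0f));
  assert(!is_zero_bitwise(-0.0f));
  assert(is_zero_ignoring_sign(-0.0f));
}
```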
* Add support for mixed 4-bit/8-bit data types GEMM
* fix ( and )
---------
Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add a couple of configs to generator.py for mixed-input MM
* change one unit test name; reenable 128x32 in the profiler
* Added U8/BF16 tests.
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Fix unrelated MSVC build warnings
* Fix use of isnan in functional.h
Correct namespace qualification of isnan in functional.h
so that it invokes cutlass::isnan for half_t, instead of
converting half_t to float and invoking std::isnan (on host,
or ::isnan on device).
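A minimal sketch of the lookup difference described above. `demo::half_like` is a hypothetical stand-in (it stores a `float` only to keep the example short) and is not `cutlass::half_t`; the counter exists purely to make the implicit conversion visible.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace demo {

int float_conversions = 0;  // counts implicit half_like -> float conversions

// Hypothetical stand-in for a half-precision type.
struct half_like {
  float value;
  operator float() const { ++float_conversions; return value; }
};

// Type-specific overload: inspects the value directly, no conversion.
inline bool isnan(half_like h) { return h.value != h.value; }

// Qualifying with std:: forces the implicit conversion to float.
template <class T>
bool is_nan_via_std(T const& x) { return std::isnan(x); }

// Qualifying with the type's own namespace selects the overload above.
template <class T>
bool is_nan_via_demo(T const& x) { return demo::isnan(x); }

// Usage: both paths detect NaN, but only the std:: path converts.
inline void check_isnan_paths() {
  half_like h{std::numeric_limits<float>::quiet_NaN()};
  float_conversions = 0;
  assert(is_nan_via_demo(h) && float_conversions == 0);
  assert(is_nan_via_std(h) && float_conversions == 1);
}

}  // namespace demo
```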
* fix uint128 operator add for 64-bit hilo implementation
* add uint128 test for operator add
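For reference, a sketch of 128-bit addition over two 64-bit halves with carry propagation. `U128` and `add` are hypothetical names for illustration and are not the code changed by this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit unsigned integer stored as two 64-bit halves.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Add the low halves first; unsigned overflow wraps modulo 2^64, so the
// sum is smaller than an operand exactly when a carry occurred.
inline U128 add(U128 a, U128 b) {
  U128 sum;
  sum.lo = a.lo + b.lo;
  std::uint64_t carry = (sum.lo < a.lo) ? 1u : 0u;
  sum.hi = a.hi + b.hi + carry;
  return sum;
}

// Usage: the carry out of the low half must reach the high half.
inline void check_add_carry() {
  U128 r = add(U128{UINT64_MAX, 0}, U128{1, 0});
  assert(r.lo == 0 && r.hi == 1);
}
```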
* make clang happy
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Allow per-column bias in EpilogueTensorBroadcast
EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.
It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.
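A sketch of the conditional-stride idea in plain C++, under the assumption of a row-major output. `Stride2D`, `bias_stride`, and `add_bias` are hypothetical names, not the `EpilogueTensorBroadcast` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Stride2D {
  std::size_t row;
  std::size_t col;
};

// Per-row bias advances with the row index only; per-column bias advances
// with the column index only. Defaulting to per-row keeps the old behavior.
template <bool PerColumnBias = false>
constexpr Stride2D bias_stride() {
  return PerColumnBias ? Stride2D{0, 1} : Stride2D{1, 0};
}

// Adds a broadcast bias vector to a row-major rows x cols accumulator.
template <bool PerColumnBias = false>
std::vector<float> add_bias(std::vector<float> const& acc, std::size_t rows,
                            std::size_t cols, std::vector<float> const& bias) {
  constexpr Stride2D s = bias_stride<PerColumnBias>();
  std::vector<float> out(acc.size());
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[r * cols + c] = acc[r * cols + c] + bias[r * s.row + c * s.col];
    }
  }
  return out;
}

// Usage: a 2x3 zero accumulator with a per-column bias of length 3.
inline void check_per_column_bias() {
  std::vector<float> acc(6, 0.0f);
  std::vector<float> out = add_bias<true>(acc, 2, 3, {1.0f, 2.0f, 3.0f});
  assert(out[0] == 1.0f && out[4] == 2.0f);
}
```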
* Add unit test for EpilogueTensorBroadcast with per-col bias
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
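A sketch of the general C++ rule behind this kind of fix: unqualified names are not looked up in a dependent base class, so calls must be qualified with `this->` or the base name. The class and member names below are hypothetical and only mirror the shape of the problem.

```cpp
#include <cassert>

// Hypothetical templated base that provides a fragment-style helper.
template <class T>
struct FragmentFactory {
  T make_fragment_a(T x) const { return x + 1; }
};

template <class T>
struct Mainloop : FragmentFactory<T> {
  using Base = FragmentFactory<T>;

  T run(T x) const {
    // return make_fragment_a(x);       // not found: base depends on T
    return this->make_fragment_a(x);    // OK: lookup deferred to instantiation
  }

  T run_with_base_name(T x) const {
    return Base::make_fragment_a(x);    // OK: explicitly qualified
  }
};

// Usage: both qualified forms reach the base-class member.
inline void check_qualified_calls() {
  Mainloop<int> m;
  assert(m.run(41) == 42);
  assert(m.run_with_base_name(1) == 2);
}
```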
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.
* Double-escape special registers in inline asm.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Release 3.3.0
Adds support for mixed precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update