CUTLASS 3.3.0 (#1167)

* Release 3.3.0

Adds support for mixed-precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to sub-byte type handling in CuTe
Several other bug fixes and performance improvements.

* minor doc update
Pradeep Ramani
2023-11-02 08:09:05 -07:00
committed by GitHub
parent 922fb5108b
commit c008b4aea8
263 changed files with 16214 additions and 5008 deletions

@@ -296,7 +296,9 @@ freely with any mainloop. Each mainloop policy either prescribes a `Schedule` wi
it needs to be run, or exposes a template API that lets the user pick a subset of the following schedules:
```c++
struct KernelMultistage { };
struct KernelCpAsyncWarpSpecialized { };
struct KernelCpAsyncWarpSpecializedPingpong { };
struct KernelCpAsyncWarpSpecializedCooperative { };
struct KernelTma { };
struct KernelTmaWarpSpecialized { };
struct KernelTmaWarpSpecializedPingpong { };
@@ -305,7 +307,7 @@ struct KernelTmaWarpSpecializedCooperative { };
- A single kernel schedule can support multiple mainloop implementations. For example,
`KernelMultistage` can be composed with many different mainloop implementations across GPU
-architectures such as `MainloopSm70TwoStage`, `MainloopSm80CpAsyncUnpredicated`, `MainloopSm90CpAsyncGmma`, and many more.
+architectures such as `MainloopSm70TwoStage`, `MainloopSm80CpAsyncUnpredicated`, and many more.
- A single mainloop can be composed with multiple
possible kernel schedules. For example, the `MainloopSm90TmaGmmaWarpSpecialized` can be