Updates for CUTLASS 3.5.0 (#1468)

Vijay Thakkar
2024-04-11 21:33:40 -04:00
committed by GitHub
parent a40e08e9d5
commit 7d49e6c7e2
171 changed files with 7526 additions and 1888 deletions


@ -36,17 +36,17 @@ is the following error when attempting to use clang:
## Required CMake options
The Clang build requires specifying the following CMake options.
Replace `<path-to-clang++>` with the path to your `clang++` executable,
and replace `<path-to-clang>` with the path to your `clang` executable
(which must have the same version as your `clang++` executable).
You may use `clang++` resp. `clang` directly if they are in your `PATH`.
Replace `<path-to-clang++>` with the path to your `clang++` executable.
You may use `clang++` directly if it is in your `PATH`.
* `CMAKE_CXX_COMPILER=<path-to-clang++>`
* `CMAKE_CUDA_HOST_COMPILER=<path-to-clang++>`
* `CMAKE_C_COMPILER=<path-to-clang>`
Please note that both `CMAKE_CXX_COMPILER` and `CMAKE_C_COMPILER`
must be set, even though CUTLASS is a C++ project, not a C project.
One must set both! It's not enough just to set the `CXX` environment
variable, for example. Symptoms of only setting `CMAKE_CXX_COMPILER`
(or only setting the `CXX` environment variable) include `cc1plus`
(GCC's compiler executable) reporting build errors due to it not
understanding Clang's command-line options.
Users can also specify a particular CUDA Toolkit version
by setting the CMake option `CMAKE_CUDA_COMPILER`


@ -317,23 +317,25 @@ The `complement` of a layout attempts to find another layout that represents the
You can find many examples and checked post-conditions in [the `complement` unit test](../../../test/unit/cute/core/complement.cpp). The post-conditions include
```cpp
// @post cosize(make_layout(@a layout_a, @a result))) >= @a cosize_hi
// @post cosize(@a result) >= round_up(@a cosize_hi, cosize(@a layout_a))
// @post cosize(make_layout(@a layout_a, @a result)) >= size(@a cotarget)
// @post cosize(@a result) >= round_up(size(@a cotarget), cosize(@a layout_a))
// @post for all i, 1 <= i < size(@a result),
// @a result(i-1) < @a result(i)
// @post for all i, 1 <= i < size(@a result),
// for all j, 0 <= j < size(@a layout_a),
// @a result(i) != @a layout_a(j)
Layout complement(LayoutA const& layout_a, Integral const& cosize_hi)
Layout complement(LayoutA const& layout_a, Shape const& cotarget)
```
That is, the complement `R` of a layout `A` with respect to an integer `M` satisfies the following properties.
1. The size (and cosize) of `R` is *bounded* by `M`.
That is, the complement `R` of a layout `A` with respect to a Shape (IntTuple) `M` satisfies the following properties.
1. The size (and cosize) of `R` is *bounded* by `size(M)`.
2. `R` is *ordered*. That is, the strides of `R` are positive and increasing. This means that `R` is unique.
3. `A` and `R` have *disjoint* codomains. `R` attempts to "complete" the codomain of `A`.
The `cotarget` parameter above is most commonly an integer -- you can see we only use `size(cotarget)` above. However, sometimes it is useful to specify an integer that has static properties. For example, `28` is a dynamic integer and `(_4,7)` is a shape with size `28` that is statically known to be divisible by `_4`. Both will produce the same `complement` mathematically, but the extra information can be used by `complement` to preserve the staticness of the result as much as possible.
### Complement Examples
`complement` is most effective on static shapes and strides, so consider all integers below to be static. Similar examples for dynamic shapes and strides can be found in the unit test.
`complement` is most effective on static shapes and strides, so consider all integers below to be static. Similar examples for dynamic shapes and strides as well as IntTuple `cotarget` can be found in [the unit test](../../../test/unit/cute/core/complement.cpp).
* `complement(4:1, 24)` is `6:4`. Note that `(4,6):(1,4)` has cosize `24`. The layout `4:1` is effectively repeated 6 times with `6:4`.
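
The `complement(4:1, 24)` example can be checked by hand against the ordered and disjoint properties listed above. The following is a minimal sketch in plain C++ that does not use CuTe at all: `Layout1D`, `is_ordered`, `is_disjoint`, and `combined_cosize` are hypothetical names introduced only for this illustration, and the sketch models nothing beyond a rank-1 `shape:stride` layout.

```cpp
#include <cassert>

// Hypothetical stand-in for a rank-1 layout "shape:stride".
// This is NOT the CuTe API; it only models the index -> offset mapping.
struct Layout1D {
  int shape;
  int stride;
  int operator()(int i) const { return i * stride; }  // offset of index i
  int size() const { return shape; }                   // number of indices
};

// Ordered: r(i-1) < r(i) for all 1 <= i < size(r).
inline bool is_ordered(Layout1D r) {
  for (int i = 1; i < r.size(); ++i) {
    if (!(r(i - 1) < r(i))) return false;
  }
  return true;
}

// Disjoint: r(i) != a(j) for all 1 <= i < size(r) and all 0 <= j < size(a).
inline bool is_disjoint(Layout1D a, Layout1D r) {
  for (int i = 1; i < r.size(); ++i) {
    for (int j = 0; j < a.size(); ++j) {
      if (r(i) == a(j)) return false;
    }
  }
  return true;
}

// Cosize of the rank-2 layout (a.shape, r.shape):(a.stride, r.stride),
// i.e. the largest offset it produces plus one (assumes positive strides).
inline int combined_cosize(Layout1D a, Layout1D r) {
  assert(a.stride > 0 && r.stride > 0);
  return a(a.size() - 1) + r(r.size() - 1) + 1;
}
```

With `Layout1D a{4, 1}` and `Layout1D r{6, 4}`, both checks pass and `combined_cosize(a, r)` evaluates to `24`, matching the cosize of `(4,6):(1,4)` stated in the example.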
@ -425,9 +427,9 @@ Layout Shape : (M, N, L, ...)
Tiler Shape : <TileM, TileN>
logical_divide : ((TileM,RestM), (TileN,RestN), L, ...)
zipped_divide : ((TileM,TileN,...), (RestM,RestN,L,...))
tiled_divide : ((TileM,TileN,...), RestM, RestN, L, ...)
flat_divide : (TileM, TileN, ..., RestM, RestN, L, ...)
zipped_divide : ((TileM,TileN), (RestM,RestN,L,...))
tiled_divide : ((TileM,TileN), RestM, RestN, L, ...)
flat_divide : (TileM, TileN, RestM, RestN, L, ...)
```
For example, the `zipped_divide` function applies `logical_divide`, and then gathers the "subtiles" into a single mode and the "rest" into a single mode.
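
To make the shape table above concrete, the sketch below prints the result shapes of the four divide variants for a rank-3 layout shape `(M, N, L)` and a tiler `<TileM, TileN>`. It is shape bookkeeping only, written in plain C++ without CuTe: `DivideShapes` and `divide_shapes` are hypothetical names, and the sketch assumes that the tile extents evenly divide `M` and `N`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of CuTe: holds the printed result shapes.
struct DivideShapes {
  std::string logical, zipped, tiled, flat;
};

// Builds the shape strings for layout shape (M, N, L) and tiler <TileM, TileN>.
// Assumption: TileM divides M and TileN divides N.
inline DivideShapes divide_shapes(int M, int N, int L, int TileM, int TileN) {
  assert(M % TileM == 0 && N % TileN == 0);
  std::string tm = std::to_string(TileM);
  std::string tn = std::to_string(TileN);
  std::string rm = std::to_string(M / TileM);  // RestM
  std::string rn = std::to_string(N / TileN);  // RestN
  std::string l  = std::to_string(L);

  DivideShapes s;
  // ((TileM,RestM), (TileN,RestN), L)
  s.logical = "((" + tm + "," + rm + "),(" + tn + "," + rn + ")," + l + ")";
  // ((TileM,TileN), (RestM,RestN,L))
  s.zipped  = "((" + tm + "," + tn + "),(" + rm + "," + rn + "," + l + "))";
  // ((TileM,TileN), RestM, RestN, L)
  s.tiled   = "((" + tm + "," + tn + ")," + rm + "," + rn + "," + l + ")";
  // (TileM, TileN, RestM, RestN, L)
  s.flat    = "(" + tm + "," + tn + "," + rm + "," + rn + "," + l + ")";
  return s;
}
```

For instance, `divide_shapes(8, 12, 5, 2, 3)` gives `RestM = 4` and `RestN = 4`, so the zipped shape is `((2,3),(4,4,5))` and the flat shape is `(2,3,4,4,5)`.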


@ -63,13 +63,12 @@ template <
typename T, // element type
int N // number of elements
>
class Array;
struct Array;
```
`Array<class T, int N>` defines a statically sized array of elements of type _T_ and size _N_. This class is similar to
[`std::array<>`](https://en.cppreference.com/w/cpp/container/array) in the Standard Library with two notable exceptions:
* constructors for each element may not be called
* partial specializations exist to pack or unpack elements smaller than one byte.
[`std::array<>`](https://en.cppreference.com/w/cpp/container/array) in the Standard Library with one notable exception:
partial specializations exist to pack or unpack elements smaller than one byte.
`Array<>` is intended to be a convenient and uniform container class to store arrays of numeric elements regardless of data type or vector length. The storage needed is expected to be the minimum necessary given the logical size of each numeric type in bits (numeric types smaller than one byte are densely packed). Nevertheless, the size reported by `sizeof(Array<T, N>)` is always an integer multiple of bytes.
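
The dense packing of sub-byte elements described above can be illustrated with a small standalone container. This is a sketch only and is not how `cutlass::Array` is implemented: `PackedUint4Array`, `get`, and `set` are hypothetical names, and the layout (two unsigned 4-bit values per byte, low nibble first) is an assumption chosen for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; this is not cutlass::Array.
// Stores N unsigned 4-bit values, two per byte, low nibble first.
template <int N>
struct PackedUint4Array {
  static constexpr int kBytes = (N + 1) / 2;  // round up to whole bytes
  std::uint8_t storage[kBytes] = {};

  // Read element i; the result is in the range 0..15.
  std::uint8_t get(int i) const {
    std::uint8_t byte = storage[i / 2];
    return static_cast<std::uint8_t>((i % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
  }

  // Write the low 4 bits of v into element i, leaving its neighbor intact.
  void set(int i, std::uint8_t v) {
    std::uint8_t& byte = storage[i / 2];
    if (i % 2 == 0) {
      byte = static_cast<std::uint8_t>((byte & 0xF0) | (v & 0x0F));
    } else {
      byte = static_cast<std::uint8_t>((byte & 0x0F) | ((v & 0x0F) << 4));
    }
  }
};

// The reported size is a whole number of bytes, as small as the packing allows.
static_assert(sizeof(PackedUint4Array<8>) == 4, "8 x 4 bits pack into 4 bytes");
static_assert(sizeof(PackedUint4Array<3>) == 2, "3 x 4 bits round up to 2 bytes");
```

Eight 4-bit elements occupy four bytes here, and an odd element count rounds up to the next whole byte, which mirrors the statement that `sizeof(Array<T, N>)` is always an integer multiple of bytes.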


@ -210,7 +210,6 @@ GEMM
[int] --inst_k,--instruction-shape::k Math instruction shape in the K dimension
[int] --min_cc,--minimum-compute-capability Minimum device compute capability
[int] --max_cc,--maximum-compute-capability Maximum device compute capability
Examples:
Profile a particular problem size: