Release v4.0.0 (#2294)

This commit is contained in:
Kihiro Bando
2025-05-13 15:55:29 -04:00
committed by GitHub
parent ad7b2f5e84
commit f115c3f854
299 changed files with 51495 additions and 4413 deletions

View File

@ -0,0 +1,10 @@
.. _blackwell:
Blackwell Specific
==================
.. toctree::
:maxdepth: 2
Blackwell SM100/SM120 GEMMs<blackwell_functionality.md>
Blackwell Cluster Launch Control<blackwell_cluster_launch_control.md>

View File

@ -6,7 +6,7 @@ A GEMM workload usually consists of three phases: prologue, mainloop and epilogu
Consider a GEMM that has `20x20x1` output tiles, running on a GPU with `100` SMs. Another kernel occupies all the resources of `20` SMs, so only `80` SMs can be used. Assume the cluster shape is `1x1x1`. The following diagram shows how the schedule would look for such a kernel.
<p align="center"><img src=../../images/non_persistent.png alt="GEMM tiles are evenly divided among available SMs" title="GEMM Scheduling with Limited SM Resources"></p>
<p align="center"><img src=../images/non_persistent.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
### Static Scheduler
@ -14,7 +14,7 @@ CUTLASS has adopted a software technique named **persistent kernels**. Persisten
However, the static scheduler is susceptible to workload imbalance if the resources of some SMs are unavailable. The following diagram illustrates this issue.
<p align="center"><img src=../../images/persistent_static.png alt="GEMM tiles are unevenly divided among available SMs, leading to workload imbalance" title="Imbalanced Workload Scheduling due to Static Scheduler"></p>
<p align="center"><img src=../images/persistent_static.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
### Dynamic Scheduler with Cluster Launch Control
A fundamental limitation of persistent scheduling is that the number of SMs the kernel can actually utilize is not known ahead of time. Some SMs might be occupied by another kernel, making their resources unavailable. This makes it challenging to load-balance work across SMs.
@ -32,7 +32,7 @@ Cluster launch control follows the below rules:
The following diagram shows how the schedule would look with cluster launch control.
<p align="center"><img src=../../images/persistent_clc.png alt="GEMM tiles are dynamically allocated among available SMs, leading to a balanced workload" title="Dynamic Scheduler with Cluster Launch Control"></p>
<p align="center"><img src=../images/persistent_clc.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
## Programming Model
### Pseudo Code
@ -120,7 +120,7 @@ The CLC pipeline has a depth of 3 to overlap the CLC operations of multiple wave
# Copyright
### Copyright
Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -723,7 +723,7 @@ Specialized policies must be used to generate mixed-input-datatype `mx_float4_t`
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
# Copyright
### Copyright
Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@ Users and developers may build either
in Visual Studio's graphical integrated development environment,
or on the command line with `cmake --build`.
# Software prerequisites
## Software prerequisites
1. Windows 10 or 11
@ -22,7 +22,7 @@ or on the command line with `cmake --build`.
Visual Studio must be installed *before* the CUDA Toolkit.
Otherwise, Visual Studio's build system won't know about CUDA.
# Operating system settings
## Operating system settings
By default, Windows restricts the maximum file path length (`MAX_PATH`) to 260 characters.
CUTLASS has many files and directory paths that challenge this requirement.
@ -48,7 +48,7 @@ before attempting to clone or build CUTLASS.
[This Microsoft help article](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry)
explains different ways to change the registry setting.
# Set up build environment
## Set up build environment
1. Run "git bash" to get a familiar command-line interface
@ -62,7 +62,7 @@ explains different ways to change the registry setting.
Alternate approaches may rely on the CMake GUI and/or Windows' native command line.
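For reference, one possible flow from the "git bash" shell is sketched below; the generator name, architecture flag, and `CUTLASS_NVCC_ARCHS` value are placeholders to adapt to your Visual Studio version and target GPU.

```bash
# A sketch only -- adjust generator, paths, and architectures for your setup.
mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUTLASS_NVCC_ARCHS=90a
cmake --build . --config Release -j 8
```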
# Building
## Building
A successful CMake run will create a `CUTLASS.sln` Visual Studio "solution" file in the build directory.
One can open this in Visual Studio and build the entire solution or any subset of projects as desired.
@ -77,7 +77,7 @@ Unlike with CMake's Makefile or Ninja generators,
`CMAKE_BUILD_TYPE` has no effect on the Visual Studio generator,
because the Visual Studio generator creates all build configurations.
# Tips
## Tips
With Windows builds, one may find that CMake reruns unnecessarily.
For example, cancelling a build and starting it again may rerun CMake.
@ -86,7 +86,7 @@ One work-around is to set the CMake option `CMAKE_SUPPRESS_REGENERATION=ON`.
However, this turns off CMake's ability to detect on its own when it needs to rerun.
As a result, one will need to know when to rerun CMake by hand.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@ Clang as host compiler, and NVCC as device compiler.
This is NOT the same as building with
Clang as both host and device compiler ("CUDA Clang").
# Software prerequisites
## Software prerequisites
1. Clang (regularly tested with Clang 17;
occasionally tested with Clang 10 and greater)
@ -29,9 +29,9 @@ A symptom of not installing all needed dependencies
is the following error when attempting to use clang:
`"/usr/bin/ld: cannot find -lstdc++: No such file or directory"`.
# Running CMake
## Running CMake
## Required CMake options
### Required CMake options
The Clang build requires specifying the following CMake options.
Replace `<path-to-clang++>` with the path to your `clang++` executable.
@ -55,7 +55,7 @@ then one can set `CMAKE_CUDA_COMPILER` as follows.
* `CMAKE_CUDA_COMPILER=${PATH_TO_CUDA_TOOLKIT}/bin/nvcc`
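Putting these together, a configure invocation might look like the sketch below; `CMAKE_CXX_COMPILER` and `CMAKE_CUDA_HOST_COMPILER` are standard CMake variables, and the paths and `CUTLASS_NVCC_ARCHS` value are placeholders to adapt.

```bash
# A sketch only -- substitute your clang++ path and CUDA Toolkit location.
cmake .. \
  -DCMAKE_CXX_COMPILER=<path-to-clang++> \
  -DCMAKE_CUDA_HOST_COMPILER=<path-to-clang++> \
  -DCMAKE_CUDA_COMPILER=${PATH_TO_CUDA_TOOLKIT}/bin/nvcc \
  -DCUTLASS_NVCC_ARCHS=90a
```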
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,10 @@
.. _cpp_build:
Build
=====
.. toctree::
:maxdepth: 1
Building on Windows with Visual Studio<building_in_windows_with_visual_studio.md>
Building with Clang as host compiler<building_with_clang_as_host_compiler.md>

View File

@ -1,6 +1,6 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Code Organization")
# CUTLASS Code Organization
# Code Organization
This document describes the layout of the CUTLASS repository. The main components are:

View File

@ -249,9 +249,7 @@ auto same_r = make_layout(composition(layout<0>(a), get<0>(tiler)),
We often use the `<LayoutA, LayoutB, ...>` notation to distinguish `Tiler`s from the concatenation-of-sublayouts notation `(LayoutA, LayoutB, ...)` that we used previously.
The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
<img src="../../../images/cute/composition1.png" alt="composition1.png" height="250"/>
</p>
![composition1.png](../../../images/cute/composition1.png)
For convenience, CuTe also interprets `Shape`s as tilers. A `Shape` is interpreted as a tuple-of-layouts-with-stride-1:
```cpp
@ -268,9 +266,7 @@ auto tiler = make_shape(Int<3>{}, Int<8>{});
auto result = composition(a, tiler);
```
where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
<img src="../../../images/cute/composition2.png" alt="composition2.png" height="250"/>
</p>
![composition2.png](../../../images/cute/composition2.png)
## Composition Tilers
@ -323,9 +319,7 @@ The `cotarget` parameter above is most commonly an integer -- you can see we onl
* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices.
<p align="center">
<img src="../../../images/cute/complement1.png" alt="complement1.png" height="75"/>
</p>
![complement1.png](../../../images/cute/complement1.png)
As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."
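As a quick check of that example, the sketch below builds `(2,2):(1,6)`, computes its complement with respect to a cosize of `24`, and concatenates the two; the `main` wrapper and printing are only for illustration.

```cpp
#include <cstdio>
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto layout = make_layout(make_shape (Int<2>{}, Int<2>{}),
                            make_stride(Int<1>{}, Int<6>{}));   // (2,2):(1,6)
  auto comp   = complement(layout, Int<24>{});                  // expect (3,2):(2,12)
  auto full   = make_layout(layout, comp);  // ((2,2),(3,2)):((1,6),(2,12)), cosize 24
  print(layout); printf("\n");
  print(comp);   printf("\n");
  print(full);   printf("\n");
}
```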
## Division (Tiling)
@ -371,9 +365,7 @@ This is computed in the three steps described in the implementation above.
* Concatenation of `(B,B*) = (4,(2,3)):(2,(1,8))`.
* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
<img src="../../../images/cute/divide1.png" alt="divide1.png" height="150"/>
</p>
![divide1.png](../../../images/cute/divide1.png)
The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A` shown by each of the colors. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
@ -383,9 +375,7 @@ Using the `Tiler` concept defined above, this immediately generalizes to multidi
Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` to which we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.
<p align="center">
<img src="../../../images/cute/divide2.png" alt="divide2.png" height="450"/>
</p>
![divide2.png](../../../images/cute/divide2.png)
The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A` shown by each of the colors. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or as simply a permutation on the rows and cols.
@ -429,9 +419,7 @@ We note that `logical_divide` preserves the *semantics* of the modes while permu
This is not the case with `zipped_divide`. The mode-0 in the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we still can plot the resulting layout as 2-D as shown below.
<p align="center">
<img src="../../../images/cute/divide3.png" alt="divide3.png" height="450"/>
</p>
![divide3.png](../../../images/cute/divide3.png)
We've kept each tile in the same color as in the previous images for clarity. Clearly, iterating across tiles is now equivalent to iterating across a row of this layout, and iterating over elements within a tile is equivalent to iterating down a column of this layout. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data.
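Both forms of the 2-D example are easy to reproduce; the sketch below builds `A = (9,(4,8)):(59,(13,1))` and the tiler `B = <3:3, (2,4):(1,8)>`, then applies `logical_divide` and `zipped_divide`.

```cpp
#include <cstdio>
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto A = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
                       make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
  auto B = make_tile(make_layout(Int<3>{}, Int<3>{}),              // 3:3 applied to mode-0
                     make_layout(make_shape (Int<2>{}, Int<4>{}),  // (2,4):(1,8) applied to mode-1
                                 make_stride(Int<1>{}, Int<8>{})));
  auto ld = logical_divide(A, B);   // ((TileM,RestM),(TileN,RestN))
  auto zd = zipped_divide (A, B);   // ((TileM,TileN),(RestM,RestN))
  print(ld); printf("\n");
  print(zd); printf("\n");
}
```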
@ -476,9 +464,7 @@ This is computed in the three steps described in the implementation above.
* Composition of `A* = (2,3):(2,8)` with `B = 6:1` is then `(2,3):(2,8)`.
* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
<img src="../../../images/cute/product1.png" alt="product1.png" height="175"/>
</p>
![product1.png](../../../images/cute/product1.png)
The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A`, and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
@ -486,9 +472,7 @@ Note that the result is identical to the result of the 1-D Logical Divide exampl
Of course, we can change the number and order of the tiles in the product by changing `B`.
<p align="center">
<img src="../../../images/cute/product2.png" alt="product2.png" height="175"/>
</p>
![product2.png](../../../images/cute/product2.png)
For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order.
@ -496,9 +480,7 @@ For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated til
We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well.
<p align="center">
<img src="../../../images/cute/product2d.png" alt="product2d.png" height="250"/>
</p>
![product2d.png](../../../images/cute/product2d.png)
The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block tiled across a 3x4 column-major arrangement.
@ -519,17 +501,13 @@ Because `A` is always compatible with mode-0 of the result and `B` is always com
This is exactly what `blocked_product` and `raked_product` do and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.
<p align="center">
<img src="../../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
</p>
![productblocked2d.png](../../../images/cute/productblocked2d.png)
The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 column-major arrangement. Also note that `blocked_product` went ahead and `coalesced` mode-0 for us.
Similarly, `raked_product` combines the modes slightly differently. Instead of the resulting "column" mode being constructed from the `A` "column" mode then the `B` "column" mode, the resulting "column" mode is constructed from the `B` "column" mode then the `A` "column" mode.
<p align="center">
<img src="../../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
</p>
![productraked2d.png](../../../images/cute/productraked2d.png)
This results in the "tile" `A` now being interleaved or "raked" with the "layout-of-tiles" `B` instead of appearing as blocks. Other references call this a "cyclic distribution."

View File

@ -269,9 +269,7 @@ Tensor E = A(make_coord(_,1),make_coord(0,_,1));
Tensor F = A(make_coord(2,_),make_coord(_,3,_));
```
<p align="center">
<img src="../../../images/cute/slice.png" alt="slice.png" height="300"/>
</p>
![slice.png](../../../images/cute/slice.png)
In the image above, a `Tensor` is sliced in various ways and the subtensors generated by those slices are highlighted within the original tensor. Note that tensors `C` and `D` contain the same elements, but have different ranks and shapes due to the use of `_` versus `make_coord(_,_)`. In each case, the rank of the result is equal to the number of `Underscore`s in the slicing coordinate.
@ -327,9 +325,7 @@ Tensor tv = composition(A, tv_layout); // (8,4)
Tensor v = tv(threadIdx.x, _); // (4)
```
<p align="center">
<img src="../../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
</p>
![tv_layout.png](../../../images/cute/tv_layout.png)
The above image is a visual representation of the above code. An arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. The result of the composition is on the right, where each thread's values are arranged across each row. The bottom layout depicts the inverse TV-layout, which maps each 4x8 logical coordinate to the thread id and value id that own it.
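The exact TV-layout from the figure is not reproduced here, but the mechanics are easy to experiment with. The sketch below assumes a much simpler partitioning -- each of 8 threads owns one 4-element column of a 4x8 row-major tile -- purely to show how a data layout is composed with a TV-layout and how the TV-layout can be inverted.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 4x8 row-major data layout (an assumption for this sketch)
  auto A  = make_layout(make_shape(Int<4>{}, Int<8>{}), make_stride(Int<8>{}, Int<1>{}));
  // TV-layout: (thread, value) -> flat 4x8 coordinate; thread t owns column t
  auto tv = make_layout(make_shape(Int<8>{}, Int<4>{}), make_stride(Int<4>{}, Int<1>{}));
  auto thr_view = composition(A, tv);   // (8,4): entry (t,v) is the data offset of thread t's value v
  auto inv      = right_inverse(tv);    // flat 4x8 coordinate -> linearized (thread, value)
  print_layout(thr_view);
  print(inv);
}
```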

View File

@ -208,9 +208,7 @@ Volta architecture implements an HMMA instruction where a group of 8 threads cal
We first take a look at how we would take the ISA semantics of thread and data partitioning for the HMMA instruction, and encode it in a Traits struct. The HMMA NT instruction has the thread-data layout:
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
</p>
![HMMA.8x8x4.NT.png](../../../images/cute/HMMA.8x8x4.NT.png)
### Types
@ -250,9 +248,7 @@ Again, this layout function maps the logical thread id [0,8) of the MMA operatio
Let us look at exactly how the 8 threads within a QP are mapped to the A, B and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is shown the whole QP level view, and on the right is shown the values owned by just thread 0.
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
</p>
![HMMA.8x8x4.quadpair.C.png](../../../images/cute/HMMA.8x8x4.quadpair.C.png)
The metainformation of this single instruction level view is what we want to encode in CuTe. Specifically, the QP level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and `(m, n)` coordinate in the C matrix which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.
@ -320,9 +316,7 @@ In the case of F16 accumulators, the layout is way less complex. Each row of acc
A and B matrix layouts depend on whether the sources are transposed or not. The diagram below shows the thread ID to data ownership map for A and B matrices in the case of NT and TN transposes.
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
</p>
![HMMA.8x8x4.quadpair.AB.png](../../../images/cute/HMMA.8x8x4.quadpair.AB.png)
Let's look at the TN layout for the A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each thread owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping, `(m, k) -> m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)`, which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is:
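In CuTe types, that reasoning corresponds to a layout along the lines of the sketch below (the alias name is illustrative; the exact typedef in the source may differ):

```cpp
// Sketch: 8 threads x 4 values; stride 1 along M (threads), stride 8 along K (values)
using ALayout = Layout<Shape<_8,_4>, Stride<_1,_8>>;
```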
@ -375,17 +369,13 @@ using ThrID = Layout<_128, _1>;
Accumulators are mapped hierarchically in GMMA, starting from the concept of a core matrix and building up to a layout for the whole C matrix tile. Let's look at this core matrix first. We only consider fp16 accumulators here, but the extension to fp32 accumulators is trivial, as we will see later.
Each core matrix has the layout as shown in the diagram below.
<p align="center">
<img src="../../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
</p>
![gmma_coremat_cd_fp16.png](../../../images/cute/gmma_coremat_cd_fp16.png)
As in the Volta examples, the thread IDs are logical only, and which of the four warps they belong to in the warpgroup is not important.
Then GMMA tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below.
<p align="center">
<img src="../../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
</p>
![gmma_wg_n_slice.png](../../../images/cute/gmma_wg_n_slice.png)
With this image, we are again ready to start building the `CLayout` for `SM90_64x128x16_F16F16F16F16_TN` atom. Same as before, we are constructing a mapping between the `(logical_thr_id, logical_val_id) -> (m, n)` coordinate spaces.
@ -452,9 +442,7 @@ Let's start with `SM70_8x8x4_F32F16F16F32_NT`.
MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
</p>
![HMMA.8x8x4.NT_Atom.png](../../../images/cute/HMMA.8x8x4.NT_Atom.png)
The above is equivalent to
```cpp
@ -472,9 +460,7 @@ We can create an object akin to a WMMA by using four of these quadpair MMAs:
Stride<_2,_1>>{}); // 2x2 n-major layout of Atoms
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2.png](../../../images/cute/HMMA.8x8x4.NT_2x2.png)
This `TiledMMA` replicates the `MMA_Atom` across threads as we can see the `T4` and `T8` and `T12` threads in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the atom's partitioning pattern for a new quadpair and this replication follows a `(2,2):(2,1)` layout.
The above represents a 16x16x4 MMA now, but we can immediately expand this "tile size" up to 32x32x4 instead:
@ -485,9 +471,7 @@ The above represents a 16x16x4 MMA now, but we can immediately expand this "tile
Tile<_32,_32,_4>{}); // 32x32x4 tiler
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2_32x32x4.png](../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png)
This `TiledMMA` replicates the previous `TiledMMA` across values instead of threads. We can see the `T0V8` and `T16V8` and `T8V8` values in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the previous `TiledMMA`'s partitioning pattern for a new set of values.
Continuing, we see that there are eight values that `T0` receives from the `A`-matrix. Those reads occur at coordinates
@ -513,9 +497,7 @@ which are separate, but we might prefer them to be next to each other. That is w
_4>{}); // Permutation on K, size 4 identity
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2_32Mx32x4.png](../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png)
That layout `(4,4,2):(1,8,4)` is read like a scatter permutation, telling the m-coords of the original image where to go in the new image.
```

View File

@ -334,9 +334,7 @@ These thread layouts are then used to partition the tiles of data in global memo
```
where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`.
<p align="center">
<img src="../../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
</p>
![tC_partitioning.png](../../../images/cute/tC_partitioning.png)
This diagram shows a `tC` layout, highlights two threads in green and blue, shows the projections of the `tC` layout, and finally highlights the subtensors within `sA`, `sB`, and `gC` that `tCsA`, `tCsB`, and `tCgC` represent.
With the data partitioned across the threads, *every thread* can now participate in the compute step by writing
@ -390,9 +388,7 @@ As a first example, lets look at the `TiledCopy` that `gemm_nt` generates.
print_latex(copyA);
```
The easiest way to see what this `TiledCopy` does is to look at the partition pattern in LaTeX.
<p align="center">
<img src="../../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
</p>
![TiledCopyA.png](../../../images/cute/TiledCopyA.png)
On the left is the source-tensor partitioning and on the right is the destination-tensor partitioning. The partition patterns are the same for this case, but there exist PTX instructions which require different patterns in the source and destination. The diagram shows that each thread reads 4x1 `TA` elements and there are 32x8 threads. The `UniversalCopy<uint128_t>` forces the instruction to use a 128-bit copy instruction. If the partition (of `sA` or `gA` in this case) does not result in 4 `TA` elements that can be vectorized to a 128-bit load/store, then CuTe will statically fail with an error message to that effect.
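For context, a `TiledCopy` with this partitioning pattern could be assembled roughly as below; the element type `TA` and the 32x8 thread / 4x1 value layouts follow the description above, but treat this as a sketch rather than the exact code in `gemm_nt`.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

using TA = float;  // assumed element type

// Sketch: 128-bit vectorized copy atom tiled over 32x8 threads with 4x1 values per thread
inline auto make_copyA() {
  return make_tiled_copy(Copy_Atom<UniversalCopy<uint128_t>, TA>{},
                         Layout<Shape<_32,_8>>{},   // thread layout (32x8, column-major)
                         Layout<Shape< _4,_1>>{});  // value layout  (4 elements per thread)
}
```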
To use the `TiledCopy`, the kernel writes
@ -421,9 +417,7 @@ As a first example, lets look at the `TiledMMA` that `gemm_nt` generates.
print_latex(mmaC);
```
The easiest way to see what this `TiledMMA` does is to look at the partition pattern in LaTeX.
<p align="center">
<img src="../../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
</p>
![TiledMmaC.png](../../../images/cute/TiledMmaC.png)
On the left is the A-tensor partitioning, on the top is the B-tensor partitioning, and in the middle is the C-tensor partitioning. Because the `UniversalFMA` is a 1x1x1 MMA instruction, a 16x16x1 tiling of them results in a 16x16x1 `TiledMMA`. Other MMA instructions will have different threads involved and have different instruction sizes. In this case, all threads will read a single element from `A`, `B`, and `C` each.
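Similarly, a `TiledMMA` with this shape could be built along the lines of the sketch below, tiling the 1x1x1 `UniversalFMA` atom over a 16x16x1 thread layout; the element types are assumptions.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

using TA = float; using TB = float; using TC = float;  // assumed element types

// Sketch: a 16x16x1 tiling of the 1x1x1 UniversalFMA atom -> 16x16x1 TiledMMA across 256 threads
inline auto make_mmaC() {
  return make_tiled_mma(UniversalFMA<TC,TA,TB>{},
                        Layout<Shape<_16,_16,_1>>{});  // thread layout across (M,N,K)
}
```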
To use the `TiledMMA`, the kernel writes

View File

@ -8,7 +8,7 @@ What is an `ArithTuple`? Are those tensor strides? What do those mean? What is t
This documentation intends to answer those questions and introduce some of the more advanced features of CuTe.
# Introduction to TMA instructions
## Introduction to TMA instructions
The Tensor Memory Accelerator (TMA) is a set of instructions for copying possibly multidimensional arrays between global and shared memory. TMA was introduced in the Hopper architecture. A single TMA instruction can copy an entire tile of data all at once. As a result, the hardware no longer needs to compute individual memory addresses and issue a separate copy instruction for each element of the tile.
@ -53,9 +53,9 @@ That means that an ordinary CuTe Tensor that stores a GMEM pointer and computes
What do we do?
# Building a TMA Tensor
## Building a TMA Tensor
## Implicit CuTe Tensors
### Implicit CuTe Tensors
All CuTe Tensors are compositions of Layouts and Iterators. An ordinary global memory tensor's iterator is its global memory pointer. However, a CuTe Tensor's iterator doesn't have to be a pointer; it can be any random-access iterator.
@ -83,7 +83,7 @@ This tensor maps logical coordinates to on-the-fly computed integers. Because it
But the TMA doesn't consume pointers or integers; it consumes coordinates. Can we make a tensor of implicit TMA
coordinates for the TMA instruction to consume? If so, then we could presumably also tile, partition, and slice that tensor of coordinates so that we would always have the right TMA coordinate to give to the instruction.
## ArithTupleIterators and ArithTuples
### ArithTupleIterators and ArithTuples
First, we build a `counting_iterator` equivalent for TMA coordinates. It should support
@ -110,7 +110,7 @@ In summary, one creates a TMA descriptor for the *whole global memory tensor*. T
We can now track and offset TMA coordinates with this iterator, but how do we get CuTe Layouts to generate non-integer offsets?
## Strides aren't just integers
### Strides aren't just integers
Ordinary tensors have a layout that maps
a logical coordinate `(i,j)` into a 1-D linear index `k`.
@ -122,7 +122,7 @@ to a TMA coordinate, rather than to a 1-D linear index.
To do this, we can abstract what a stride is. Strides need not be integers, but rather any algebraic object that supports inner-product with the integers (the logical coordinate). The obvious choice is the `ArithmeticTuple` we used earlier since they can be added to each other, but this time additionally equipped with an `operator*` so it can also be scaled by an integer.
### Aside: Integer-module strides
#### Aside: Integer-module strides
A group of objects that support addition between elements and product between elements and integers is called an integer-module.
@ -133,7 +133,7 @@ Rank-R tuples of integers are an integer-module.
In principle, layout strides may be any integer-module.
### Basis elements
#### Basis elements
CuTe's basis elements live in the header file `cute/numeric/arithmetic_tuple.hpp`.
To make it easy to create `ArithmeticTuple`s that can be used as strides, CuTe defines normalized basis elements using the `E` type alias. "Normalized" means that the scaling factor of the basis element is the compile-time integer 1.
@ -172,7 +172,7 @@ Intuitively, "compatible" means that
the nested structure of the two basis elements
matches well enough to add the two elements together.
### Linear combinations of strides
#### Linear combinations of strides
Layouts work by taking the inner product
of the natural coordinate with their strides.
@ -200,7 +200,7 @@ and can be interpreted as the coordinate `((7,4),23)`.
Thus, linear combinations of these strides can be used to generate TMA coordinates.
These coordinates, in turn, can be used to offset TMA coordinate iterators.
## Application to TMA Tensors
### Application to TMA Tensors
Now we can build CuTe Tensors like the one seen in the introduction.
@ -230,7 +230,7 @@ ArithTuple(0,0) o (4,5):(_1@1,_1@0):
(0,3) (1,3) (2,3) (3,3) (4,3)
```
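A tensor like the one printed above could be assembled roughly as follows; the sketch assumes `make_inttuple_iter` is available as the coordinate-counting iterator and uses the basis-element strides described earlier.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Sketch: iterator over TMA coordinates starting at (0,0), with basis-element strides
  auto coords = make_tensor(make_inttuple_iter(0, 0),
                            make_layout(make_shape (Int<4>{}, Int<5>{}),
                                        make_stride(E<1>{},   E<0>{})));  // (4,5):(_1@1,_1@0)
  print_tensor(coords);
}
```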
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -4,7 +4,7 @@ CuTe
====================
.. toctree::
:maxdepth: 2
:maxdepth: 1
00_quickstart<00_quickstart.md>
01_layout<01_layout.md>

View File

@ -0,0 +1,12 @@
.. _cutlass_2_x:
CUTLASS 2.x
==================
.. toctree::
:maxdepth: 2
Layouts and Tensors<layout.md>
GEMM API<gemm_api.md>
Tile Iterator Concepts<tile_iterator_concept.md>
Utilities<utilities.md>

View File

@ -0,0 +1,11 @@
.. _cutlass_3_x:
CUTLASS 3.x
==================
.. toctree::
:maxdepth: 2
Design <cutlass_3x_design.md>
GEMM Backwards Compatibility <cutlass_3x_backwards_compatibility.md>
GEMM API <gemm_api_3x.md>

View File

@ -438,7 +438,7 @@ obtain the kernel's configuration parameters. Users can use these to approximate
for 3.0 API kernels. However, the reflective interfaces cannot always match the types exactly,
as the mappings are not always bijective.
# Copyright
### Copyright
Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -114,7 +114,7 @@ In this way, CuTe reifies the thread-to-data-layout mapping,
making it easier to write code that is "correct by construction".
If the code compiles, it's probably correct.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -277,7 +277,7 @@ CUDA exposes warp-level matrix operations in the CUDA C++ WMMA API. The CUDA C++
| **B** | `RowMajor`, `ColumnMajor` | `RowMajor`, `ColumnMajor` |
| **C** | `RowMajor`, `ColumnMajor` | `RowMajor`, `ColumnMajor` |
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -355,7 +355,7 @@ support on current and future NVIDIA GPUs.
```
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@
CUTLASS presents a uniform programming model for matrix multiply-accumulate operations at each level of the hierarchy. This document
focuses on device-level GEMMs, threadblock-level GEMMs, warp-level GEMMs, thread-level GEMMs, and instruction-level GEMMs.
# CUTLASS GEMM Model
## CUTLASS GEMM Model
CUTLASS implements the basic GEMM triple loop nest with a tiled structure mirroring the execution model hierarchy.
@ -62,7 +62,7 @@ warp-synchronous matrix multiply instructions targeting Tensor Cores.
Alternatively, GEMMs targeting single-thread instructions may have an additional series of nested loops corresponding to
thread-level concurrency.
# CUTLASS GEMM Components
## CUTLASS GEMM Components
This loop nest is expressed in CUTLASS via the following components which are specialized for data type, layout, and
math instruction.
@ -71,7 +71,7 @@ math instruction.
These components are described in the following sections.
## Device-wide GEMM API
### Device-wide GEMM API
The device-level GEMM API is intended to streamline instantiation and execution of the standard
GEMM computation across the GPU. This operator is intended to be used in host-side .cu code and
@ -119,7 +119,7 @@ The device-wide GEMM API is embodied by the following operators:
```
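As a concrete illustration, a single-precision device-level GEMM can be instantiated and launched roughly as in the sketch below (column-major operands and default tile sizes are assumptions; error handling is left minimal).

```cpp
#include "cutlass/gemm/device/gemm.h"

// Sketch: SGEMM with all operands column-major, using the device-level operator
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C / D

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const* A, int lda,
                          float const* B, int ldb,
                          float beta, float* C, int ldc) {
  Gemm::Arguments args({M, N, K},       // GEMM problem size
                       {A, lda},        // TensorRef to A
                       {B, ldb},        // TensorRef to B
                       {C, ldc},        // TensorRef to C (source)
                       {C, ldc},        // TensorRef to D (destination)
                       {alpha, beta});  // epilogue scalars
  Gemm gemm_op;
  return gemm_op(args);                 // launches the kernel on the default stream
}
```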
## Threadblock-level GEMM API
### Threadblock-level GEMM API
GEMMs at this scope are expected to efficiently load tiles of data from global memory into internal storage and then compute matrix
products with warp-level GEMM operators.
@ -196,7 +196,7 @@ struct Mma {
};
```
## Warp-level Matrix Multiply API
### Warp-level Matrix Multiply API
Warp-level GEMM operators load tiles from shared memory into registers and then compute matrix multiplies using either
Tensor Cores or CUDA Cores. The result is accumulated in a register tile. Iterators are defined for each
@ -416,7 +416,7 @@ class MmaSimt;
```
## Thread-level GEMM API
### Thread-level GEMM API
Thread-level GEMM operations perform matrix multiply-accumulate on data held in registers. These target CUDA Cores exclusively.
@ -502,7 +502,7 @@ struct Mma;
} // namespace cutlass
```
## Efficient Epilogue
### Efficient Epilogue
CUTLASS GEMM operators perform the MMA followed by an epilogue operation, similar
to cuBLAS. CUTLASS implements an efficient row-major epilogue. Thus, to achieve
@ -529,7 +529,7 @@ of input layouts. Thus, CUTLASS supports the following layout combinations for i
- `{N,T} x {N,T} => {N,T}` - NN, NT, TN, TT GEMM for both row-major and column-major output
## Instruction-level operations
### Instruction-level operations
CUTLASS defines a template-based interface to Tensor Core operations to avoid resorting
to inline PTX.
@ -538,7 +538,7 @@ to inline PTX.
- [mma_sm75.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/mma_sm75.h) - Turing TensorCore operations
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -19,7 +19,7 @@ Device, Kernel, and Collective.
It also briefly discusses the Tiled MMA/Copy and Atom level,
and then refers readers to CuTe's tutorial for more information.
# CUTLASS GEMM Model
## CUTLASS GEMM Model
CUTLASS implements algorithms that express
the classical "triply nested loop" GEMM algorithm
@ -80,7 +80,7 @@ and computes MMAs.
These tiled copy and tiled mma iterations are generally
fully static and get fully unrolled.
# CUTLASS GEMM Components
## CUTLASS GEMM Components
CUTLASS expresses the above loop nest
with the following components which are specialized for
@ -146,7 +146,7 @@ using GemmHandle = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
Towards the end, we also briefly cover CuTe's tiled mma and copy as well as the atom layer APIs,
before redirecting users to CuTe-specific documentation for further details.
## Collective API
### Collective API
A Collective is "the largest collection of threads
onto which mma atoms and copy atoms are tiled."
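For orientation, a collective mainloop is typically obtained through the `CollectiveBuilder`; the sketch below shows one plausible Hopper FP16 configuration -- the tile shape, cluster shape, and alignments are assumptions, and the `Auto` policies let the builder pick a schedule and stage count.

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

// Sketch: FP16 inputs, FP32 accumulation; the builder selects a TMA-based mainloop
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,    // A: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,    // B: type, layout, alignment
    float,                                               // accumulator type
    cute::Shape<cute::_128, cute::_128, cute::_64>,      // CTA tile shape (M,N,K)
    cute::Shape<cute::_1, cute::_1, cute::_1>,           // cluster shape
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```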
@ -670,7 +670,7 @@ please refer to CuTe's tutorial, e.g., the sections on
* [a GEMM example](./cute/0x_gemm_tutorial.md).
# Copyright
### Copyright
Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,16 @@
.. _getting_started:
Getting Started
==================
.. toctree::
:maxdepth: 2
Quickstart<quickstart.md>
IDE Setup<ide_setup.md>
Build<build/index>
Functionality<functionality.md>
Terminology<terminology.md>
Fundamental Types<fundamental_types.md>
Programming Guidelines<programming_guidelines.md>

View File

@ -1,6 +1,6 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Grouped Kernel Schedulers")
# CUTLASS Grouped Kernel Schedulers
# Grouped Kernel Schedulers
CUTLASS's grouped kernel is a persistent kernel which launches multiple problems (e.g., GEMMs, SYR2Ks) within a
single CUDA kernel launch.

View File

@ -118,7 +118,7 @@ This is usually a convenient way to configure projects, but it's not as simple f
clang doesn't understand many of the compiler flags used by nvcc. Hence, for now, we don't recommend using
`compile_commands.json` to configure your CUDA project.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -217,7 +217,7 @@ and `TensorRef` objects for each of the operands whose extents are implied as a
redundant storage of extent quantities, CUTLASS minimizes capacity utilization of precious resources such as constant memory.
This is consistent with BLAS conventions.
# Summary:
## Summary:
The design patterns described in this document form a hierarchy:
* `T *ptr;` is a pointer to a contiguous sequence of elements of type `T`
@ -225,7 +225,7 @@ The design patterns described in this document form a hierarchy:
* `TensorRef<T, Layout> ref(ptr, layout);` is an object pointing to an _unbounded_ tensor containing elements of type `T` and a layout of type `Layout`
* `TensorView<T, Layout> view(ref, extent);` is an object pointing to a _bounded_ tensor containing elements of type `T` and a layout of type `Layout`
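As a small sketch of this hierarchy (a 4x8 row-major `float` matrix is an assumption for illustration):

```cpp
#include "cutlass/layout/matrix.h"
#include "cutlass/matrix_coord.h"
#include "cutlass/tensor_ref.h"
#include "cutlass/tensor_view.h"

// Sketch: wrap a raw pointer in progressively richer views
void describe(float* ptr) {
  cutlass::layout::RowMajor layout(8);                                      // maps (row, col) -> offset with ldm = 8
  cutlass::TensorRef<float, cutlass::layout::RowMajor> ref(ptr, layout);    // unbounded tensor
  cutlass::TensorView<float, cutlass::layout::RowMajor> view(ref, {4, 8});  // bounded by extent 4x8
  view.at({2, 3}) = 1.0f;                                                   // element access through the layout mapping
}
```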
# Appendix: Existing Layouts
### Appendix: Existing Layouts
This section enumerates several existing Layout types defined in CUTLASS.
@ -268,7 +268,7 @@ Permuted Shared Memory Layouts:
- `TensorOpCrosswise<ElementSize>`
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -1,619 +0,0 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# Overview
# CUTLASS 3.9.0
_CUTLASS 3.9.0 - March 2025_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes
these "moving parts" into reusable, modular software components abstracted by C++ template
classes. Primitives for different levels of a conceptual parallelization hierarchy
can be specialized and tuned via custom tiling sizes, data types,
and other algorithmic policy. The resulting flexibility simplifies their use
as building blocks within custom kernels and applications.
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
[FP32 emulation via tensor core instruction](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
8b floating point types (e5m2 and e4m3),
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
narrow integer types (4 and 8b signed and unsigned integers),
and binary 1b data types (where architectures allow for the
native support of such data types).
CUTLASS demonstrates optimal matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via
the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
See the [Quick Start Guide](quickstart.md) to get started quickly.
See the [functionality docs](functionality.md) for a more comprehensive
list of kernel-level features, data types, instructions, and minimum CUDA toolkit versions supported by CUTLASS on each GPU
architecture.
# What's New in CUTLASS 3.9
* Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
- Collective mainloops that target:
* [Blockscaled datatypes with support for dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_mma_tma.hpp)
* [Blockscaled datatypes with support for sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_sparse_mma_tma.hpp)
- New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell SM120 epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp) and [full set of EVT fusions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
- [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
- [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
* Set of unit tests that demonstrate the usage of both [sparse](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
* Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
- Enhancement of [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.
- Enhancement of [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.
- Support for [grouped GEMM with blockwise and groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
- Support for [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture.
- Support for [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture.
- Support for [grouped GEMM with blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.
* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
- More detailed introductions and examples to leverage this feature can be found in [profiler.md](./profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
**See the [CHANGELOG](../release_notes.md) for details of all past releases and updates.**
# Performance
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit nearly optimal utilization of peak theoretical throughput. The figure below
shows CUTLASS 3.8's performance as a % of theoretical peak utilization
on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
![ALT](../../images/cutlass-3.8-blackwell-gemm-peak-performance.svg "")
The two figures below show the continual CUTLASS performance improvements
on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since
CUTLASS 3.1.
CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads).
Tensor Core operations are implemented using CUDA's
[mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and
[wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions.
![ALT](../../images/cutlass-3.5.1-gemm-peak-performance.png "")
![ALT](../../images/cutlass-3.5.1-gemm-peak-performance-fp8.png "")
# CuTe
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts
which can be composed with data arrays to represent tensors.
The representation of layouts is powerful enough to represent nearly
everything we need to implement efficient dense linear algebra.
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
[dedicated documentation directory](cute/00_quickstart.md).
# Compatibility
Minimum requirements:
- Architecture: Volta (compute capability 7.0)
- Compiler: Must support at least C++17
- CUDA Toolkit version: 11.4
CUTLASS requires a C++17 host compiler and
performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads).
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.
## Operating Systems
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.3.0 |
| Ubuntu 22.04 | GCC 11.2.0 |
Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
## Hardware
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
|---|---|---|
|NVIDIA V100 Tensor Core GPU |7.0|11.4|
|NVIDIA TitanV |7.0|11.4|
|NVIDIA GeForce RTX 20x0 series |7.5|11.4|
|NVIDIA T4 |7.5|11.4|
|NVIDIA A100 Tensor Core GPU |8.0|11.4|
|NVIDIA A10 |8.6|11.4|
|NVIDIA GeForce RTX 30x0 series |8.6|11.4|
|NVIDIA GeForce RTX 40x0 series |8.9|11.8|
|NVIDIA L40 |8.9|11.8|
|NVIDIA H100 Tensor Core GPU |9.0|11.8|
|NVIDIA H200 Tensor Core GPU |9.0|11.8|
|NVIDIA B200 Tensor Core GPU |10.0|12.8|
|NVIDIA GeForce RTX 50x0 series |10.0|12.8|
## Target Architecture
In general, PTX code generated for one target architecture can be run on future architectures
(i.e., it is forward compatible).
However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose
PTX does not have forward compatibility guarantees.
Several Hopper and Blackwell PTX instructions fall under this category of
architecture-accelerated features, and thus require an `sm_90a` or `sm_100a` target architecture
(note the "a" appended). For more details on this and other architecture-accelerated instructions,
please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
The target architecture information is passed on to CUTLASS via the cmake flag
`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,
users are required to build CUTLASS with `90a` as the target architecture.
If a user accidentally builds a kernel which uses SM90a features
(e.g. Hopper Tensor Core Instructions), using the SM90 target
(note the lack of "a"), with either CUDA Toolkit 12 or 11.8,
the kernel is expected to fail with a runtime error.
```
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
```
Or
```
cmake .. -DCUTLASS_NVCC_ARCHS="100a"
```
Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
products has a different compute capability than the one underpinning
NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
compiled for Blackwell SM100 architecture with arch conditional features
(using `sm100a`) are not compatible with RTX 50 series GPUs.
Please refer to the [functionality documentation](functionality.md)
for details on which kernels require which target architectures.
# Documentation
CUTLASS is described in the following documents and the accompanying
[Doxygen documentation](https://nvidia.github.io/cutlass).
- [Quick Start Guide](quickstart.md) - basics of building and running CUTLASS
- [Functionality](functionality.md) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
- [CUTLASS 3.x Design](cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
- [GEMM API 3.x](gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
- [GEMM API 2.x](gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
- [Implicit GEMM Convolution](implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
- [Code Organization](code_organization.md) - describes the organization and contents of the CUTLASS project
- [Terminology](terminology.md) - describes terms used in the code
- [Programming Guidelines](programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
- [Fundamental types](fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- [Layouts](layout.md) - describes layouts of matrices and tensors in memory
- [Tile Iterators](tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Profiler](profiler.md) - command-line driven profiling application
- [CUTLASS Utilities](utilities.md) - additional templates used to facilitate rapid development
- [Dependent kernel launch](dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
kernels in the same stream, and how it is used in CUTLASS.
# Resources
We have also described the structure of an efficient GEMM in our talk at the
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
- [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)
# Building CUTLASS
CUTLASS is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS's `include/` directory in their include
paths.
CUTLASS unit tests, examples, and utilities can be built with CMake.
The minimum version of CMake is given in the [Quickstart guide](quickstart.md).
Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
on your system.
```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
```
Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
To reduce compile time you can specify
the architectures to build CUTLASS for by changing the CMake configuration setting
`CUTLASS_NVCC_ARCHS`.
```bash
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 # compiles for NVIDIA's Ampere Architecture
```
From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
and they may be executed in parallel via make's `-j` command line argument.
```bash
$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[ PASSED ] 946 tests.
```
All tests should pass on supported platforms, though the exact number of tests may vary over time.
# Project Structure
CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes,
and template concepts defined in the CUTLASS project.
A detailed explanation of the source code organization may be found in the
[CUTLASS documentation](code_organization.md), but several main components are summarized below.
## CUTLASS Template Library
```
include/ # client applications should target this directory in their build's include paths
cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
arch/ # direct exposure of architecture features (including instruction-level GEMMs)
conv/ # code specialized for convolution
epilogue/ # code specialized for the epilogue of gemm/convolution
gemm/ # code specialized for general matrix product computations
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
platform/ # CUDA-capable Standard Library components
reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" model
thread/ # simt code that can be performed within a CUDA thread
transform/ # code specialized for layout, type, and domain transformations
* # core vocabulary types, containers, and basic numeric operations
cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples
arch/ # Bare bones PTX wrapper structs for copy and math instructions
atom/ # Meta-information either link to or built from arch/ operators
mma_atom.hpp # cute::Mma_Atom and cute::TiledMma
copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy
*sm*.hpp # Arch specific meta-information for copy and math operations
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
```
### CUTLASS SDK Examples
[CUTLASS SDK examples](https://github.com/NVIDIA/cutlass/tree/main/examples) apply CUTLASS templates to implement basic computations.
### Tools
```
tools/
library/ # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
include/
cutlass/
library/
profiler/ # CUTLASS Profiler - command-line utility for executing operations in the
# CUTLASS Library
util/ # CUTLASS Utilities - contains numerous helper classes for
include/ # managing tensors in device memory, reference
cutlass/ # implementations for GEMM, random initialization
util/ # of tensors, and I/O.
```
### Test
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
Instructions for building and running the Unit tests are described in the [Quickstart guide](quickstart.md).
# Performance Profiling
The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
It can be built as follows:
```bash
$ make cutlass_profiler -j16
```
## Building all GEMM and Convolution kernels (_long_ build times)
By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all, set the following environment variable when running CMake from an empty `build/` directory.
Beware, this results in *tens of thousands* of kernels and long build times.
This would also result in a large binary size and, on some platforms, cause the linker to fail when building the library.
Therefore, it's highly recommended to generate only a subset of kernels as demonstrated in the sub-section below.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
...
$ make cutlass_profiler -j16
```
## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
To compile exactly one kernel or a small set of kernels, pass a comma-delimited list of kernel names,
optionally containing wildcard characters, to reduce the set of kernels. The following examples show how to build exactly one
kernel or a subset of kernels for the NVIDIA Ampere and Turing architectures:
### Building a subset of Tensor Core GEMM kernels
To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
```bash
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 118489088 bytes
FLOPs: 115992428544 flops
Runtime: 1.55948 ms
Memory: 70.7616 GiB/s
Math: 74378.8 GFLOP/s
=============================
...
```
### Building one CUDA Core GEMM kernel
To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16
```
An example command line for profiling a single SGEMM CUDA kernel is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
Status: Success
Verification: ON
Disposition: Passed
cuBLAS: Passed
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 180355072 bytes
FLOPs: 115992428544 flops
Runtime: 6.73655 ms
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
=============================
```
### Building a subset of Tensor Core Convolution kernels
To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
Bytes: 1130659840 bytes
FLOPs: 118482796544 flops
Runtime: 0.711496 ms
Memory: 1479.99 GiB/s
Math: 166526 GFLOP/s
=============================
...
```
### Building one Convolution CUDA kernel
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16
```
Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 2055798784 bytes
FLOPs: 118482796544 flops
Runtime: 7.34266 ms
Memory: 260.752 GiB/s
Math: 16136.2 GFLOP/s
=============================
```
## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
- [GEMM CMake Examples](quickstart.md#gemm-cmake-examples)
- [Implicit GEMM convolution CMake Examples](quickstart.md#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](profiler.md)
# About
CUTLASS is released by NVIDIA Corporation as Open Source software under the
[3-clause "New" BSD license](LICENSE.txt).
# Contributors
The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).
# Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

View File

@ -45,7 +45,7 @@ compile or fail to launch at runtime.
```bash
$ cmake .. \
-DCUTLASS_NVCC_ARCHS="90a" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_s64x64x16gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
-DCUTLASS_UNITY_BUILD_ENABLED=ON
```
@ -525,7 +525,7 @@ To best illustrate this naming convention, we will walk through the meaning of e
in a GEMM kernel used by the profiler:
```
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f32_{optional-mixed-dtype-config}_128x128x64_2x1x1_0_ntn_align8
cutlass3x_sm90_tensorop_gemm_f16_f16_f32_f16_f32_{optional-mixed-dtype-config}_128x128x64_2x1x1_0_ntn_align8
```
The components within this name are as follows:
@ -553,7 +553,7 @@ Note that in some special cases where the input A/B types do not match that of t
instruction's, the MMA facing input type is added to the instruction string as well.
```
cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
cutlass3x_sm90_tensorop_tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
```
* `s64x128x8tf32gemm`: indicates that the MMA consumes inputs in `tf32` format, and therefore
@ -563,7 +563,7 @@ For custom mainloop or epilogue schedules, details of the opted-in schedule are
kernel name. For example,
```
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma
cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma
```
* `warpspecialized_cooperative`: Mainloop employs a persistent warp-specialized mainloop and kernel schedule.

View File

@ -1157,7 +1157,7 @@ has shape `((X, Y), K)` and stride `((1, X), X*Y)`.
`get<0>(stride)` is the tuple `(1, X)`, not a single integer.
However, A is certainly M major if interpreted as a matrix.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -462,7 +462,7 @@ int main(int argc, char const **args) {
}
```
# CUTLASS Library
## CUTLASS Library
The [CUTLASS Library](https://github.com/NVIDIA/cutlass/tree/main/tools/library) defines an API for managing and executing collections of compiled
kernel instances and launching them from host code without template instantiations in client code.
@ -585,7 +585,7 @@ int main() {
}
```
# Example CMake Commands
## Example CMake Commands
To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
@ -750,7 +750,7 @@ are needed in the mainloop builder:
We encourage a user to refer to Sm100 unit tests and the generated profiler-based kernels as more comprehensive samples.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -78,7 +78,10 @@ replaced by [MMA and Copy atoms from CuTe](cute/0t_mma_atom.md).
**Thread Map**: abstraction for defining how threads are mapped to a given tile. Deprecated starting CUTLASS 3.0.
Replaced by `cute::Layout` in equivalent usage scenarios to represent thread tensors.
# Copyright
[comment]: <> (Don't remove this. This "##" is to prevent Sphinx from throwing build WARNING.)
##
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -469,7 +469,7 @@ struct WriteableReadableRandomAccessContiguousTileIteratorConcept {
};
```
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -431,7 +431,7 @@ Additional information may appear at the end of each line, such as shared memory
Please note that `synclog` is an experimental feature, and its functionality is not always guaranteed. We encourage its use in custom kernels and CUTLASS examples, though it is known to be incompatible with profiler kernels.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,18 @@
.. _cute_dsl:
CuTe DSL
========
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>
Educational Notebooks <cute_dsl_general/notebooks.rst>

View File

@ -0,0 +1,12 @@
.. _cute_dsl_api:
CuTe DSL API
============
.. toctree::
:maxdepth: 1
cute <cute_dsl_api/cute.rst>
cute_arch <cute_dsl_api/cute_arch.rst>
cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>
utils <cute_dsl_api/utils.rst>

View File

@ -0,0 +1,11 @@
.. _cute:
cutlass.cute
============
.. automodule:: cutlass.cute
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,24 @@
.. _cute_arch:
cutlass.cute.arch
=================
The ``cute.arch`` module contains wrappers around NVVM-level MLIR Op builders that seamlessly
inter-operate with the Python types used in CUTLASS Python. Another benefit of wrapping these Op
builders is that the source location can be tracked with the ``@dsl_user_op`` decorator. Available
functions include
- basic API like ``thr_idx``;
- functions related to the direct management of mbarriers;
- low-level SMEM management (prefer using the ``SmemAllocator`` class);
- TMEM management.
API documentation
-----------------
.. automodule:: cutlass.cute.arch
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,18 @@
.. _cute_nvgpu:
cutlass.cute.nvgpu
==================
The ``cute.nvgpu`` module contains MMA and Copy Operations as well as Operation-specific helper
functions. The arch-agnostic Operations are exposed at the top-level while arch-specific Operations
are grouped into submodules like ``tcgen05``.
.. toctree::
:maxdepth: 2
:hidden:
cute_nvgpu_common
cute_nvgpu_warp
cute_nvgpu_warpgroup
cute_nvgpu_cpasync
cute_nvgpu_tcgen05

View File

@ -0,0 +1,9 @@
.. _cute_nvgpu_common:
Common
======
.. automodule:: cutlass.cute.nvgpu
:members:
:undoc-members:
:show-inheritance:

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_cpasync:
cpasync submodule
=================
.. automodule:: cutlass.cute.nvgpu.cpasync
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_tcgen05:
tcgen05 submodule
=================
.. automodule:: cutlass.cute.nvgpu.tcgen05
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_warp:
warp submodule
==============
.. automodule:: cutlass.cute.nvgpu.warp
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_warpgroup:
warpgroup submodule
===================
.. automodule:: cutlass.cute.nvgpu.warpgroup
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,9 @@
cutlass.utils
=============
.. automodule:: cutlass.utils
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,154 @@
.. _autotuning_gemm:
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are offered within our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate
kernel parameters based on the inputs of real applications.
Next, we'll briefly introduce some tips on how to perform auto-tuning.
The auto-tuning process typically involves the following steps:
1. Define search space
2. Benchmark each configuration and select the kernel with the best performance
3. Enable caching to reduce the tuning cost
The search space defines the valid combinations of kernel parameters that can be used to run the kernels.
Different inputs (shapes, data types, etc.) typically require different kernel parameters to achieve optimal performance.
The search space depends on the kernel. Taking the Blackwell persistent GEMM kernel as an example, the search space
consists of the following parameters (a small sketch in plain Python follows the list):
- ``mma_tiler_mn``: Defines the dimensions of the matrix tile that each Matrix Multiply-Accumulate (MMA) instruction processes in a single operation.
- ``cluster_shape_mn``: Specifies the number of CTAs along each dimension within a cluster. Refer to the `Parallel Thread Execution ISA documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tensorcore-5th-generation-family-instructions>`_ for the possible MMA tiler sizes and cluster shapes for different tensor data types.
- ``use_2cta_instrs``: Whether to utilize Blackwell's 2 CTA instructions for MMA/Copy.
- ``use_tma_store``: Whether to use Tensor Memory Access (TMA) instructions to store the result back to global memory.
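For illustration, such a search space can be written down as plain Python lists that feed the tuning loop shown further below. The concrete values here are only a sketch, not a validated set for any particular GPU or data type.

.. code-block:: python

    # Illustrative search space for the Blackwell persistent dense GEMM kernel.
    # Which values are legal depends on the data types and the PTX ISA rules
    # referenced above, so treat this as a placeholder to be adapted.
    search_space = {
        "use_2cta_instrs_list": [False, True],
        "use_tma_store_list": [True],
        "mma_tiler_m_list": [128, 256],
        "mma_tiler_n_list": [128, 256],
        "cluster_shape_m_list": [1, 2],
        "cluster_shape_n_list": [1, 2],
    }

The dictionary can then be expanded directly into the ``autotune_gemm`` call shown below, e.g. ``autotune_gemm(a, b, c, stream, **search_space)``.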
After defining the search space, we could traverse all parameter combinations to find the optimal kernel.
The ``autotune_gemm`` function below demonstrates a simple exhaustive search approach - it iterates
through configurations, compiles and benchmarks each kernel, and returns the best performing one.
Since kernel compilation incurs overhead, it's important to cache and reuse compiled kernels
to minimize host launch latency. CuTe DSL facilitates this through its separate compilation
and execution workflow. More details can be found in :ref:`JIT_Caching`.
As demonstrated in the ``autotune_gemm`` function
(between the ``begin of cache the compiled GEMM kernel`` and ``end of cache the compiled GEMM kernel`` comments),
we can use ``cute.compile()`` to compile a kernel once, cache the compiled result, and reuse the cached JIT executor for multiple kernel
executions. We could maintain a global configuration-to-kernel dictionary (``config_kernel_dict``) to cache the compiled GEMM kernels,
where each key (``kernel_cache_key``) uniquely identifies a kernel based on its characteristics.
Usually we could use the {dtype + kernel configs} combination as the cache key for GEMM compilation. For example,
.. code-block:: python
kernel_cache_key = f"{ab_dtype}x{c_dtype}x{acc_dtype}x{use_2cta_instrs}x{mma_tiler}x{cluster_shape_mn}x{use_tma_store}"
If the input tensor's layout is static, we should add the shape to the cache key as well.
Users can customize the ``benchmark`` function to measure kernel execution time; a minimal sketch is shown after the list below.
For stable and reliable performance measurements:
1. Run a few warmup iterations (e.g., 5-10) to stabilize GPU temperature
2. Execute multiple timed iterations (e.g., 100-1000) for statistical significance
3. Use CUDA events and synchronization for precise timing
4. Lock GPU frequencies (SM and memory frequencies) with nvidia-smi
5. Process results by removing outliers and using min/avg statistics as measurements.
This ensures reliable kernel selection through proper benchmarking.
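A minimal sketch of such a ``benchmark`` helper is shown below. It assumes PyTorch is available and uses ``torch.cuda.Event`` for device-side timing; the warmup and iteration counts are illustrative, and outlier filtering is omitted for brevity.

.. code-block:: python

    import torch

    def benchmark(launch_fn, warmup_iterations: int = 10, iterations: int = 100) -> float:
        """Return the average execution time of ``launch_fn()`` in milliseconds."""
        # Warmup iterations to stabilize clocks and populate caches
        for _ in range(warmup_iterations):
            launch_fn()
        torch.cuda.synchronize()

        # Time a batch of iterations with CUDA events and report the average
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iterations):
            launch_fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iterations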
.. code-block:: python
# get the best GEMM kernel for given input tensors
def autotune_gemm(
a: cute.Tensor,
b: cute.Tensor,
c: cute.Tensor,
stream: cuda.CUstream,
use_2cta_instrs_list: List[bool] = [True],
use_tma_store_list: List[bool] = [True],
mma_tiler_m_list: List[int] = [256],
mma_tiler_n_list: List[int] = [256],
cluster_shape_m_list: List[int] = [2],
cluster_shape_n_list: List[int] = [1],
):
best_kernel = None
min_time = float("inf")
# traverse the search space
for use_2cta_instrs in use_2cta_instrs_list:
for use_tma_store in use_tma_store_list:
for mma_tiler_mn in product(mma_tiler_m_list, mma_tiler_n_list):
for cluster_shape_mn in product(cluster_shape_m_list, cluster_shape_n_list):
acc_dtype = cutlass.Float32
hardware_info = cutlass.utils.HardwareInfo()
max_active_clusters = hardware_info.get_max_active_clusters(
cluster_shape_mn[0] * cluster_shape_mn[1]
)
# instance a GEMM kernel
gemm = PersistentDenseGemmKernel(
acc_dtype,
use_2cta_instrs,
mma_tiler_mn,
cluster_shape_mn,
use_tma_store,
)
# begin of cache the compiled GEMM kernel
if kernel_cache_key not in config_kernel_dict:
# compile gemm kernel
compiled_gemm = cute.compile(
gemm,
a,
b,
c,
max_active_clusters,
stream,
)
config_kernel_dict[kernel_cache_key] = compiled_gemm
else:
compiled_gemm = config_kernel_dict[kernel_cache_key]
# end of cache the compiled GEMM kernel
try:
# define a benchmark function to measure the execution time of the compiled GEMM kernel
cur_time = benchmark(
partial(compiled_gemm, a, b, c, stream),
)
except Exception as e:
print(f"Execution error: {e}")
cur_time = float("inf")
if cur_time < min_time:
min_time = cur_time
best_kernel = compiled_gemm
if best_kernel is None:
raise ValueError("No best kernel found")
return best_kernel
This brute-force approach ensures that we find the optimal parameters, though at the cost of trying every possibility.
For more advanced use cases, users can explore sophisticated optimization
techniques like search space pruning and genetic algorithms to reduce tuning overhead and discover better
configurations more efficiently.
To further optimize tuning performance, we can utilize caching mechanisms to avoid redundant computations.
We could cache the tuning results in an input-to-kernel dictionary (e.g., ``input_kernel_dict``).
When processing inputs with matching ``config_key`` values, the cached kernel can be reused directly without re-tuning.
The ``config_key`` is derived from the input tensor's characteristics, such as its shape and data type.
The choice of ``config_key`` is flexible; users can customize it based on their own application.
For instance, if the data type is fixed in the application, we could use the input tensor's shape as the key, i.e., ``(m, n, k)``.
To further reduce tuning overhead, we could consider using a simplified key like ``config_key = (power_of_2(m), power_of_2(n), power_of_2(k))``,
where ``m``, ``n``, and ``k`` are rounded up to the nearest power of 2. This simplification can significantly reduce the number
of unique keys while still maintaining good performance in most cases. However, it's important to validate that this
approximation doesn't negatively impact performance for your specific use case.
.. code-block:: python
config_key = (m, n, k)
if config_key in input_kernel_dict:
compiled_gemm = input_kernel_dict[config_key]
else:
compiled_gemm = autotune_gemm(...)
input_kernel_dict[config_key] = compiled_gemm
# launch gemm kernel
compiled_gemm(a_tensor, b_tensor, c_tensor, stream)
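A minimal sketch of the ``power_of_2`` rounding helper mentioned above (the helper name and the exact rounding rule are illustrative):

.. code-block:: python

    def power_of_2(x: int) -> int:
        """Round a positive extent up to the nearest power of two."""
        return 1 << (x - 1).bit_length()

    # Problem shapes that differ only slightly now share one tuning result,
    # e.g. (1000, 4096, 511) and (900, 4096, 500) both map to (1024, 4096, 512).
    config_key = (power_of_2(1000), power_of_2(4096), power_of_2(511))
    assert config_key == (1024, 4096, 512)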
By following the methods above, you can customize your own auto-tuner to find the optimal GEMM kernel configuration
for specific matrix dimensions and data types, significantly improving computational performance for models.

View File

@ -0,0 +1,133 @@
.. _debugging:
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.
Getting Familiar with the Limitations
-------------------------------------
Before diving into comprehensive debugging capabilities, it's important to understand the limitations of CuTe DSL.
Understanding these limitations will help you avoid potential pitfalls from the start.
Please refer to :doc:`../limitations` for more details.
DSL Debugging
-------------
CuTe DSL provides built-in logging mechanisms to help you understand the code execution flow and
some of the internal state.
Enabling Logging
~~~~~~~~~~~~~~~~
CuTe DSL provides environment variables to control the logging level:
.. code:: bash
# Enable console logging (default: False)
export CUTE_DSL_LOG_TO_CONSOLE=1
# Log to file instead of console (default: False)
export CUTE_DSL_LOG_TO_FILE=my_log.txt
# Control log verbosity (0, 10, 20, 30, 40, 50, default: 10)
export CUTE_DSL_LOG_LEVEL=20
Log Categories and Levels
~~~~~~~~~~~~~~~~~~~~~~~~~
Similar to standard Python logging, different log levels provide varying degrees of detail:
+--------+-------------+
| Level | Description |
+========+=============+
| 0 | Disabled |
+--------+-------------+
| 10 | Debug |
+--------+-------------+
| 20 | Info |
+--------+-------------+
| 30 | Warning |
+--------+-------------+
| 40 | Error |
+--------+-------------+
| 50 | Critical |
+--------+-------------+
Dump the generated IR
~~~~~~~~~~~~~~~~~~~~~
For users familiar with MLIR and compilers, CuTe DSL supports dumping the Intermediate Representation (IR).
This helps you verify whether the IR is generated as expected.
.. code:: bash
# Dump Generated CuTe IR (default: False)
export CUTE_DSL_PRINT_IR=1
# Keep Generated CuTe IR in a file (default: False)
export CUTE_DSL_KEEP_IR=1
Kernel Functional Debugging
----------------------------
Using Python's ``print`` and CuTe's ``cute.printf``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CuTe DSL programs can use both Python's native ``print()`` as well as our own ``cute.printf()`` to
print debug information during kernel generation and execution. They differ in a few key ways:
- Python's ``print()`` executes at compile time only (it has no effect on the generated kernel) and is
typically used for printing static values (e.g., a fully static layout).
- ``cute.printf()`` executes at runtime on the GPU itself and changes the PTX being generated. This
can be used for printing values of tensors at runtime for diagnostics, but comes at a performance
overhead similar to that of `printf()` in CUDA C.
For detailed examples of using these functions for debugging, please refer to the associated
notebook referenced in :doc:`notebooks`.
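As a minimal sketch of the difference (the tensor values are arbitrary; see the notebook above for complete examples):

.. code:: python

    import torch
    import cutlass.cute as cute
    from cutlass.cute.runtime import from_dlpack

    @cute.jit
    def debug_print(tensor: cute.Tensor):
        # Compile-time print: runs while the code is being generated
        print("layout:", tensor.layout)
        # Runtime print: emitted into the generated code, similar to printf() in CUDA C
        cute.printf("first element: {}", tensor[0])

    a = torch.tensor([1, 2, 3], dtype=torch.int32)
    debug_print(from_dlpack(a))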
Handling Unresponsive/Hung Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a kernel becomes unresponsive and ``SIGINT`` (``CTRL+C``) fails to terminate it,
you can follow these steps to forcefully terminate the process:
1. Use ``CTRL+Z`` to suspend the unresponsive kernel
2. Execute the following command to terminate the suspended process:
.. code:: bash
# Terminate the most recently suspended process
kill -9 $(jobs -p | tail -1)
CuTe DSL can also be debugged using standard NVIDIA CUDA tools.
Using Compute-Sanitizer
~~~~~~~~~~~~~~~~~~~~~~~
For detecting memory errors and race conditions:
.. code:: bash
compute-sanitizer --some_options python your_dsl_code.py
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.
Conclusion
----------
This page covered several key methods for debugging CuTe DSL programs. Effective debugging typically requires a combination of these approaches.
If you encounter issues with DSL, you can enable logging and share the logs with the CUTLASS team as a GitHub issue to report a bug.

View File

@ -0,0 +1,90 @@
.. _dsl_code_generation:
.. |DC| replace:: dynamic compilation
.. |DSL| replace:: CuTe DSL
.. |IR| replace:: intermediate representation (IR)
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------
1.1 AST rewrite
^^^^^^^^^^^^^^^^
The function's abstract-syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs. Computation inside each region is left untouched at this stage.
*Advantages*
* Sees the entire program, so every branch and loop is preserved.
* Keeps loop structure intact for optimisations such as tiling, vectorisation, or GPU thread mapping.
*Disadvantages*
* Requires a well-defined Python subset that the rewriter understands.
1.2 Tracing
^^^^^^^^^^^
The decorated function is executed once with *proxy* arguments; overloaded
operators record every tensor operation that actually runs and produce a flat
trace that is lowered to |IR|.
*Advantages*
* Near-zero compile latency, ideal for straight-line arithmetic.
* No need to parse Python source, so it supports many of Python's dynamic features.
*Disadvantages*
* Untaken branches vanish, so the generated kernel may be wrong for other
inputs.
* Loops are flattened to the iteration count observed during tracing.
* Data-dependent control-flow freezes to a single execution path.
2. |DSL| Code-Generation Modes
------------------------------
CuTes Python front-end combines the techniques above into **two mutually
exclusive modes**, selectable with the ``preprocessor`` flag of the
``@jit`` decorator:
1. Tracing mode (``@jit(preprocess=False)``): tracing only.
This results in the fastest compilation path and is recommended only for kernels that are guaranteed to be
straight-line arithmetic. It suffers from all tracing limitations listed in the previous section.
2. Preprocessor mode (**default**, ``@jit(preprocess=True)``): **AST rewrite + tracing**.
The AST pass captures every loop and branch, eliminating the correctness and
optimisation problems of pure tracing; tracing then fills in the arithmetic.
This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
specifically to overcome the disadvantages identified above.
.. figure:: dsl_modes.png
:width: 400
:align: center
*Left*: tracing mode records only the path that executed.
*Right*: preprocessor mode emits structured |IR| for every branch and loop
before tracing the arithmetic.
Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **Branch loss**: The untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling**: Loops are flattened to the iteration count observed, destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths**: Control-flow that depends on tensor values freezes to a single execution path at trace time.
The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.

View File

@ -0,0 +1,140 @@
.. _dsl_control_flow:
.. |DC| replace:: dynamic compilation
.. |IR| replace:: intermediate representation (IR)
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
==================
.. contents::
:depth: 2
:local:
Overview
--------
|DSL| walks Pythons AST and converts each control-flow construct it finds into
structured |IR|. You can therefore write ordinary Python loops and branches
while the compiler decides—statement by statement—whether to
* **evaluate at compile time** if the controlling value is a |Constexpr|, or
* **emit intermediate representation (IR)** when the value is dynamic.
For a high-level discussion of the overall pipeline, see
:doc:`the code-generation overview <dsl_code_generation>`.
For Loops
---------
|DSL| recognises three kinds of ranges for ``for`` loops:
* ``range``: the Python built-in
* ``cutlass.range_dynamic``: always lowers to |IR|
* ``cutlass.range_constexpr``: always unrolls at compile time
range(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The AST rewriter inserts a small helper stub. At runtime the loop bounds are
inspected:
* **Constant bounds** → the loop is unrolled at compile time.
* **Dynamic bounds** → the loop is emitted as structured |IR|.
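A small sketch contrasting the two cases:

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    @cute.jit
    def range_example(n: cutlass.Int32):
        # Constant bounds: unrolled at compile time
        for i in range(4):
            cute.printf("static {}", i)
        # Dynamic bound: emitted as a structured loop in the IR
        for j in range(n):
            cute.printf("dynamic {}", j)

    range_example(8)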
cutlass.range_dynamic(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use when you *always* want a loop in the generated |IR|, even if the bounds
look constant.
cutlass.range_constexpr(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Runs in the Python interpreter and is fully unrolled before code generation.
All loop indices must be |Constexpr|.
Limitations of Dynamic For Loops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Early exit via ``break``, ``continue``, or raising an exception is not yet supported.
* Operations in the loop body are traced only when tracing is active in that
region.
**Example:**
.. code-block:: python
@cute.jit
def loop_example():
n = 10
# ❌ This loop is dynamic, early-exit isn't allowed.
for i in cutlass.range_dynamic(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
# ✅ This loop is constexpr, early-exit is allowed.
for i in cutlass.range_constexpr(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
If-Else Statements
------------------
Standard Python ``if``/``else`` is supported.
* **Predicate is Constexpr (compile-time Python value)** → evaluated at compile time.
* **Predicate is dynamic** → lowered to |IR|.
**Example:**
.. code-block:: python
@cute.jit
def main(const_var: cutlass.Constexpr, dynamic_var: cutlass.Int32):
if const_var: # compile-time branch
cute.printf("Const branch\\n")
else:
cute.printf("Const else\\n")
if dynamic_var == 10: # dynamic branch
cute.printf("Dynamic True\\n")
else:
cute.printf("Dynamic False\\n")
Similarly to for-loops, the ``if cutlass.const_expr`` and ``if cutlass.dynamic_expr`` constructs can
be used to force the evaluation at compile-time or the generation of IR, respectively. Unstructured
control flow is only supported when using ``if cutlass.const_expr``.
While Loops
-----------
Python ``while`` loops are always treated as **dynamic** because the loop condition may become
dynamic after the first iteration. Similarly to for-loops and ``if``/``else``, the
``while cutlass.const_expr`` and ``while cutlass.dynamic_expr`` constructs are available.
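A minimal sketch of a dynamic ``while`` loop (assuming that a mutable ``cutlass.Int32`` counter can be initialized and updated as shown):

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    @cute.jit
    def while_example(n: cutlass.Int32):
        i = cutlass.Int32(0)
        # The condition depends on a runtime value, so the loop is lowered to IR
        while i < n:
            cute.printf("{}", i)
            i += 1

    while_example(3)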
Compile-Time Metaprogramming
----------------------------
Mix compile-time constructs with normal |DSL| code to generate specialised
kernels without runtime overhead. A compile-time flag can, for example, toggle
an optional **ReLU** epilogue:
.. code-block:: python
@cute.kernel
def gemm(..., do_relu: cutlass.Constexpr):
# main GEMM work
...
if const_expr(do_relu): # compile-time guard
# ReLU code is emitted only when do_relu is True
...
.. code-block:: text
gemm(..., False) # ReLU is omitted from the generated |IR|
gemm(..., True) # ReLU is included

View File

@ -0,0 +1,198 @@
.. _dsl_dynamic_layout:
.. |DSL| replace:: CuTe DSL
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================
Static Layout
-------------
When integrating with popular deep learning frameworks, one question is how to deal with the layout of the converted ``cute.Tensor``.
For example, when converting a ``torch.Tensor`` to a ``cute.Tensor``, the shape of the ``torch.Tensor`` is honored for the layout of
``cute.Tensor``.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor):
print(f"tensor.layout: {tensor.layout}") # Prints tensor layout at compile time
cute.printf("tensor: {}", tensor) # Prints tensor values at runtime
In this example, we define a JIT function ``foo`` that takes a ``cute.Tensor`` as input and prints its layout. Note
that Python print is used to print the layout at compile time. This works fine for |SLAY| whose value is known at
compile time.
Now let's try to run the JIT function ``foo`` with different shapes of the input ``torch.Tensor``.
.. code-block:: python
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
a_pack = from_dlpack(a)
compiled_func = cute.compile(foo, a_pack)
compiled_func(a_pack)
Here we first convert a 1D ``torch.Tensor`` with 3 elements to a ``cute.Tensor`` using ``from_dlpack``. Then we compile
the JIT function ``foo`` with the converted ``cute.Tensor`` and call the compiled function.
::
tensor.layout: (3):(1)
tensor: raw_ptr(0x00000000079e5100: i16, generic, align<2>) o (3):(1) =
( 1, 2, 3 )
It prints ``(3):(1)`` for the layout because the converted ``cute.Tensor`` has a |SLAY| with shape ``(3)`` which
is the shape of the ``a``.
Now if we call the compiled function with an input ``torch.Tensor`` of a different shape, we get an unexpected result
at runtime due to the type mismatch: ``compiled_func`` expects a ``cute.Tensor`` with layout ``(3):(1)``,
while ``b`` has shape ``(5)``.
.. code-block:: python
b = torch.tensor([11, 12, 13, 14, 15], dtype=torch.uint16)
b_pack = from_dlpack(b)
compiled_func(b_pack) # ❌ This results in an unexpected result at runtime due to type mismatch
The following output is unexpected due to the type mismatch.
::
tensor: raw_ptr(0x00000000344804c0: i16, generic, align<2>) o (3):(1) =
( 11, 12, 13 )
To fix that, we would have to trigger another code generation and compilation for the new shape for ``b``.
.. code-block:: python
compiled_func_2 = cute.compile(foo, b_pack) # This would trigger another compilation
compiled_func_2(b_pack) # ✅ Now this works fine
As shown in the example above, with the newly compiled ``compiled_func_2``, we can pass in ``b_pack`` to the compiled
JIT function ``compiled_func_2``.
::
tensor.layout: (5):(1)
tensor: raw_ptr(0x0000000034bb2840: i16, generic, align<2>) o (5):(1) =
( 11, 12, 13, 14, 15 )
Now it recompiles and prints the values of ``b`` correctly.
It's obvious that distinct code needs to be generated and compiled for each static layout: in this case, one for layout
``(3):(1)`` and another for layout ``(5):(1)``.
Dynamic Layout
--------------
In order to avoid generating and compiling multiple times for different shapes of the input ``torch.Tensor``, |DSL| provides a way to
generate and compile JIT function with |DLAY|.
To get a dynamic layout for the ``cute.Tensor``, a ``torch.Tensor`` object can be passed into the JIT function directly, which instructs
|DSL| to call ``cute.mark_layout_dynamic`` automatically on the converted ``cute.Tensor`` based on the leading dimension of the layout.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor):
print(tensor.layout) # Prints (?,?):(?,1) for dynamic layout
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.uint16)
compiled_func = cute.compile(foo, a)
compiled_func(a)
b = torch.tensor([[11, 12], [13, 14], [15, 16]], dtype=torch.uint16)
compiled_func(b) # Reuse the same compiled function for different shape
In the example above, a single compilation of the JIT function ``foo`` is reused for different shapes of the input ``torch.Tensor``.
This is possible because the converted ``cute.Tensor`` has a |DLAY| ``(?,?):(?,1)`` which is compatible with the shape of the
input ``torch.Tensor`` of both calls.
Alternatively, for compact layouts, ``cute.mark_compact_shape_dynamic`` can be called for finer-grained control, specifying which mode
of the layout is dynamic and the divisibility constraint for that dynamic dimension.
Refer to :doc:`framework_integration` for more details on ``from_dlpack``, ``mark_layout_dynamic``,
and ``mark_compact_shape_dynamic``.
Static Layout vs. Dynamic Layout
--------------------------------
Per the previous sections, we have seen that |SLAY| leads to distinct JIT code generations while |DLAY| leads to a single
compilation for different shapes.
That said, creating a JIT function with a |SLAY| is useful when the use case targets input data with fixed shapes.
Since more information is available at compile time, the compiler can apply optimizations that would not be possible
for code generated for a |DLAY|.
On the other hand, a |DLAY| is more flexible when the input data has varying shapes, allowing a single piece of
generated code to handle inputs of different shapes.
Programming with Static and Dynamic Layout
------------------------------------------
|DSL| provides an intuitive way to program with both static and dynamic layouts in your code.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor, x: cutlass.Constexpr[int]):
print(cute.size(tensor)) # Prints 3 for the 1st call
# Prints ? for the 2nd call
if cute.size(tensor) > x:
cute.printf("tensor[2]: {}", tensor[2])
else:
cute.printf("tensor size <= {}", x)
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
foo(from_dlpack(a), 3) # First call with static layout
b = torch.tensor([1, 2, 3, 4, 5], dtype=torch.uint16)
foo(b, 3) # Second call with dynamic layout
In this example, the JIT function ``foo`` is compiled with a |SLAY| ``(3):(1)`` for the first call, which means the
size of the tensor is known at compile time. |DSL| takes advantage of this and resolves the if condition at
compile time, so the generated code is efficient and contains no branch at all.
For the second call, the JIT function ``foo`` is compiled with a |DLAY| ``(?):(1)``, so the tensor size is only
known at runtime. |DSL| automatically generates code that handles the |DLAY| and evaluates the if condition at runtime.
The same applies to loop as well:
.. code-block:: python
@cute.jit
def foo(tensor, x: cutlass.Constexpr[int]):
for i in range(cute.size(tensor)):
cute.printf("tensor[{}]: {}", i, tensor[i])
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
foo(from_dlpack(a), 3) # First call with static layout
b = torch.tensor([1, 2, 3, 4, 5], dtype=torch.uint16)
foo(b, 3) # Second call with dynamic layout
With the static layout in the first call, |DSL| is able to fully unroll the loop at compile time, while in the second call
the generated code executes the loop at runtime based on the |DLAY|.
With a single JIT function implementation, |DSL| handles control-flow constructs and automatically generates
optimized code for each case. This is possible because |DSL| walks the Python AST and converts
each control-flow construct it finds accordingly.
Please refer to :doc:`dsl_control_flow` for more details.

View File

@ -0,0 +1,128 @@
.. _dsl_introduction:
.. |DC| replace:: dynamic compilation
.. |IR| replace:: IR
.. |DSL| replace:: CuTe DSL
|DSL|
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------
|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of numeric and GPU-oriented code. Its primary goals are:
- **Consistent with CuTe C++**, allowing users to express GPU kernels with full control of the hardware.
- **JIT compilation** for both host and GPU execution.
- `DLPack <https://github.com/dmlc/dlpack>`_ **integration**, enabling seamless interop with frameworks (e.g., PyTorch, JAX).
- **JIT caching**, so that repeated calls to the same function benefit from cached |IR| modules.
- **Native types and type inference** to reduce boilerplate and improve performance.
- **Optional lower-level control**, offering direct access to GPU backends or specialized |IR| dialects.
Decorators
----------
|DSL| provides two main Python decorators for generating optimized code via |DC|:
1. ``@jit`` — Host-side JIT-compiled functions
2. ``@kernel`` — GPU kernel functions
Both decorators can optionally use a **preprocessor** that automatically expands Python control flow (loops, conditionals) into operations consumable by the underlying |IR|.
``@jit``
~~~~~~~~~~~~~
Declares JIT-compiled functions that can be invoked from Python or from other |DSL| functions.
**Decorator Parameters**:
* ``preprocessor``:
* ``True`` (default) — Automatically translate Python flow control (e.g., loops, if-statements) into |IR| operations.
* ``False`` — No automatic expansion; Python flow control must be handled manually or avoided.
**Call-site Parameters**:
- ``no_cache``:
- ``True`` — Disables JIT caching, forcing a fresh compilation each call.
- ``False`` (default) — Enables caching for faster subsequent calls.
``@kernel``
~~~~~~~~~~~~~~~~
Defines GPU kernel functions, compiled as specialized GPU symbols through |DC|.
**Decorator Parameters**:
- ``preprocessor``:
- ``True`` (default) — Automatically expands Python loops/ifs into GPU-compatible |IR| operations.
- ``False`` — Expects manual or simplified kernel implementations.
**Kernel Launch Parameters**:
- ``grid``
Specifies the grid size as a list of integers.
- ``block``
Specifies the block size as a list of integers.
- ``cluster``
Specifies the cluster size as a list of integers.
- ``smem``
Specifies the size of shared memory in bytes (integer).
Calling Conventions
-------------------
.. list-table::
   :header-rows: 1
   :widths: 20 20 15 25

   * - **Caller**
     - **Callee**
     - **Allowed**
     - **Compilation/Runtime**
   * - Python function
     - ``@jit``
     - Yes
     - DSL runtime
   * - Python function
     - ``@kernel``
     - No
     - N/A (error raised)
   * - ``@jit``
     - ``@jit``
     - Yes
     - Compile-time call, inlined
   * - ``@jit``
     - Python function
     - Yes
     - Compile-time call, inlined
   * - ``@jit``
     - ``@kernel``
     - Yes
     - Dynamic call via GPU driver or runtime
   * - ``@kernel``
     - ``@jit``
     - Yes
     - Compile-time call, inlined
   * - ``@kernel``
     - Python function
     - Yes
     - Compile-time call, inlined
   * - ``@kernel``
     - ``@kernel``
     - No
     - N/A (error raised)

View File

@ -0,0 +1,196 @@
.. _dsl_jit_arg_generation:
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
When using the ``@jit`` or ``@kernel`` decorators to define a JIT-compiled function, the arguments to the function are traced to determine the JIT function's signature.
|DSL| provides a Pythonic way to write the arguments of a JIT function as one normally would in Python, and |DSL| takes care of the rest for you.
Specifically, |DSL| honors following when generating the JIT function's arguments:
- JIT function arguments are assumed to be **dynamic arguments** by default.
- If an argument is explicitly type annotated with ``cutlass.Constexpr``, it is treated as a **compile-time constant**.
- If type annotation is provided, |DSL| validates the argument type at compile time for **type safety**.
- |DSL| provides **runtime checkable protocols** (``JitArgument`` and ``DynamicExpression``) for generating JIT function arguments for |CUSTOM_TYPES|.
More details below for each of the above.
Static argument vs. Dynamic argument
------------------------------------
|DSL| supports both static and dynamic arguments for JIT functions.
1. **Static arguments** hold values that are known at compile time. They are not included in the generated JIT function signature.
2. **Dynamic arguments** hold values that are only known at runtime.
By default, |DSL| assumes dynamic arguments and tries to infer the argument types from the call-site argument types. An explicit type annotation ``cutlass.Constexpr`` can be used to specify a static argument.
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2
cute.printf("y: {}", y) # Prints y: 2
foo(2, 2)
In the example above, ``x`` is a dynamic argument with type ``cutlass.Int32`` and ``y`` is a static argument.
With the ``cutlass.Constexpr`` annotation, a more sophisticated use case of a static argument in a JIT function looks like the following:
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.kernel
def kernel(
self,
tiled_mma: cute.TiledMma,
tma_atom_a: cute.CopyAtom,
mA_mkl: cute.Tensor,
tma_atom_b: cute.CopyAtom,
mB_nkl: cute.Tensor,
tma_atom_c: Optional[cute.CopyAtom],
mC_mnl: cute.Tensor,
cluster_layout_vmnk: cute.Layout,
a_smem_layout_staged: cute.ComposedLayout,
b_smem_layout_staged: cute.ComposedLayout,
c_smem_layout_staged: Union[cute.Layout, cute.ComposedLayout, None],
epi_tile: cute.Tile,
epilogue_op: cutlass.Constexpr,
):
...
# Perform epilogue op on accumulator and convert to C type
acc_vec = tTR_rAcc.load()
acc_vec = epilogue_op(acc_vec.to(self.c_dtype))
tTR_rC.store(acc_vec)
In this example, ``epilogue_op`` is a static argument of the JIT kernel that is used for epilogue fusion. When calling the kernel,
an elementwise lambda function can be passed in as the ``epilogue_op`` argument. For example, a ReLU can be applied as the epilogue fusion by simply setting
``epilogue_op`` to ``lambda x: cute.where(x > 0, x, cute.full_like(x, 0))``.
Refer to the `Blackwell dense GEMM example <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py>`__ for a complete example.
Type safety
-----------
|DSL| makes good use of type annotation in JIT function signature and validates the JIT function argument types at compile time for **type safety**.
.. code-block:: python
import cutlass
import cutlass.cute as cute
import numpy as np
@cute.jit
def foo(x: cute.Tensor, y: cutlass.Float16):
...
a = np.random.randn(10, 10).astype(np.float16)
b = 32
foo(a, b)
foo(b, a) # This will fail at compile time due to type mismatch
The type safety check catches type mismatches early at compile time with a clear error message, avoiding tricky runtime errors that are usually much more expensive to debug.
In the example above, the second call to ``foo`` fails at compile time with a clear error message:
::
cutlass.base_dsl.common.DSLRuntimeError: DSLRuntimeError: expects argument #1 (a) to be <class 'cutlass.cute.typing.Tensor'>, but got <class 'int'>
JIT function arguments with |CUSTOM_TYPES|
--------------------------------------------
|DSL| supports |CUSTOM_TYPES| for JIT function arguments by providing two runtime checkable protocols:
* ``JitArgument`` which is used for host JIT functions to be called from Python.
- ``__c_pointers__``: Generate a list of ctypes pointers for the current object.
- ``__get_mlir_types__``: Generate a list of MLIR types for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
* ``DynamicExpression`` which is used for device JIT functions to be called from the host JIT functions.
- ``__extract_mlir_values__``: Generate a dynamic expression for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
Refer to `typing.py <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL/base_dsl/typing.py>`__ for more details on these protocol APIs.
Depending on the nature of the |CUSTOM_TYPES|, |DSL| provides easy ways to adopt them for JIT function arguments.
1. Direct protocol implementation in |CUSTOM_TYPES|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One way is to implement the protocol methods directly in the |CUSTOM_TYPES| to enable the protocol based JIT function argument generation.
.. code-block:: python
import cutlass
import cutlass.cute as cute
# Customized type that implements the DynamicExpression protocol
class MyDynamicExpression:
def __init__(self, tensor, offset):
self._tensor = tensor # Dynamic argument
self._offset = offset # Dynamic argument
def __extract_mlir_values__(self):
return [self._tensor.__extract_mlir_values__(), self._offset.__extract_mlir_values__()]
def __new_from_mlir_values__(self, values):
return MyDynamicExpression(values[0], values[1])
@cute.kernel
def my_kernel(x: MyDynamicExpression):
...
In the example above, the ``MyDynamicExpression`` implements the ``DynamicExpression`` protocol and |DSL| will generate the JIT function arguments for the JIT kernel ``my_kernel`` based on the protocol methods.
2. Adaptor based protocol implementation for |CUSTOM_TYPES|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For cases where directly changing the |CUSTOM_TYPES| to implement the protocol is not feasible, |DSL| provides an adaptor-based approach to adapt the |CUSTOM_TYPES| for JIT function argument generation.
The JIT function argument adaptor is a callable object that implements the desired protocol methods for the registered |CUSTOM_TYPES|. This way, |DSL| automatically queries the JIT argument adaptor registry
to generate the JIT function arguments for the given |CUSTOM_TYPES|.
.. code-block:: python
@cutlass.register_jit_arg_adapter(MyFrameworkObject)
class MyFrameworkObjectAdapter:
"""
Convert a 3rd party framework object to a JIT function argument with JitArgument protocol
"""
def __init__(self, arg):
self._arg = arg
def __c_pointers__(self):
# Convert the framework object to a C-ABI compatible object
# thru its C-ABI interface
return [self._arg.get_cabi_pointer()]
def __get_mlir_types__(self):
# Return the list of MLIR types the framework object represents
return [self._arg.get_data().mlir_type]
def __new_from_mlir_values__(self, values):
# Convert the MLIR values back to the framework object
return MyFrameworkObject(values[0])
In this example, ``MyFrameworkObjectAdapter`` is an adaptor class that bridges |DSL| and the third-party framework type ``MyFrameworkObject``.
Registration is done by simply decorating the adaptor with ``cutlass.register_jit_arg_adapter`` for the customized type. With the adaptor registered,
|DSL| automatically uses it to generate the JIT function arguments for ``MyFrameworkObject``-typed arguments.

View File

@ -0,0 +1,152 @@
.. _dsl_jit_caching:
.. |DSL| replace:: CuTe DSL
.. _JIT_Caching:
|DSL| JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------
Zero Compile is a feature that enables explicit kernel compilation on demand through ``cute.compile``.
When ``cute.compile`` is called, it compiles the kernel and returns a JIT Executor instance.
This JIT Executor instance can be cached and reused directly for subsequent executions without compiling the kernel again.
The JIT Executor is a component that independently executes compiled code.
It can be created either through ``cute.compile`` or implicit compilation.
The JIT Executor instance behaves like a callable object to execute the compiled code.
Each JIT Executor instance maintains a single compiled host function.
It encompasses all necessary execution components:
* Host function pointer and its MLIR execution engine
* CUDA modules (optional)
* Argument specifications defining how Python arguments are converted to C ABI-compatible types. Note that arguments with the ``cutlass.Constexpr`` hint are excluded from argument specifications since they are evaluated at compile time rather than runtime.
For example, in the following code, ``print_result`` is a ``cutlass.Constexpr`` value that is **NOT** evaluated at runtime:
.. code-block:: python
import cutlass.cute as cute
@cute.jit
def add(a, b, print_result: cutlass.Constexpr):
if print_result:
cute.printf("Result: %d\n", a + b)
return a + b
jit_executor = cute.compile(add, 1, 2, True)
jit_executor(1, 2) # output: ``Result: 3``
The JIT Executor ensures all components are properly initialized and loaded after compilation.
For example, all CUDA modules are loaded (via ``cuModuleLoad``) and kernel function pointers are extracted (via ``cuModuleGetFunction``).
When calling a JIT Executor instance, it:
* Parses Python runtime arguments and converts them to C ABI-compatible types according to argument specifications
* Invokes the host function with the converted arguments
Custom Caching with ``cute.compile``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``cute.compile`` bypasses caching in |DSL| and always performs compilation, returning a fixed JIT Executor instance.
This allows implementing custom caching strategies as shown below:
.. code-block:: python
@cute.jit
def add(b):
return a + b
# Define a custom cache
custom_cache = {}
a = 1
compiled_add_1 = cute.compile(add, 2)
custom_cache[1] = compiled_add_1
compiled_add_1(2) # result = 3
a = 2
compiled_add_2 = cute.compile(add, 2)
custom_cache[2] = compiled_add_2
compiled_add_2(2) # result = 4
# Use the custom cache
custom_cache[1](2) # result = 3
custom_cache[2](2) # result = 4
Cache in |DSL|
-----------------
By default, caching in |DSL| is implicitly enabled to avoid recompilation when kernels are called repeatedly without changes.
The cache is implemented as a map storing compiled JIT Executor instances within |DSL|.
The cache key combines hashes of:
* MLIR bytecode of the MLIR program generated by |DSL|
* All |DSL| Python source files
* All |DSL| shared libraries
* All |DSL| environment variables
The cache value is a compiled JIT Executor instance.
On a cache hit, compilation is skipped and the cached JIT Executor instance is reused.
On a cache miss, the kernel is compiled and the new JIT Executor instance is stored in the cache.
Here is an example demonstrating automatic caching of the ``add`` kernel:
.. code-block:: python
# Global variable
a = 1
@cute.jit
def add(b):
return a + b
# Cache is empty at beginning
# First call: cache miss triggers compilation
result = add(2) # result = 3
# Cache now has one instance
# Second call: cache hit reuses cached JIT Executor
result = add(2) # result = 3
a = 2
# Third call: cache miss due to changed IR code triggers recompilation
result = add(2) # result = 4
# Cache now has two instances
The cache can be serialized to files for subsequent runs.
After serialization, the compiled MLIR bytecode is stored in files.
The cache directory is ``/tmp/{current_user}/cutlass_python_cache``.
The cache loads from files into memory during |DSL| initialization and saves back to files when the process exits.
The following environment variables control file caching:
.. code-block:: bash
# Disable file caching while keeping in-memory cache available, defaults to False.
export CUTE_DSL_DISABLE_FILE_CACHING=True
# Maximum number of cache files allowed, defaults to 1000.
export CUTE_DSL_FILE_CACHING_CAPACITY=1000
Limitations
~~~~~~~~~~~~~~~~~~~~~
The intention of caching is to reduce the host launch overhead before each execution. As the above example shows,
the consistency between the original Python code and the MLIR program is hard to maintain because of the impact of dynamic factors such as global variables.
Therefore, the MLIR program **MUST** always be generated to verify that the kernel content matches what was previously built.
For optimal host launch latency, we recommend using the custom caching method described above with ``cute.compile``.

View File

@ -0,0 +1,412 @@
.. _framework_integration:
.. |DSL| replace:: CuTe DSL
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
user, and provides example code snippets for common usage patterns.
Implicit Conversion
-------------------
Tensors originating from frameworks supporting the DLPack protocol can be directly provided to a
JIT function as a regular parameter. |DSL|'s runtime implicitly converts the original tensor to a
CuTe tensor with a fully dynamic layout except for the stride element corresponding to the leading
dimension. The example below demonstrates this use case.
.. code-block:: python
import torch
import cutlass.cute as cute
@cute.jit
def foo(src):
"""
The following lines print
ptr<f32, generic> o (?,?,?):(?,?,1)
<class 'cutlass.cute.core._Tensor'>
"""
print(src)
print(type(src))
a = torch.randn(30, 20, 32, device="cpu")
foo(a)
Explicit conversion using ``from_dlpack``
------------------------------------------
|DSL|'s runtime provides an interface for converting DLPack-compatible tensors to CuTe tensors,
.. code-block:: python
b = cute.runtime.from_dlpack(a)
where ``a`` is a tensor supporting the DLPack protocol with the ``__dlpack__``
and ``__dlpack_device__`` methods. The resulting CuTe tensor ``b`` has a fully static layout. This
conversion is performed without copying any tensor data, enabling seamless integration with major
frameworks. Users can create tensors using NumPy, PyTorch, etc. and directly feed them into JIT
functions written using |DSL|.
The resulting CuTe tensor shares the same underlying memory buffer as the original tensor. This
zero-copy approach maximizes performance by eliminating unnecessary data duplication. However, it is
important to note that the CuTe tensor's validity is tied to the lifetime of the original tensor. If
the source tensor is destroyed or goes out of scope, the corresponding CuTe tensor becomes invalid
since it references the original memory location.
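A minimal sketch of this lifetime caveat (illustrative only; the helper name is made up):
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
def make_cute_tensor():
    a = torch.randn(16, 16)
    return from_dlpack(a)  # `a` is the sole owner of the underlying storage
t = make_cute_tensor()
# `a` has gone out of scope, so `t` may now reference freed memory;
# keep the source tensor alive for as long as the CuTe tensor is used.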
The full signature of ``from_dlpack`` is as follows:
.. code-block:: python
def from_dlpack(tensor, assumed_align=None):
The ``assumed_align`` integer parameter specifies the alignment of the tensor in units of bytes.
The tensor's base address must be divisible by ``assumed_align``. When not provided explicitly,
the alignment is set to the natural alignment of the tensor's element type. Note that the alignment
information is part of the pointer type in the generated IR. Therefore, programs with different
alignments have different IR, and identical IR is required to hit the kernel caching
mechanism of |DSL|.
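For example, assuming the buffer allocated by PyTorch happens to be 16-byte aligned, a larger alignment than the natural one can be requested as follows (illustrative sketch):
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
# Assume the base address is 16-byte aligned instead of the natural 4-byte
# alignment of Float32; the base address must actually satisfy this.
y = from_dlpack(x, assumed_align=16)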
Code Example
~~~~~~~~~~~~
The following code demonstrates how to convert a PyTorch tensor to a CuTe tensor using the
``from_dlpack`` function with default parameters.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
y = from_dlpack(x)
Once converted, we can access the tensor's information through various
attributes. The following list shows the attributes of the converted tensor:
- ``tensor.shape``: the tensor's shape
- ``tensor.stride``: the tensor's stride
- ``tensor.memspace``: the tensor's memory space
- ``tensor.element_type``: the tensor's element data type
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
y = from_dlpack(x)
print(y.shape) # (30, 20)
print(y.stride) # (20, 1)
print(y.memspace) # generic (if the torch tensor is on device memory, memspace will be gmem)
print(y.element_type) # Float32
print(y) # Tensor<0x000000000875f580@generic o (30, 20):(20, 1)>
The string format of the resulting CuTe tensor is
.. code-block::
Tensor<0x{tensor.data_ptr:016x}@{tensor.memspace} o {tensor.shape}:{tensor.stride}>
As can be seen in the example above, ``from_dlpack`` first results in a tensor with a static layout.
To obtain dynamic or mixed static/dynamic layouts after calling ``from_dlpack``, the
``mark_layout_dynamic`` and ``mark_compact_shape_dynamic`` functions are used and described in
the following sections.
When to Use Explicit Conversion?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The DLPack protocol is a widely used protocol for interoperability between different frameworks.
However, there is some associated overhead. Based on our benchmark, it usually takes between 2 and 3
us per call to ``from_dlpack``.
Explicit conversion allows for caching the converted CuTe tensors in order to avoid the overhead of
repeated calls to ``from_dlpack``.
.. code-block:: python
cached_tensors = {}  # user-managed cache of converted CuTe tensors
key = "x"            # any hashable key identifying this tensor
x = torch.randn(30, 20, device="cpu")
if key not in cached_tensors:
# Do the conversion only for cache misses
cached_tensors[key] = cute.runtime.from_dlpack(x)
foo(cached_tensors[key])
Another use case for explicit conversion is to gain fine-grain control over which modes of a tensor
are considered dynamic from the perspective of the generated program.
Mark the Tensor's Layout as Dynamic with ``mark_layout_dynamic``
----------------------------------------------------------------
After calling this function, all shape modes become dynamic. The stride modes also become dynamic
with the following two exceptions:
1. the leading dimension's stride remains fixed at 1;
2. stride elements equal to 0 (which indicates broadcasting) are retained.
The full signature of ``mark_layout_dynamic`` is as follows:
.. code-block:: python
def mark_layout_dynamic(self, leading_dim: int|None = None):
The ``leading_dim`` parameter specifies the leading dimension of the tensor. The leading dimension's
stride is set to 1 unless inconsistent with the layout of the DLPack tensor. For example,
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, if ``leading_dim`` is specified to be 1,
the layout will be marked as ``(?,?,?,?):(?,1,?,?)``.
- If ``leading_dim`` is specified to be 0, an error is raised because the stride of
dimension 0 is 2 (not 1).
The default value for ``leading_dim`` is ``None``. In such case, the system
automatically deduces it from the tensor's layout using the following logic:
1. If a dimension's stride is 1, that dimension is marked as the leading dimension.
2. If multiple dimensions satisfy condition 1, an error is thrown indicating deduction failure.
Note that after converting a **PyTorch** tensor to the DLPack format, the strides of dimensions
with size 1 are canonicalized to 1. This canonicalization can increase the likelihood of
deduction failures. This behavior is specific to PyTorch and does not occur with NumPy, for
example.
3. If no dimension satisfies condition 1, all strides are marked as dynamic.
For example:
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, the leading dimension is 1.
The layout will be marked as ``(?,?,?,?):(?,1,?,?)``.
- For a tensor with layout ``(1,5,1):(1,1,1)``, if ``leading_dim`` is not specified,
a deduction failure error is raised.
- For a tensor with layout ``(2,2):(8,2)``, since no dimension has stride 1,
all dimensions are marked as dynamic: ``(?,?):(?,?)``.
Code Example
~~~~~~~~~~~~
The following example demonstrates how to use ``mark_layout_dynamic`` to specify dynamic tensor layouts.
* ``t0`` shows the usage of ``mark_layout_dynamic`` with an unspecified ``leading_dim`` and the automatic deduction of the leading dimension.
* ``t1`` & ``t2`` show the usage of ``mark_layout_dynamic`` with a specified ``leading_dim``.
* ``t3`` shows the usage of ``mark_layout_dynamic`` with no leading dimension.
* ``t4`` shows the usage of ``mark_layout_dynamic`` with broadcasted dimensions.
* ``t5`` demonstrates the deduction failure when more than one dimension has a stride equal to 1.
* ``t6`` & ``t7`` demonstrate incorrect settings for ``leading_dim`` and the expected errors.
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1 its stride is canonicalized to 1,
# resulting in (1,4,1,32,1):(1,1,1,4,1)
b = torch.empty(32, 1, 1, 1, 4).permute(3, 4, 1, 0, 2)
# (2,2):(8,2)
c = torch.empty(3, 4)[::2, ::2]
# (3,1,1,5):(5,0,0,1)
d = torch.empty(3, 1, 1, 5).expand(3, 4, 2, 5)
# auto deduce the leading dimension to be 3
t0 = from_dlpack(a).mark_layout_dynamic()
print(t0)
# (?,?,?,?):(?,?,?,1)
t1 = from_dlpack(b).mark_layout_dynamic(leading_dim=0)
print(t1)
# (?,?,?,?,?):(1,?,?,?,?)
t2 = from_dlpack(b).mark_layout_dynamic(leading_dim=2)
print(t2)
# (?,?,?,?,?):(?,?,1,?,?)
t3 = from_dlpack(c).mark_layout_dynamic()
print(t3)
# (?,?):(?,?)
t4 = from_dlpack(d).mark_layout_dynamic()
print(t4)
# (?,?,?,?):(?,0,0,1)
t5 = from_dlpack(b).mark_layout_dynamic()
# Can't deduce the leading dimension from layout, please specify the leading_dim explicitly.
t6 = from_dlpack(a).mark_layout_dynamic(leading_dim=1)
# Expected strides[leading_dim] == 1, but got 16
t7 = from_dlpack(b).mark_layout_dynamic(leading_dim=3)
# Expected strides[leading_dim] == 1, but got 4
Mark the Tensor's Layout as Dynamic with ``mark_compact_shape_dynamic``
-----------------------------------------------------------------------
The ``mark_compact_shape_dynamic`` function provides fine-grain control over dynamic shapes for compact
layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
.. code-block:: python
def mark_compact_shape_dynamic(self, mode: int, stride_order: tuple[int, ...]|None = None, divisibility: int = 1):
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their strides are canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
modes (dimensions) if the current layout were to be converted to row-major order. It starts from the
outermost to the innermost dimension when reading it from left to right. This parameter must be
explicitly set when the stride order cannot be automatically deduced from the tensor's layout, such
as when multiple dimensions have a stride of 1.
For example:
- Layout ``(4,2):(1,4)`` has a ``stride_order`` of ``(1,0)``, indicating that the innermost dimension is
0 (``4:1``) and the outermost dimension is 1 (``2:4``).
- Layout ``(5,3,2,4):(3,1,15,30)`` has a ``stride_order`` of ``(3,2,0,1)``, indicating that the innermost
dimension is 1 (``3:1``) and the outermost dimension is 3 (``4:30``).
If ``stride_order`` is not specified, the system automatically deduces it from the tensor's layout
using the following logic:
1. Sort the strides in descending order.
2. If multiple dimensions have a stride of 1, a deduction failure error is raised.
For example:
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, the deduced ``stride_order`` is ``[3,2,0,1]``.
- For a tensor with layout ``(1,5,1):(1,1,1)``, ``stride_order``'s deduction fails because
all dimensions have an identical stride of 1, making it impossible to determine the correct ordering.
If ``stride_order`` is specified, the system validates that the order is consistent with the
tensor's layout.
The ``divisibility`` parameter specifies the divisibility of the dynamic shape. It can be used to
represent the assumed alignment of the input. It defaults to 1.
Note that this API is only available for compact tensors. For non-compact tensors, we can use
``cute.assume`` to attach divisibility information to a specific shape mode in a host JIT function,
as demonstrated in the following example:
.. code-block:: python
@cute.jit
def foo(a: cute.Tensor):
new_shape = a.shape
# use cute.assume to set shape of mode=0 with divisibility=16
new_shape[0] = cute.assume(new_shape[0], 16)
new_layout = cute.make_layout(new_shape, stride=a.stride)
new_a = cute.make_tensor(a.iterator, new_layout)
Code Example
~~~~~~~~~~~~
The following example demonstrates how to use ``mark_compact_shape_dynamic`` to specify dynamic tensor layouts.
* ``t0`` & ``t1`` show the usage of ``mark_compact_shape_dynamic`` with unspecified ``stride_order`` and different ``mode`` and ``divisibility``.
* ``t2`` shows the usage of consecutive ``mark_compact_shape_dynamic`` with unspecified ``stride_order`` and different ``mode`` and ``divisibility``.
* ``t3`` & ``t4`` show the usage of ``mark_compact_shape_dynamic`` with different specified ``stride_order``.
* ``t5``, ``t6``, ``t7``, ``t8``, ``t9``, ``t10``, ``t11``, and ``t12`` demonstrate incorrect settings for parameters and expected errors.
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1 its stride is canonicalized to 1,
# resulting in (1,4,1,32,1):(1,1,1,4,1)
# b.dim_order() is (3,2,4,0,1)
b = torch.empty(32, 1, 1, 1, 4).permute(3, 4, 1, 0, 2)
# auto deduce the stride order to be [2,1,0,3]
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
t2 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)
t5 = t2.mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3)
)
# The stride_order is not consistent with the last stride_order
t6 = from_dlpack(a).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3)
)
# The stride_order is not consistent with the deduced stride_order
t7 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=4
)
# The layout could not be deduced, please specify the stride_order explicitly
t8 = from_dlpack(b).mark_compact_shape_dynamic(
mode=30, divisibility=5, stride_order=(3, 0, 2, 4, 1)
)
# Expected mode value to be in range [0, 5), but got 30
t9 = from_dlpack(b).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(2, 1, 2, 3, 4)
)
# Expected stride_order to contain all the dimensions of the tensor, but it doesn't contain 0.
t10 = from_dlpack(b).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3, 4, 5)
)
# Expected stride_order to have 5 elements, but got 6.
t11 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=4, stride_order=b.dim_order()
)
# The shape(1) of mode(0) is not divisible by the divisibility(4)
t12 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=1, stride_order=(2, 1, 3, 0, 4)
)
# The stride_order is not consistent with the layout

View File

@ -0,0 +1,16 @@
.. _notebooks:
Educational Notebooks
=====================
A number of notebooks for educational purposes are provided in the `CUTLASS GitHub repository <https://github.com/NVIDIA/cutlass>`__.
A list of links is given below:
- `"Hello world" <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/hello_world.ipynb>`__
- `Printing <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/print.ipynb>`__
- `Data Types Basics <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/data_types.ipynb>`__
- `Tensors <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensor.ipynb>`__
- `The TensorSSA Abstraction <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensorssa.ipynb>`__
- `Layout Algebra <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/cute_layout_algebra.ipynb>`__
- `Element-wise Add Tutorial <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/elementwise_add.ipynb>`__
- `Using CUDA Graphs <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/cuda_graphs.ipynb>`__

View File

@ -0,0 +1,137 @@
.. _faqs:
FAQs
====
General
---------------------
**Are the DSLs replacing C++ templates?**
TL;DR: No - but also yes. The CUTLASS 4.0 release (CuTe DSL), along with all
future extensions to our Python-native programming models, does not come at the
expense of CUTLASS C++. CUTLASS 2.x and 3.x C++ APIs are both going to continue
receiving fixes and updates for the architectures we support them for. However,
CUTLASS 4.x CuTe DSL is fully isomorphic in its programming model and performance
with CuTe C++ for Blackwell, and it is our hope that the community embraces it
for much easier, yet equally performant, custom kernel development. This is
why we are releasing CuTe DSL with support for all architectures starting with the
NVIDIA Ampere Architecture.
**What is the difference between CuTe DSL, CUTLASS Python, and CUTLASS DSLs?**
CUTLASS Python was the Python interface for instantiating C++ kernels via a Python
frontend. This is now deprecated with the release of CUTLASS 4.0. CUTLASS DSLs are
a family of Python DSLs for native device programming in Python. Currently, this is
limited to our initial release of CuTe DSL, but future versions will include higher-level
abstractions that gradually trade off control for convenience.
**What should I learn, CUTLASS C++ or the Python DSLs?**
We believe the Python DSLs will significantly improve the learning curve and recommend starting
with them for all newcomers, as they eliminate the inherent complexity of learning C++
metaprogramming for GPU kernel programming. Since CuTe C++ and CuTe DSL share fully isomorphic
programming models and patterns, any knowledge gained can eventually be applied to C++.
**Where will the code live? PIP wheel or GitHub repo? Do I have to build it myself?**
This is a major change compared to CUTLASS C++ and Python DSLs. Going forward,
the GitHub code only exists as a way for users to file issues and pull requests against.
While it can be used with the pip wheel, we do not recommend most users do so unless they are
hacking on the DSL itself. For all other users, we recommend they
simply ``pip install nvidia-cutlass-dsl`` and use the pip wheel as the single source
of truth for the dialect compiler and DSL implementation. The CUTLASS GitHub repository will
contain a ``requirements.txt`` file pinning the version of the wheel consistent with the state
of the OSS repository (please see :doc:`quick_start`). This means getting started with
CUTLASS is easier than ever: no more CMake command lines to learn and no more builds to kick
off. Simply install the pip wheel and start running the examples.
Migration
---------------------
**Should I port my code from C++ templates to Python?**
Almost certainly not, unless you need extremely fast JIT times for your kernel and C++ compile times
are a blocker for you. The 2.x and 3.x APIs will continue to be supported, and the 3.x API for NVIDIA's
Hopper and Blackwell architectures will continue to improve in terms of features
and performance.
**Are portability promises different with Python?**
For the initial release while the DSL is still in beta, we do not promise any portability
as we may make changes to the DSL itself. While we do not expect any changes to the CuTe operations,
the DSL utilities, decorators, helper classes like pipelines and schedulers may change as we refine them
with community feedback. We encourage users to file issues and discussions on GitHub during this
beta period with their feedback!
In the long term, we plan to continue to treat the OSS community with care.
Just like the prior history of CUTLASS, we plan not to break users unless necessary,
but we reserve the right to make limited breaking changes in case we believe it is a
net benefit to the community and project. These will be announced ahead of time and/or
clearly highlighted in the CHANGELOG of each release.
Technical
---------------------
**What NVIDIA architectures will it support?**
CuTe DSL will support all NVIDIA GPU architectures starting with NVIDIA Ampere Architecture (SM80).
**Will it be compatible with DL frameworks (e.g., PyTorch, JAX)?**
Yes, we will provide utilities to convert from DLPack-supported tensor formats
to ``cute.Tensor``. This should allow a user to never have to leave Python
when writing model code in their framework of choice. Our JAX interoperability story is not
as strong as PyTorch's today; however, we are actively working on improving it
and welcome contributions in this space.
**Does it compile to PTX or SASS?**
CuTe DSL compiles the program down to PTX. After that, we currently use the PTX compiler that
ships with the CUDA toolkit to compile the PTX down to SASS. We plan to remove
this limitation in the future and allow the use of the PTX JIT that is included in the
CUDA driver in case a user does not have a CUDA toolkit installed.
**Do I need to use NVCC or NVRTC?**
No, the ``nvidia-cutlass-dsl`` wheel packages everything needed to generate GPU kernels. It
shares the driver requirements of the 12.9 toolkit which can be found
`here <https://developer.nvidia.com/cuda-toolkit-archive>`__.
**How would one debug the code?**
Since CuTe DSL is an embedded DSL rather than native Python, tools like ``pdb``
cannot be used. However, if you have experience with GPU kernel programming, the debugging
techniques will be nearly identical. Typically, compile-time and runtime printing
of types and values is the most expedient approach. Please see `documentation on printing <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/print.ipynb>`__
to learn how to print types and values at both compile time and runtime.
You can also use ``cuda-gdb`` to set breakpoints in the program and step through the execution
or use tools such as ``compute-sanitizer`` to detect and triage bugs in your program. As the DSL
matures, our source location tracking from Python user programs will also improve to provide
more helpful source-level mapping when setting breakpoints and using other tools such as nsight.
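As a minimal sketch of the two styles of printing mentioned above (following the conventions used in the printing notebook):
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.jit
def debug_example(a: cutlass.Int32):
    print(a)                    # compile-time print: shows the traced value/type information
    cute.printf("a = %d\n", a)  # runtime print: shows the actual value during execution
debug_example(42)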
**How would one implement warp specialization in CuTe DSL?**
Exactly the same way you would in C++ but in a Python-native syntax instead.
Consult our :doc:`cute_dsl_general/dsl_control_flow` and
`"Blackwell kernel example" <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py>`__
for a detailed how-to guide.
**Can I call functions from other functions or use OOP?**
Yes. We frequently call functions from one another and set up class
hierarchies to organize and modularize our code for pipelines and schedulers.
Consult the :doc:`cute_dsl_general/dsl_introduction` documentation or our examples for more details.
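As a small illustrative sketch (the helper and function names here are made up), a plain Python helper can be called from a JIT function and is traced inline:
.. code-block:: python
import cutlass
import cutlass.cute as cute
def scale_and_add(a, b, factor):
    # Ordinary Python function, inlined when traced from JIT code
    return a * factor + b
@cute.jit
def compute(a: cutlass.Int32, b: cutlass.Int32):
    cute.printf("result = %d\n", scale_and_add(a, b, 2))
compute(3, 4)  # prints: result = 10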
License
---------------------
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,
it is subject to usage terms and restrictions similar to those of the CUDA SDK. Please refer to the EULA for specific terms of use.
CuTe DSL samples and Jupyter notebooks, released `on GitHub <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL>`__, are provided under
the BSD 3-Clause License and may be used and redistributed under those terms. This distinction ensures that developers have flexibility
when using or modifying the code samples, independent of the compiler and runtime components governed by the EULA.
If you have any questions or need clarification, feel free to contact us.

View File

@ -0,0 +1,34 @@
.. _functionality:
Functionality
====================
The CUTLASS DSL 4.0 release supports **Python 3.12** only. It shares the same driver requirements
as the `CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`__.
Specifically, the driver version must be 575.51.03 or later.
Currently, only Linux x86_64 is supported. Additional platform support will be added in future releases.
Supported MMA Operations
---------------------------------
**NVIDIA Ampere Architecture:**
- FP16 / BF16 tensor core instructions
**NVIDIA Hopper Architecture:**
- FP16 / BF16
- FP8
**NVIDIA Blackwell Architecture:**
- FP16 / BF16
- TF32
- I8
- F8
Notable Limitations
------------------------------
For current constraints and unsupported features, refer to the :doc:`limitations` section.

View File

@ -0,0 +1,279 @@
.. _limitations:
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------
CuTe DSL is an embedded domain-specific language within Python. It utilizes a subset of Python's
syntax to provide a streamlined programming experience. It is important to understand that CuTe DSL
does NOT implement the complete Python language semantics in its JIT compilation process.
This section documents the current limitations of the CuTe DSL. While some of these limitations
may be addressed in future releases, developers should be aware of them when building applications with
the DSL.
Notable unsupported features
----------------------------
- GeForce RTX 50 Series support
- RS WGMMA (the input matrix A comes from registers and the input matrix B comes from shared memory)
- Programmatic Dependent Launch (PDL)
- narrow-precision data type support, including related tensor core instructions
- convolutions
- full support for ahead of time compilation
- preferred clusters
- CLC-based tile schedulers
- EVT support
- Windows support
Programming Model
---------------------
**Python Native Data Types**
CuTe DSL supports Python data structures when used for "meta-programming,"
but these structures cannot be treated as dynamic values modifiable at runtime.
For instance, lists and dictionaries can be used to configure kernel parameters
during compilation or serve as containers for dynamic values,
but their structure and organization cannot be altered during kernel execution.
- **Static Values:**
- Evaluated during JIT compilation phase
- Immutable after compilation completes
- Most Python native types (lists, tuples, dictionaries) are processed as static values
- Primarily utilized for "meta-programming" and configuration purposes
- Example: Lists can contain dynamic values but their structure cannot
be modified during kernel execution
- **Dynamic Values:**
- Evaluated during runtime execution
- Modifiable during execution of JIT-compiled functions
- Only a specific subset of Python types are supported as dynamic values
- Primitive types are automatically converted when passed as function arguments (see the sketch after this list):
- ``int`` → ``Int32`` (may be updated to ``Int64`` in future releases)
- ``bool`` → ``Bool``
- ``float`` → ``Float32`` (may be updated to ``Float64`` in future releases)
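The following sketch illustrates these automatic conversions; it relies on compile-time printing as shown in the framework integration examples and is for illustration only.
.. code:: python
@cute.jit
def show_converted_types(i, b, f):
    # A Python int, bool and float arrive as Int32, Bool and Float32 respectively
    print(type(i), type(b), type(f))
show_converted_types(1, True, 2.0)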
The JIT compiler processes Python native types analogously to C++ template parameters.
The compiled code cannot manipulate dynamic values of composite types
such as lists, tuples, or dictionaries.
For example, the following code doesn't work as a traditional Python program inside a JIT function.
.. code:: python
@cute.jit
def foo(a: Float32, b: Float32, i: Int32, res: cute.Tensor):
xs = [a, b]
# indexing list with dynamic index is not supported in CuTe DSL:
res[0] = xs[i]
if i == 0:
# This will always append Float32(3.0) to the list regardless
# of the runtime value of `i`
xs.append(Float32(3.0))
for i in range_dynamic(10):
# This only appends one element to the list at compile time
# as the loop doesn't unroll at compile time
xs.append(Float32(1.0))
**Python Function**
The DSL currently does not implement support for return values from Python functions,
although this capability is planned for future releases.
Example:
.. code:: python
@cute.jit
def foo():
return 1 # Currently unsupported in CuTe DSL
**Expression or Statement with Dependent Type**
CuTe DSL implements static typing and does not support dependent types.
The type of each expression must be determinable during compile time,
in contrast to standard Python which implements dynamic typing.
Example illustrating functionality in Python that is not supported in the DSL:
.. code:: python
# Valid in standard Python, but unsupported in CuTe DSL
max(int(1), float(2.0)) # => 2.0 : float
max(int(3), float(2.0)) # => 3 : int
In CuTe DSL, types are promoted. For example:
.. code:: python
@cute.jit
def foo(a: Int32, b: Float32, res: cute.Tensor):
res[0] = max(a, b) # Type is automatically promoted to Float32
Following code using inlined if-else expression with dependent types
is not supported in CuTe DSL:
.. code:: python
@cute.jit
def foo(cond: Boolean, a: Int32, b: Float32, res: cute.Tensor):
res[0] = a if cond else b
**Control Flow**
The DSL transforms Python control flow statements (``if``, ``for``, ``while``)
during Abstract Syntax Tree (AST) processing into structured control flow in MLIR
which has the same constraints as dependent types. For instance,
changing the type of a variable in a loop body is not allowed.
- Variables must be defined prior to the control flow statement
- Type consistency must be maintained throughout the control flow statement
- Early exit or return from if-else statements is not supported
Example illustrating functionality in Python that is not supported in the DSL:
.. code:: python
@cute.jit
def foo():
a = Int32(1)
for i in range_dynamic(10):
a = Float32(2) # Changing type inside loop-body is not allowed in the DSL
**Built-in Operators**
The DSL transforms built-in operators like ``and``, ``or``, ``max``, ``min``, etc.
into MLIR operations. They also follow the same constraints of dependent types.
For instance, ``a and b`` requires ``a`` and ``b`` to be of the same type.
Comparisons like ``==`` on a sequence of dynamic values are known to not produce
the expected result at runtime.
**Object Oriented Programming**
The DSL is implemented on top of Python and supports Python's object-oriented programming (OOP) features
for meta-programming at compile-time.
However, similar to other composite data types, the DSL provides limited support for OOP when objects
contain dynamic values. It is strongly recommended to avoid passing dynamic values between member methods
through class state in your code.
The following example illustrates functionality in Python that is not supported in the DSL
without implementing the ``DynamicExpression`` protocol:
.. code:: python
class Foo:
def __init__(self, a: Int32):
self.a = a
def set_a(self, i: Int32):
self.a = i
def get_a(self):
return self.a
@cute.jit
def foo(a: Int32, res: cute.Tensor):
foo = Foo(a)
for i in cutlass.range_dynamic(10):
foo.set_a(i)
# This fails to compile because `a` is assigned a local value defined within the for-loop body
# and is not visible outside of the loop body
res[0] = foo.get_a()
The example above fails to compile because ``Foo.a`` is assigned a local value defined within the for-loop body,
which is not visible outside the loop body.
The CuTe DSL implements an internal mechanism that provides limited support for OOP patterns via protocol.
As the DSL continues to evolve to support additional features, this mechanism is subject to change
and is not recommended for direct use in users' code for better portability.
**CuTe Layout algebra in native Python**
The entirety of CuTe layout algebra operations and APIs requires JIT compilation. These
functionalities are exclusively available within JIT-compiled functions and cannot be
accessed in standard Python execution environments.
Additionally, there exists a restricted set of data types that can be passed as arguments
to JIT-compiled functions, which further constrains their usage in native Python contexts.
Only following CuTe algebra types are supported as JIT function arguments: ``Tensor``, ``Pointer``,
``Shape``, ``Stride``, ``Coord`` and ``IntTuple``. For ``Stride``, we don't support ``ScaledBasis``
from a native Python context. Unfortunately, in the first release, we don't support
passing ``Layout`` under a native Python context.
Suggestions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For reliable and predictable results:
- Avoid dependent types in your code
- Implement explicit type conversion for dynamic values
- Clearly distinguish between static (compile-time) and dynamic (runtime) values
- Use type annotations as much as possible to help JIT compiler
to identify type to avoid ambiguity
.. code:: python
# Example demonstrating explicit typing
alpha = 1.0 # Explicitly defined as float using `1.0` instead of `1`
# or `float(1)`
beta = 2.0 # Explicitly defined as float
result = max(alpha, beta) # Will correctly perform float comparison
**Debugging Capabilities**
Debugging tools and facilities for the Python DSL are currently more limited in comparison to the C++
API. For instance, we don't support single-stepping through the JIT-compiled code, and the lack of exception
handling in JIT-compiled code makes it hard to debug in some cases.
**Integration with Frameworks**
Integration with certain deep learning frameworks is in early development stages and may have
limitations. For instance, converting a framework tensor to ``cute.Tensor`` is known to incur an overhead
of 2 to 3 us per tensor, as the conversion goes through the general DLPack protocol, which offers compatibility with
all frameworks.
**Hashing DSL APIs and Objects**
DSL APIs and objects are sensitive to the MLIR context, region, or other contextual information, which has no meaning across
different contexts. Any stateful design relying on ``__hash__`` is likely to misbehave with unexpected results. An example is
``functools.lru_cache``: combined with ``@cute.jit``, it may cache an MLIR object from one context and use it in another.
Future Improvements
---------------------
The CuTe DSL development team is actively addressing these limitations.
Upcoming releases will aim to:
- Implement support for return values from JIT compiled functions
- Improve support for built-in operators to handle more cases without dependent types
- Enhance debugging capabilities and tools
- Improve error messages with precise diagnostic information
- Extend support for additional numeric data types
- Improve performance of converting framework tensor to ``cute.Tensor`` with native support
for different frameworks
- Offer more user friendly benchmarking methodology
Design Limitations Likely to Remain
--------------------------------------------
The primary objective of CuTe DSL is to provide a domain-specific language for expressing
complex CUDA kernels with optimal GPU performance, not to execute arbitrary Python code on GPU hardware.
The following limitations will likely remain by design:
- **Complex Data Structures as Dynamic Values**: Lists, tuples, and dictionaries will continue to function
as static containers. While they can store dynamic values, their structure (adding/removing elements)
cannot be modified during execution of JIT-compiled functions.
- **Dependent Types**: Supporting dependent types would introduce substantial complexity and
adversely affect the performance characteristics of generated code.
- **CuTe Layout Algebra**: We don't have plans to extend the support of CuTe layout algebra
under a native Python context. We are planning to extend support for data types and allow
JIT functions to interoperate with native Python code.

View File

@ -0,0 +1,108 @@
.. _overview:
Overview
===========================
CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development.
By providing Python-based DSLs to the powerful CUTLASS C++ template library, it enables
faster iteration, easier prototyping, and a gentler learning curve for high-performance linear
algebra on NVIDIA GPUs.
Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs).
With the release of 4.0, we are releasing the first of these in CuTe DSL.
This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing
core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
Why CUTLASS DSLs?
============================
While CUTLASS offers exceptional performance through its C++ template abstractions, the complexity
can present challenges for many developers. CUTLASS 4.x addresses this by:
- **Simplifying metaprogramming**: Metaprogramming in Python is a lot more intuitive than with C++
- **Accelerating Iteration**: Rapid prototyping with familiar Python syntax and blazing fast compile times
- **Lowering Barriers**: Reduced learning curve for GPU programming concepts and consistency between CuTe C++ and DSL
- **Maintaining Performance**: Generated code leverages optimized CUTLASS primitives
Students can learn GPU programming concepts without the complexity of C++ templates.
Researchers and performance engineers can rapidly explore algorithms, prototype, and tune
kernels before moving to production implementations.
Key Concepts and Approach
================================
CUTLASS DSLs translate Python code into a custom intermediate representation (IR),
which is then Just-In-Time (JIT) compiled into optimized CUDA kernels using MLIR and `ptxas`.
Core CuTe DSL Abstractions
-----------------------------------
- **Layouts**: Describe how data is organized in memory and across threads.
- **Tensors**: Combine data pointers or iterators with layout metadata.
- **Atoms**: Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations**: Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
**Pythonic Kernel Expression**
Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.
The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
**JIT Compilation**
Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA's ``ptxas`` toolchain,
enabling rapid iteration and interactive debugging.
Relationship to CUTLASS C++
=================================
CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, it aims to be a high-productivity kernel
authoring framework that shares all concepts with CUTLASS 3.x C++ API such as CuTe, pipelines, schedulers etc.
- **Performance**: Generated kernels aim to match CUTLASS C++ kernels in performance; however, some performance gaps
may exist due to missing optimizations that have been added over the years to CUTLASS C++ and may be missing in the DSLs examples.
- **Library**: The CUTLASS DSLs do not currently ship with a full GEMM/Conv autotuning profiler or library interface
akin to CUTLASS C++. Instead, it focuses on generating and autotuning individual kernel instances (for example, via tile size exploration) and on native integration with DL frameworks that support auto-tuning.
Getting Started
================================
- :doc:`quick_start`: Initial setup and installation.
- :doc:`cute_dsl`: Overview of the typical development workflow using CuTe DSL.
- :doc:`cute_dsl_api`: Refer to the full API documentation.
- :doc:`limitations`: Understand current CuTe DSL constraints and differences from C++.
- :doc:`faqs`: Common questions and known issues.
Current Status & Roadmap
=================================
CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to
change as we improve the system.
Upcoming Milestones
----------------------------------
- Public release targeted for **Summer 2025**
- Expanded support for additional data types and kernel types
- Usability improvements: better error messages, debugging tools, and streamlined APIs
- Broader integration of CUTLASS primitives and features
For known issues and workarounds, please consult the :doc:`limitations` and :doc:`faqs`.
Community & Feedback
==================================
We welcome contributions and feedback from the developer community!
You can:
- Submit bug reports or feature requests via our `GitHub Issues page <https://github.com/NVIDIA/cutlass/issues>`__
- Join the CUTLASS community on `Discord <https://discord.com/channels/1019361803752456192/1150868614921064590>`__ to ask questions and share ideas
- Contribute examples, tutorials, or enhancements to the DSLs
- Report unclear or missing documentation
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues
Thank you for helping shape the future of CUTLASS DSLs!

View File

@ -0,0 +1,31 @@
.. _quick_start:
Quick Start Guide
=======================
The CUTLASS DSL 4.0 release currently supports **Linux** and **Python 3.12** only. To install CUTLASS DSLs (limited to CuTe DSL for now), use the command below.
Installation
-----------------------
To install the CUTLASS DSL, run:
.. code-block:: bash
pip install nvidia-cutlass-dsl
The ``nvidia-cutlass-dsl`` wheel includes everything needed to generate GPU kernels. It requires
the same NVIDIA driver version as the
`CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_.
To ensure compatibility with the examples and code on `GitHub <https://github.com/NVIDIA/cutlass/tree/main/python>`_,
use the ``requirements.txt`` file from the corresponding commit in the repository.
Recommended Dependencies
---------------------------------
To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter