v4.0 update. (#2371)

Author: Junkai-Wu
Date: 2025-06-06 14:39:20 +08:00
Committed by: GitHub
Parent: 2e2af190bd
Commit: 8bdbfca682
254 changed files with 29751 additions and 1980 deletions

View File

@ -6,7 +6,7 @@ A GEMM workload usually consists of three phases: prologue, mainloop and epilogu
Consider a GEMM that has `20x20x1` output tiles (400 tiles in total), running on a GPU with `100` SMs. Another kernel occupies all the resources of `20` SMs, so only `80` SMs can be used. Assume the cluster shape is `1x1x1`. The following diagram shows what the schedule would look like for such a kernel.
<p align="center"><img src=../images/non_persistent.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are evenly divided among available SMs](../../images/non_persistent.png "GEMM Scheduling with Limited SM Resources")
### Static Scheduler
@ -14,7 +14,7 @@ CUTLASS has adopted a software technique named **persistent kernels**. Persisten
However, the static scheduler is susceptible to workload imbalance if the resources of some SMs are unavailable. The following diagram illustrates this issue.
<p align="center"><img src=../images/persistent_static.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are unevenly divided among available SMs, leading to workload imbalance](../../images/persistent_static.png "Imbalanced Workload Scheduling due to Static Scheduler")
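To make the static strategy concrete, a statically scheduled persistent kernel assigns tiles by a fixed rule, typically striding a linear tile index by the grid size. The sketch below is illustrative pseudo-CUDA, not CUTLASS's actual scheduler:
```c++
__global__ void persistent_gemm_static(int num_tiles /*, ... */) {
  // Each CTA claims tiles blockIdx.x, blockIdx.x + gridDim.x, blockIdx.x + 2*gridDim.x, ...
  // The assignment is fixed at launch time, so if some SMs are busy with another kernel,
  // the CTAs that did land on free SMs still own only their predetermined share of tiles.
  for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
    // prologue + mainloop + epilogue for output tile `tile`
  }
}
```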
### Dynamic Scheduler with Cluster Launch Control
A fundamental limitation of static persistent scheduling is that the number of SMs the kernel can actually utilize is not known ahead of time. Some SMs might be occupied by another kernel, making their resources unavailable. This makes it challenging to load-balance work across SMs.
@ -32,7 +32,7 @@ Cluster launch control follows the below rules:
The following diagram shows what the schedule would look like with cluster launch control.
<p align="center"><img src=../images/persistent_clc.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are dynamically allocated among available SMs, leading to a balanced workload](../../images/persistent_clc.png "Dynamic Scheduler with Cluster Launch Control")
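In pseudo-code, each persistent cluster repeatedly asks the launcher to cancel a cluster that has not yet launched and, on success, adopts the cancelled cluster's coordinates as its next tile. The sketch below is illustrative only; `process_tile`, `try_cancel_cluster_launch`, and the result accessors are placeholder names, not the CUTLASS API:
```c++
__global__ void persistent_gemm_clc(/* ... */) {
  int2 tile = make_int2(blockIdx.x, blockIdx.y);  // the tile this cluster was launched with
  while (true) {
    process_tile(tile);                           // prologue + mainloop + epilogue for one tile
    auto result = try_cancel_cluster_launch();    // placeholder for the cluster launch control query
    if (!result.is_valid()) { break; }            // nothing left to cancel: all tiles are claimed
    tile = result.cancelled_block_coord();        // steal the cancelled cluster's coordinates
  }
}
```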
## Programming Model
### Pseudo Code

View File

@ -142,8 +142,8 @@ Put into words, `A o B = A o s:d`, for integral `s` and `d` means that we want (
* `(6,2) / 3 => (2,2)`
* `(6,2) / 6 => (1,2)`
* `(6,2) / 12 => (1,1)`
* `(3,6,2,8) / 3 => (1,6,2,8)`
* `(3,6,2,8) / 6 => (1,3,2,8)`
* `(3,6,2,8) / 3 => (1,3,2,8)`
* `(3,6,2,8) / 6 => (1,6,2,8)`
* `(3,6,2,8) / 9 => (1,2,2,8)`
* `(3,6,2,8) / 72 => (1,1,1,4)`
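For concreteness, the left-to-right division illustrated by the examples above can be modeled with a small helper. This is a toy sketch, not the CuTe implementation:
```c++
#include <cassert>
#include <vector>

// Divide a shape by an integer, consuming modes from left to right.
std::vector<int> shape_div(std::vector<int> shape, int divisor) {
  for (int& mode : shape) {
    if (divisor == 1) break;
    if (mode % divisor == 0) { mode /= divisor; divisor = 1; } // divisor fits inside this mode
    else {
      assert(divisor % mode == 0);  // otherwise the division is ill-formed
      divisor /= mode; mode = 1;    // consume the whole mode and keep dividing
    }
  }
  return shape;  // e.g. shape_div({6,2}, 12) == {1,1}
}
```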

View File

@ -10,53 +10,58 @@ For example, we might want to tile a 41 x 55 matrix into 4 x 8 tiles,
but 41 / 4 is 10 remainder 1, and 55 / 8 is 6 remainder 7.
What do we do with those "leftover" parts of the matrix?
Another way to say this, is that `logical_divide`
To start, we note that `logical_divide`
(CuTe's way of tiling layouts) "rounds up."
For example, if `N` is the layout (1000, 1) and `B` is the layout (128, 1),
then `logical_divide(N, B)` is the layout ((128, 8), (1, 128)).
This effectively rounds up the original shape N = 1000
into an 128 x 8 matrix (as if N = 1024).
For example, if `N` is the layout `1000:1` and `B` is the layout `128:1`,
then `logical_divide(N, B)` is the layout `(128, 8):(1, 128)`.
This effectively rounds up the original shape `N = 1000`
into a `128 x 8` matrix (as if `N = 1024`).
What about those last 24 elements,
that aren't part of the original data?
that aren't part of the original data? How is the last tile handled and how do we avoid indexing out-of-bounds?
The idiomatic CuTe way to solve this problem is through "predication."
Rather than trying to reason about the "remainder tiles,"
CuTe instead rounds up, but only tries to access data in each tile
that are part of the matrix.
As in other introductions to CUDA programming, the idiomatic CuTe way to address these issues is through "predication."
Rather than attempting to reason about the "remainder tiles" by trying to represent "7 tiles of size-128 and 1 tile of size-104,"
CuTe instead rounds up to "8 tiles of size-128" and constructs predicates so that the kernel
only tries to access data in each tile that are valid within the matrix.
This corresponds well with how our GPUs optimize:
branches without warp divergence are relatively fast.
It also matches the usual CUDA idiom
when dividing N work items in 1-D fashion over B thread blocks:
first test if "my thread" is out of bounds before doing work.
There are a few ways to figure out
which elements need to be predicated.
In-kernel GEMMs like to do this in the following way.
Consider a generic tiling in which a size-1000 vector is tiled into size-128 chunks. A predicate tensor can then be constructed as follows:
```c++
// Create the predicate tensor
Layout idA = make_layout(shape(A)); // e.g. 1000:1
Layout idAB = logical_divide(idA, B); // e.g. (128,8):(1,128)
Tensor gmem = ... // e.g. size 1000
Tensor smem = ... // e.g. size 128
Tensor pred = make_tensor<bool>(shape(idAB));
// Tile the gmem for smem
Tensor gmem_tiled = logical_divide(gmem, size(smem)); // e.g. (128,8)
// Create an identity layout for gmem and tile it similarly
Layout id_layout = make_layout(shape(gmem)); // e.g. 1000:1, explicitly constructed as identity function
Layout id_tiled = logical_divide(id_layout, size(smem)); // e.g. (128,8):(1,128), but many elements aren't "valid"
// Create a predicate tensor
Tensor pred = make_tensor<bool>(shape(id_tiled)); // e.g. (128,8)
for (int i = 0; i < size(pred); ++i) {
pred(i) = idAB(i) < size(A);
pred(i) = id_tiled(i) < size(id_layout); // Predicate: Is the offset within the original shape?
}
// ... intervening code ...
// Use the predicate tensor. c is some coordinate.
// This code would likely live inside some algorithm.
if (pred(c)) { copy(idAB(c), smem(c)); }
// Note that gmem_tiled, id_tiled, and pred tensors are all congruent
// For tile tile_i, determine if element value_j is in-bounds and copy to smem
if (pred(value_j,tile_i)) { smem(value_j) = gmem_tiled(value_j,tile_i); }
```
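The per-element guard in the last line can equivalently be expressed with `copy_if`, sketched here for a single tile `tile_i` of the tensors constructed above:
```c++
// Copy tile `tile_i` from global to shared memory, masking off out-of-bounds elements.
copy_if(pred(_, tile_i), gmem_tiled(_, tile_i), smem);
```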
The general procedure is that we
1. create an "identity" layout (`Layout idA = make_layout(shape(A))`,
1. create an "identity" layout (`Layout id_layout = make_layout(shape(gmem))`,
in the above example) with the same shape as our original data;
2. repeat the same tiling/partitioning/slicing (possibly rounding up)
on that identity layout (`Layout idAB = logical_divide(idA, B)`);
on that identity layout (`Layout id_tiled = logical_divide(id_layout, size(smem));`);
3. create a "predicate tensor" by comparing the coordinates
of that reference layout with the bounds of the original layout;
@ -64,19 +69,119 @@ The general procedure is that we
4. use the predicate tensor to mask off accesses to out-of-bounds elements.
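Schematically, with placeholder names (`mX` stands for the full data tensor, `tXgX`/`tXsX` for its partitioned global and shared-memory views, and the partitioning call for whatever the kernel already uses), the four steps look like this:
```c++
// 1. Identity/coordinate tensor with the same shape as the original data
Tensor cX   = make_identity_tensor(shape(mX));              // coord -> coord
// 2. Apply exactly the same tiling/partitioning used for the data
Tensor tXcX = local_partition(cX, tX, thread_idx);          // or local_tile, partition_C, ...
// 3. Predicate tensor: compare the carried coordinates against the original bounds
Tensor tXpX = make_tensor<bool>(shape(tXcX));
for (int i = 0; i < size(tXpX); ++i) {
  tXpX(i) = elem_less(tXcX(i), shape(mX));
}
// 4. Mask off out-of-bounds accesses with the predicate
copy_if(tXpX, tXgX, tXsX);
```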
For example, suppose that we've partitioned A and B tiles
across threads as follows.
As a relatively simple example, consider predicating the epilogue of a GEMM.
Suppose that we've partitioned `mC` into cta tiles and across threads of an mma as follows.
```c++
Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k)
Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K,PIPE)
```cpp
// CTA partitioning
auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k)
Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N)
Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k)
Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K,PIPE)
// Thread partitioning
auto thr_mma = mma.get_slice(threadIdx.x);
Tensor tCgC = thr_mma.partition_C(gC); // (MMA,MMA_M,MMA_N)
Tensor tCrC = thr_mma.make_fragment_C(tCgC); // (MMA,MMA_M,MMA_N)
// ... Compute gemms and accumulate into tCrC ...
// axpby epilogue
for (int i = 0; i < size(tCgC); ++i) {
tCgC(i) = alpha * tCrC(i) + beta * tCgC(i);
}
```
`tAgA` and `tBgB` partition the global A resp. B matrices over threads,
and `tAsA` and `tBsB` partition the shared memory tiles of A resp. B over threads.
Then, following the predication procedure is straightforward,
```cpp
// A coordinate tensor the same shape as mC: (m,n) -> (m,n)
Tensor cC = make_identity_tensor(shape(mC));
// Repeat partitioning steps applied to mC to our coordinate tensor cC
// CTA partitioning
Tensor cta_cC = local_tile(cC, cta_tiler, cta_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N) -> (m,n)
// Thread partitioning
Tensor tCcC = thr_mma.partition_C(cta_cC); // (MMA,MMA_M,MMA_N) -> (m,n)
// Predicated axpby epilogue
for (int i = 0; i < size(tCgC); ++i) {
if (elem_less(tCcC(i), shape(mC))) { // if coord is in-bounds
tCgC(i) = alpha * tCrC(i) + beta * tCgC(i);
}
}
```
Above, the cta is responsible for tiling/partitioning `mC` and the mma is responsible for tiling/partitioning `gC`,
so both steps are also applied to the identity tensor.
The coordinate tensor `tCcC` is congruent with the register fragment `tCrC` and the partitioned global memory tensor `tCgC`, which are this thread's subtensors of the tile of data. However, `tCcC` retains its original codomain when evaluated: a global coordinate into the original tensor `mC`. This global coordinate is compared against the shape of `mC` to determine whether the operation is valid.
Advantages of this "reference identity tensor" or "coordinate tensor" approach include:
1. There is no dependence on the layout/strides of the tensor
being predicated, just the logical bounds imposed.
2. The partitioning stage(s) can be anything. A CTA tiling, a thread partitioning, a TiledMMA, and a TiledCopy can all be applied to any tensor, including a coordinate tensor.
3. It naturally extends to any-dimensional predication.
4. It's a natural generalization of a typical CUDA 1-D
parallel vector access pattern,
which computes an access index `idx` and predicates access to the vector's `idx`-th element on whether `idx` is in-bounds.
```cpp
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx < N) // idx is a "coord" into gmem and N is the "bound"
gmem_ptr[idx] = ...;
```
In a SIMT programming model, we do not shrink the tensor extents just to keep loops from overrunning.
Instead, predication is the general method: query each element's original coordinate and check whether that coordinate overruns the original bounds.
This avoids variable/dynamic loop bounds in favor of instruction-level predication, preserving thread coherence and maintaining load balance.
It's also general enough to extend to all ranks, all layouts of threads and data, and all tiling/partitioning patterns.
Assumptions can be built into the coordinate tensors or the predicate tensors to account for special cases.
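To contrast the two approaches with a hypothetical fragment (`my_residual_extent`, `TILE_SIZE`, `coord`, `bound`, and `process` are illustrative names, not CuTe API):
```c++
// Dynamic trip count: each thread/CTA iterates a different number of times,
// which prevents uniform unrolling and lets threads fall out of step.
for (int i = 0; i < my_residual_extent; ++i) { process(i); }

// Fixed trip count plus predication: every thread executes the same loop structure
// and simply masks off the out-of-bounds elements.
CUTE_UNROLL
for (int i = 0; i < TILE_SIZE; ++i) {
  if (elem_less(coord(i), bound)) { process(i); }
}
```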
As another slightly more complex example, consider the m- and n-predication of A and B loads in a GEMM. Suppose that we've partitioned A and B tiles across ctas and threads as follows.
```c++
// CTA partitioning
auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k)
Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k)
Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k)
Tensor sA = make_tensor(make_smem_ptr(smemA), sA_layout); // (BLK_M,BLK_K)
Tensor sB = make_tensor(make_smem_ptr(smemB), sB_layout); // (BLK_N,BLK_K)
// Thread partitioning
Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k)
Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K)
Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k)
Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K)
```
`gA` and `gB` are tiles of `mA` resp. `mB` according to `cta_tiler` and the `cta_coord`.
`tAgA` and `tBgB` are partitions of `gA` resp. `gB` according to the thread layouts `tA` and `tB`
and `thread_idx`.
The following code creates "identity tensors" that map coordinates `(m,k) -> (m,k)` and `(n,k) -> (n,k)`.
```c++
// Coordinate tensors
Tensor cA = make_identity_tensor(shape(mA)); // (m,k) -> (m,k)
Tensor cB = make_identity_tensor(shape(mB)); // (n,k) -> (n,k)
```
Then, the reference tensors are tiled and partitioned
in exactly the same way the `mA` and `mB` tensors were tiled and partitioned
into `tAgA` and `tBgB`.
```c++
// CTA partitioning
Tensor cta_cA = local_tile(cA, cta_tiler, cta_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k) -> (m,k)
Tensor cta_cB = local_tile(cB, cta_tiler, cta_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k) -> (n,k)
// Thread partitioning
Tensor tAcA = local_partition(cta_cA, tA, thread_idx); // (THR_M,THR_K,k) -> (m,k)
Tensor tBcB = local_partition(cta_cB, tB, thread_idx); // (THR_N,THR_K,k) -> (n,k)
```
The following code creates predicate tensors
corresponding to `tAgA` and `tBgB`.
@ -84,166 +189,35 @@ They will be computed once in the prologue.
and will be used to mask off instructions in the inner loop.
```c++
Tensor tApA = make_tensor<bool>(make_shape (size<0>(tAgA), size<1>(tAgA)),
Tensor tApA = make_tensor<bool>(make_shape (size<0>(tAcA), size<1>(tAcA)),
make_stride( Int<1>{}, Int<0>{}));
Tensor tBpB = make_tensor<bool>(make_shape (size<0>(tBgB), size<1>(tBgB)),
Tensor tBpB = make_tensor<bool>(make_shape (size<0>(tBcB), size<1>(tBcB)),
make_stride( Int<1>{}, Int<0>{}));
```
We're only thread-parallelizing over the leftmost (row) dimension,
so we only need to predicate over the leftmost dimension.
Thus, we can make the rightmost (column) stride zero,
since we will never actually address the rightmost dimension.
The following code creates "two-dimensional identity tensors"
that map coordinates (m,k) -> (m,k)
for the tile of data within the thread block.
Here, we make a few assumptions: we only need predicates for one tile of data at a time, and we only predicate the m- and n-modes, handling the k-mode predicates differently.
The m- and n-predicates are treated as constant across every tile and are reused in every iteration of the mainloop.
Thus, we only store the predicates for the m- and n-modes and broadcast them across the k-mode.
When populating the tensors, we carry the same assumption through:
```c++
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
```
The following lines then tile and partition
the two reference tensors
in exactly the same way the data were tiled and partitioned
into `tAsA` and `tBsB`.
```c++
Tensor tAcA = local_partition(cA, tA, thread_idx);
Tensor tBcB = local_partition(cB, tB, thread_idx);
```
Tiling and partitioning affect the offset and domain,
but not the codomain of the tensors,
so we're left with tensors that map `(thr_m,thr_k) -> (m,k)`
where `(thr_m,thr_k)` is this particular thread's subtensor of the tile
and `(m,k)` is the original codomain: a coordinate into the original tile.
The unrolled loops in the code below then compare
the m- and n-coordinates of those tensors with our known maximums
to mask off elements we are not allowed to access.
```c++
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor tAcA = local_partition(cA, tA, thread_idx);
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
Tensor tBcB = local_partition(cB, tB, thread_idx);
// Populate
// Populate the m- and n-predicates
CUTE_UNROLL
for (int m = 0; m < size<0>(tApA); ++m) {
tApA(m,0) = get<0>(tAcA(m,0)) < m_max_coord;
tApA(m,0) = elem_less(get<0>(tAcA(m,0,0)), shape<0>(mA)); // Compare the m-coordinate
}
CUTE_UNROLL
for (int n = 0; n < size<0>(tBpB); ++n) {
tBpB(n,0) = get<0>(tBcB(n,0)) < n_max_coord;
tBpB(n,0) = elem_less(get<0>(tBcB(n,0,0)), shape<0>(mB)); // Compare the n-coordinate
}
```
Those last `for` loops fill in the two predicate tensors.
In this case, we only need to predicate over the leftmost dimension,
so we only address `(m,0)` resp. `(n,0)`.
and only compare the m- and n-coordinates of the 0th k-tile and 0th k-block. The stride-0 broadcasting mode still allows us to treat this data as a predicate tensor for each and every element of the tile to be loaded.
We can then use the predicate tensors in `copy_if`
to copy only the elements for which the corresponding
predicate tensor elements are nonzero.
Finally, we can use the predicate tensors in `copy_if` to copy only the elements for which the corresponding predicate tensor elements are `true`.
```c++
// Prefetch k_tile=0, gate these on k_residue as well
CUTE_UNROLL
for (int k = 0; k < size<1>(tAsA); ++k) {
if (get<1>(tAcA(0,k)) >= -k_residue) { // some other condition on the column index
copy_if(tApA, tAgA(_,k,0), tAsA(_,k,0));
}
}
CUTE_UNROLL
for (int k = 0; k < size<1>(tBsB); ++k) {
if (get<1>(tBcB(0,k)) >= -k_residue) { // some other condition on the column index
copy_if(tBpB, tBgB(_,k,0), tBsB(_,k,0));
}
}
```
Here are some advantages of this "reference tensor" approach.
1. It doesn't depend on the layout/strides of the tensor
being predicated, just the logical bounds being imposed.
2. The partitioning stage can be anything.
3. It naturally extends to any-dimensional predication.
4. It's a natural generalization of a typical CUDA 1-D
parallel vector access pattern,
which computes an access index `k`
(e.g., as `blockDim.x * blockIdx.x + threadIdx.x`)
and then predicates access to the vector's `k`-th element
on whether `k` is in bounds.
As an example of (3), the epilogue predication does exactly the same thing,
```c++
// Repeat with a tensor of coordinates for predication
Tensor cC = make_identity_tensor(make_shape(size<0>(gC), size<1>(gC)));
Tensor tCcC = thr_mma.partition_C(cC);
const bool isBetaZero = (beta == 0);
CUTE_UNROLL
for (int i = 0; i < size(tCrC); ++i) {
if (elem_less(tCcC(i), make_coord(m_max_coord,n_max_coord))) {
tCgC(i) = isBetaZero ? alpha * tCrC(i) : alpha * tCrC(i) + beta * tCgC(i);
}
}
```
but with the mma responsible for the tiling/partitioning `tCcC`
so that the reference subtensor matches the accumulator's subtensor.
Then, the reference subtensor is predicated against the `if` bounds
(in both m- and n-coordinates) inside the `for` loop.
Another way to explain this is that we don't modify the tiles
to give you the "right" extents so that you never overrun.
Instead, we let you query the original coordinate
to see if that coordinate overruns.
This avoids all branching and variable/dynamic loop bounds
(thus maintaining load balance and synchronicity,
both very important in-kernel) in favor of predication.
It's also general enough to extend to all ranks,
all layouts of threads and data,
and all tiling/partitioning patterns.
## Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
// Copy a k_tile from global memory to shared memory
copy_if(tApA, tAgA(_,_,k_tile), tAsA);
copy_if(tBpB, tBgB(_,_,k_tile), tBsB);
```

View File

@ -6,12 +6,12 @@ CuTe DSL
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Introduction <cute_dsl_general/dsl_introduction.rst>
Code Generation <cute_dsl_general/dsl_code_generation.rst>
Control Flow <cute_dsl_general/dsl_control_flow.rst>
JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>

View File

@ -3,10 +3,6 @@
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are offered within our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate

View File

@ -3,10 +3,6 @@
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.

View File

@ -6,10 +6,6 @@
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------

View File

@ -4,11 +4,8 @@
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
Control Flow
==================
.. contents::
:depth: 2
:local:
Overview

View File

@ -3,10 +3,6 @@
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================

View File

@ -4,12 +4,9 @@
.. |DSL| replace:: CuTe DSL
|DSL|
Introduction
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------

View File

@ -2,12 +2,9 @@
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
@ -39,7 +36,7 @@ By default, |DSL| assumes dynamic arguments and tries to infer the argument type
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
def foo(x: cutlass.Int32, y: cutlass.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2

View File

@ -3,11 +3,9 @@
.. _JIT_Caching:
|DSL| JIT Caching
JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------

View File

@ -4,10 +4,6 @@
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
@ -257,8 +253,7 @@ layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their stride are canonicalized to 0.
updated accordingly. For modes with a shape of size 1, the stride is canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
@ -322,10 +317,6 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1, its stride is degenerated to 1,
@ -337,14 +328,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
@ -353,21 +342,18 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)

View File

@ -124,7 +124,8 @@ Technical
License
---------------------
**Q:What is the license for CuTe DSL and the associated GitHub samples?**
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,

View File

@ -3,9 +3,6 @@
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------

View File

@ -42,7 +42,7 @@ Core CuTe DSL Abstractions
- **Atoms**: Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations**: Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.
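A minimal illustration using the CuTe C++ API referenced above (the layout sizes here are arbitrary and chosen only for the example):

.. code-block:: cpp

   #include <cute/tensor.hpp>
   using namespace cute;

   // A Copy_Atom wraps a single hardware copy instruction for a given element type;
   // make_tiled_copy spreads that atom over a 32x8 arrangement of threads,
   // each thread moving one value per instruction.
   TiledCopy tiled_copy = make_tiled_copy(
       Copy_Atom<UniversalCopy<float>, float>{},   // per-thread copy instruction
       Layout<Shape<_32,_8>>{},                    // thread layout within the tile
       Layout<Shape< _1,_1>>{});                   // value layout per thread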
**Pythonic Kernel Expression**

View File

@ -29,3 +29,12 @@ To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter
Recommended Python environment variables for jupyter notebooks
--------------------------------------------------------------
We recommend setting the following environment variable when running jupyter notebooks.
.. code-block:: bash
export PYTHONUNBUFFERED=1