v4.0 update. (#2371)

Author: Junkai-Wu
Date: 2025-06-06 14:39:20 +08:00
Committed by: GitHub
Parent: 2e2af190bd
Commit: 8bdbfca682
254 changed files with 29751 additions and 1980 deletions

View File

@ -6,7 +6,7 @@ A GEMM workload usually consists of three phases: prologue, mainloop and epilogu
Consider a GEMM that has `20x20x1` output tiles (400 tiles in total), running on a GPU with `100` SMs. Another kernel occupies all the resources of `20` SMs, so only `80` SMs can be used. Assume the cluster shape is `1x1x1`. The following diagram shows what the schedule would look like for such a kernel.
<p align="center"><img src=../images/non_persistent.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are evenly divided among available SMs](../../images/non_persistent.png "GEMM Scheduling with Limited SM Resources")
### Static Scheduler
@ -14,7 +14,7 @@ CUTLASS has adopted a software technique named **persistent kernels**. Persisten
However, the static scheduler is susceptible to workload imbalance if the resources of some SMs are unavailable. The following diagram illustrates this issue.
<p align="center"><img src=../images/persistent_static.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are unevenly divided among available SMs, leading to workload imbalance](../../images/persistent_static.png "Imbalanced Workload Scheduling due to Static Scheduler")
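To make the static strategy concrete, a statically scheduled persistent kernel assigns tiles by a fixed rule, typically striding a linear tile index by the grid size. The sketch below is illustrative pseudo-CUDA, not CUTLASS's actual scheduler:
```c++
__global__ void persistent_gemm_static(int num_tiles /*, ... */) {
  // Each CTA claims tiles blockIdx.x, blockIdx.x + gridDim.x, blockIdx.x + 2*gridDim.x, ...
  // The assignment is fixed at launch time, so if some SMs are busy with another kernel,
  // the CTAs that did land on free SMs still own only their predetermined share of tiles.
  for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
    // prologue + mainloop + epilogue for output tile `tile`
  }
}
```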
### Dynamic Scheduler with Cluster Launch Control
A fundamental limitation of static persistent scheduling is that the number of SMs the kernel can actually utilize is not known ahead of time. Some SMs might be occupied by another kernel, making their resources unavailable. This makes it challenging to load-balance work across SMs.
@ -32,7 +32,7 @@ Cluster launch control follows the below rules:
The following diagram shows what the schedule would look like with cluster launch control.
<p align="center"><img src=../images/persistent_clc.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
![GEMM tiles are dynamically allocated among available SMs, leading to a balanced workload](../../images/persistent_clc.png "Dynamic Scheduler with Cluster Launch Control")
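In pseudo-code, each persistent cluster repeatedly asks the launcher to cancel a cluster that has not yet launched and, on success, adopts the cancelled cluster's coordinates as its next tile. The sketch below is illustrative only; `process_tile`, `try_cancel_cluster_launch`, and the result accessors are placeholder names, not the CUTLASS API:
```c++
__global__ void persistent_gemm_clc(/* ... */) {
  int2 tile = make_int2(blockIdx.x, blockIdx.y);  // the tile this cluster was launched with
  while (true) {
    process_tile(tile);                           // prologue + mainloop + epilogue for one tile
    auto result = try_cancel_cluster_launch();    // placeholder for the cluster launch control query
    if (!result.is_valid()) { break; }            // nothing left to cancel: all tiles are claimed
    tile = result.cancelled_block_coord();        // steal the cancelled cluster's coordinates
  }
}
```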
## Programming Model
### Pseudo Code

View File

@ -142,8 +142,8 @@ Put into words, `A o B = A o s:d`, for integral `s` and `d` means that we want (
* `(6,2) / 3 => (2,2)`
* `(6,2) / 6 => (1,2)`
* `(6,2) / 12 => (1,1)`
* `(3,6,2,8) / 3 => (1,6,2,8)`
* `(3,6,2,8) / 6 => (1,3,2,8)`
* `(3,6,2,8) / 3 => (1,3,2,8)`
* `(3,6,2,8) / 6 => (1,6,2,8)`
* `(3,6,2,8) / 9 => (1,2,2,8)`
* `(3,6,2,8) / 72 => (1,1,1,4)`
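For concreteness, the left-to-right division illustrated by the examples above can be modeled with a small helper. This is a toy sketch, not the CuTe implementation:
```c++
#include <cassert>
#include <vector>

// Divide a shape by an integer, consuming modes from left to right.
std::vector<int> shape_div(std::vector<int> shape, int divisor) {
  for (int& mode : shape) {
    if (divisor == 1) break;
    if (mode % divisor == 0) { mode /= divisor; divisor = 1; } // divisor fits inside this mode
    else {
      assert(divisor % mode == 0);  // otherwise the division is ill-formed
      divisor /= mode; mode = 1;    // consume the whole mode and keep dividing
    }
  }
  return shape;  // e.g. shape_div({6,2}, 12) == {1,1}
}
```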

View File

@ -10,53 +10,58 @@ For example, we might want to tile a 41 x 55 matrix into 4 x 8 tiles,
but 41 / 4 is 10 remainder 1, and 55 / 8 is 6 remainder 7.
What do we do with those "leftover" parts of the matrix?
Another way to say this, is that `logical_divide`
To start, we note that `logical_divide`
(CuTe's way of tiling layouts) "rounds up."
For example, if `N` is the layout (1000, 1) and `B` is the layout (128, 1),
then `logical_divide(N, B)` is the layout ((128, 8), (1, 128)).
This effectively rounds up the original shape N = 1000
into an 128 x 8 matrix (as if N = 1024).
For example, if `N` is the layout `1000:1` and `B` is the layout `128:1`,
then `logical_divide(N, B)` is the layout `(128, 8):(1, 128)`.
This effectively rounds up the original shape `N = 1000`
into a `128 x 8` matrix (as if `N = 1024`).
What about those last 24 elements,
that aren't part of the original data?
that aren't part of the original data? How is the last tile handled and how do we avoid indexing out-of-bounds?
The idiomatic CuTe way to solve this problem is through "predication."
Rather than trying to reason about the "remainder tiles,"
CuTe instead rounds up, but only tries to access data in each tile
that are part of the matrix.
As in other introductions to CUDA programming, the idiomatic CuTe way to address these issues is through "predication."
Rather than attempting to reason about the "remainder tiles" by trying to represent "7 tiles of size-128 and 1 tile of size-104,"
CuTe instead rounds up to "8 tiles of size-128" and constructs predicates so that the kernel
only tries to access data in each tile that are valid within the matrix.
This corresponds well with how our GPUs optimize:
branches without warp divergence are relatively fast.
It also matches the usual CUDA idiom
when dividing N work items in 1-D fashion over B thread blocks:
first test if "my thread" is out of bounds before doing work.
There are a few ways to figure out
which elements need to be predicated.
In-kernel GEMMs like to do this in the following way.
Consider a generic tiling in which a size-1000 vector is tiled into size-128 chunks. A predicate tensor can then be constructed as follows:
```c++
// Create the predicate tensor
Layout idA = make_layout(shape(A)); // e.g. 1000:1
Layout idAB = logical_divide(idA, B); // e.g. (128,8):(1,128)
Tensor gmem = ... // e.g. size 1000
Tensor smem = ... // e.g. size 128
Tensor pred = make_tensor<bool>(shape(idAB));
// Tile the gmem for smem
Tensor gmem_tiled = logical_divide(gmem, size(smem)); // e.g. (128,8)
// Create an identity layout for gmem and tile it similarly
Layout id_layout = make_layout(shape(gmem)); // e.g. 1000:1, explicitly constructed as identity function
Layout id_tiled = logical_divide(id_layout, size(smem)); // e.g. (128,8):(1,128), but many elements aren't "valid"
// Create a predicate tensor
Tensor pred = make_tensor<bool>(shape(id_tiled)); // e.g. (128,8)
for (int i = 0; i < size(pred); ++i) {
pred(i) = idAB(i) < size(A);
pred(i) = id_tiled(i) < size(id_layout); // Predicate: Is the offset within the original shape?
}
// ... intervening code ...
// Use the predicate tensor. c is some coordinate.
// This code would likely live inside some algorithm.
if (pred(c)) { copy(idAB(c), smem(c)); }
// Note that gmem_tiled, id_tiled, and pred tensors are all congruent
// For tile tile_i, determine if element value_j is in-bounds and copy to smem
if (pred(value_j,tile_i)) { smem(value_j) = gmem_tiled(value_j,tile_i); }
```
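The per-element guard in the last line can equivalently be expressed with `copy_if`, sketched here for a single tile `tile_i` of the tensors constructed above:
```c++
// Copy tile `tile_i` from global to shared memory, masking off out-of-bounds elements.
copy_if(pred(_, tile_i), gmem_tiled(_, tile_i), smem);
```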
The general procedure is that we
1. create an "identity" layout (`Layout idA = make_layout(shape(A))`,
1. create an "identity" layout (`Layout id_layout = make_layout(shape(gmem))`,
in the above example) with the same shape as our original data;
2. repeat the same tiling/partitioning/slicing (possibly rounding up)
on that identity layout (`Layout idAB = logical_divide(idA, B)`);
on that identity layout (`Layout id_tiled = logical_divide(id_layout, size(smem));`);
3. create a "predicate tensor" by comparing the coordinates
of that reference layout with the bounds of the original layout;
@ -64,19 +69,119 @@ The general procedure is that we
4. use the predicate tensor to mask off accesses to out-of-bounds elements.
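Schematically, with placeholder names (`mX` stands for the full data tensor, `tXgX`/`tXsX` for its partitioned global and shared-memory views, and the partitioning call for whatever the kernel already uses), the four steps look like this:
```c++
// 1. Identity/coordinate tensor with the same shape as the original data
Tensor cX   = make_identity_tensor(shape(mX));              // coord -> coord
// 2. Apply exactly the same tiling/partitioning used for the data
Tensor tXcX = local_partition(cX, tX, thread_idx);          // or local_tile, partition_C, ...
// 3. Predicate tensor: compare the carried coordinates against the original bounds
Tensor tXpX = make_tensor<bool>(shape(tXcX));
for (int i = 0; i < size(tXpX); ++i) {
  tXpX(i) = elem_less(tXcX(i), shape(mX));
}
// 4. Mask off out-of-bounds accesses with the predicate
copy_if(tXpX, tXgX, tXsX);
```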
For example, suppose that we've partitioned A and B tiles
across threads as follows.
As a relatively simple example, consider predicating the epilogue of a GEMM.
Suppose that we've partitioned `mC` into cta tiles and across threads of an mma as follows.
```c++
Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k)
Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K,PIPE)
```cpp
// CTA partitioning
auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k)
Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N)
Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k)
Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K,PIPE)
// Thread partitioning
auto thr_mma = mma.get_slice(threadIdx.x);
Tensor tCgC = thr_mma.partition_C(gC); // (MMA,MMA_M,MMA_N)
Tensor tCrC = thr_mma.make_fragment_C(tCgC); // (MMA,MMA_M,MMA_N)
// ... Compute gemms and accumulate into tCrC ...
// axpby epilogue
for (int i = 0; i < size(tCgC); ++i) {
tCgC(i) = alpha * tCrC(i) + beta * tCgC(i);
}
```
`tAgA` and `tBgB` partition the global A resp. B matrices over threads,
and `tAsA` and `tBsB` partition the shared memory tiles of A resp. B over threads.
Then, following the predication procedure is straightforward,
```cpp
// A coordinate tensor the same shape as mC: (m,n) -> (m,n)
Tensor cC = make_identity_tensor(shape(mC));
// Repeat partitioning steps applied to mC to our coordinate tensor cC
// CTA partitioning
Tensor cta_cC = local_tile(cC, cta_tiler, cta_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N) -> (m,n)
// Thread partitioning
Tensor tCcC = thr_mma.partition_C(cta_cC); // (MMA,MMA_M,MMA_N) -> (m,n)
// Predicated axpby epilogue
for (int i = 0; i < size(tCgC); ++i) {
if (elem_less(tCcC(i), shape(mC))) { // if coord is in-bounds
tCgC(i) = alpha * tCrC(i) + beta * tCgC(i);
}
}
```
Above, the cta is responsible for tiling/partitioning `mC` and the mma is responsible for tiling/partitioning `gC`,
so both steps are also applied to the identity tensor.
The coordinate tensor `tCcC` is congruent with the register fragment `tCrC` and the partitioned global memory tensor `tCgC`, which are this thread's subtensors of the tile of data. However, `tCcC` retains its original codomain when evaluated: a global coordinate into the original tensor `mC`. This global coordinate is compared against the shape of `mC` to determine whether the operation is valid.
Advantages of this "reference identity tensor" or "coordinate tensor" approach include:
1. There is no dependence on the layout/strides of the tensor
being predicated, just the logical bounds imposed.
2. The partitioning stage(s) can be anything. A CTA tiling, a thread partitioning, a TiledMMA, and a TiledCopy can all be applied to any tensor, including a coordinate tensor.
3. It naturally extends to any-dimensional predication.
4. It's a natural generalization of a typical CUDA 1-D
parallel vector access pattern,
which computes an access index `idx` and predicates access to the vector's `idx`-th element on whether `idx` is in-bounds.
```cpp
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx < N) // idx is a "coord" into gmem and N is the "bound"
gmem_ptr[idx] = ...;
```
In a SIMT programming model, we do not shrink the tensor extents just to keep loops from overrunning.
Instead, predication is the general method: query each element's original coordinate and check whether that coordinate overruns the original bounds.
This avoids variable/dynamic loop bounds in favor of instruction-level predication, preserving thread coherence and maintaining load balance.
It's also general enough to extend to all ranks, all layouts of threads and data, and all tiling/partitioning patterns.
Assumptions can be built into the coordinate tensors or the predicate tensors to account for special cases.
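To contrast the two approaches with a hypothetical fragment (`my_residual_extent`, `TILE_SIZE`, `coord`, `bound`, and `process` are illustrative names, not CuTe API):
```c++
// Dynamic trip count: each thread/CTA iterates a different number of times,
// which prevents uniform unrolling and lets threads fall out of step.
for (int i = 0; i < my_residual_extent; ++i) { process(i); }

// Fixed trip count plus predication: every thread executes the same loop structure
// and simply masks off the out-of-bounds elements.
CUTE_UNROLL
for (int i = 0; i < TILE_SIZE; ++i) {
  if (elem_less(coord(i), bound)) { process(i); }
}
```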
As another slightly more complex example, consider the m- and n-predication of A and B loads in a GEMM. Suppose that we've partitioned A and B tiles across ctas and threads as follows.
```c++
// CTA partitioning
auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k)
Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k)
Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k)
Tensor sA = make_tensor(make_smem_ptr(smemA), sA_layout); // (BLK_M,BLK_K)
Tensor sB = make_tensor(make_smem_ptr(smemB), sB_layout); // (BLK_N,BLK_K)
// Thread partitioning
Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k)
Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K)
Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k)
Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K)
```
`gA` and `gB` are tiles of `mA` resp. `mB` according to `cta_tiler` and the `cta_coord`.
`tAgA` and `tBgB` are partitions of `gA` resp. `gB` according to the thread layouts `tA` and `tB`
and `thread_idx`.
The following code creates "identity tensors" that map coordinates `(m,k) -> (m,k)` and `(n,k) -> (n,k)`.
```c++
// Coordinate tensors
Tensor cA = make_identity_tensor(shape(mA)); // (m,k) -> (m,k)
Tensor cB = make_identity_tensor(shape(mB)); // (n,k) -> (n,k)
```
Then, the reference tensors are tiled and partitioned
in exactly the same way the `mA` and `mB` tensors were tiled and partitioned
into `tAgA` and `tBgB`.
```c++
// CTA partitioning
Tensor cta_cA = local_tile(cA, cta_tiler, cta_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k) -> (m,k)
Tensor cta_cB = local_tile(cB, cta_tiler, cta_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k) -> (n,k)
// Thread partitioning
Tensor tAcA = local_partition(cta_cA, tA, thread_idx); // (THR_M,THR_K,k) -> (m,k)
Tensor tBcB = local_partition(cta_cB, tB, thread_idx); // (THR_N,THR_K,k) -> (n,k)
```
The following code creates predicate tensors
corresponding to `tAgA` and `tBgB`.
@ -84,166 +189,35 @@ They will be computed once in the prologue.
and will be used to mask off instructions in the inner loop.
```c++
Tensor tApA = make_tensor<bool>(make_shape (size<0>(tAgA), size<1>(tAgA)),
Tensor tApA = make_tensor<bool>(make_shape (size<0>(tAcA), size<1>(tAcA)),
make_stride( Int<1>{}, Int<0>{}));
Tensor tBpB = make_tensor<bool>(make_shape (size<0>(tBgB), size<1>(tBgB)),
Tensor tBpB = make_tensor<bool>(make_shape (size<0>(tBcB), size<1>(tBcB)),
make_stride( Int<1>{}, Int<0>{}));
```
We're only thread-parallelizing over the leftmost (row) dimension,
so we only need to predicate over the leftmost dimension.
Thus, we can make the rightmost (column) stride zero,
since we will never actually address the rightmost dimension.
The following code creates "two-dimensional identity tensors"
that map coordinates (m,k) -> (m,k)
for the tile of data within the thread block.
Here, we make a few assumptions: we only need predicates for one tile of data at a time, and we only predicate the m- and n-modes, handling the k-mode predicates differently.
The m- and n-predicates are treated as constant across every tile and are reused in every iteration of the mainloop.
Thus, we only store the predicates for the m- and n-modes and broadcast them across the k-mode.
When populating the tensors, we carry the same assumption through:
```c++
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
```
The following lines then tile and partition
the two reference tensors
in exactly the same way the data were tiled and partitioned
into `tAsA` and `tBsB`.
```c++
Tensor tAcA = local_partition(cA, tA, thread_idx);
Tensor tBcB = local_partition(cB, tB, thread_idx);
```
Tiling and partitioning affect the offset and domain,
but not the codomain of the tensors,
so we're left with tensors that map `(thr_m,thr_k) -> (m,k)`
where `(thr_m,thr_k)` is this particular thread's subtensor of the tile
and `(m,k)` is the original codomain: a coordinate into the original tile.
The unrolled loops in the code below then compare
the m- and n-coordinates of those tensors with our known maximums
to mask off elements we are not allowed to access.
```c++
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor tAcA = local_partition(cA, tA, thread_idx);
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
Tensor tBcB = local_partition(cB, tB, thread_idx);
// Populate
// Populate the m- and n-predicates
CUTE_UNROLL
for (int m = 0; m < size<0>(tApA); ++m) {
tApA(m,0) = get<0>(tAcA(m,0)) < m_max_coord;
tApA(m,0) = elem_less(get<0>(tAcA(m,0,0)), shape<0>(mA)); // Compare the m-coordinate
}
CUTE_UNROLL
for (int n = 0; n < size<0>(tBpB); ++n) {
tBpB(n,0) = get<0>(tBcB(n,0)) < n_max_coord;
tBpB(n,0) = elem_less(get<0>(tBcB(n,0,0)), shape<0>(mB)); // Compare the n-coordinate
}
```
Those last `for` loops fill in the two predicate tensors.
In this case, we only need to predicate over the leftmost dimension,
so we only address `(m,0)` resp. `(n,0)`.
and only compare the m- and n-coordinates of the 0th k-tile and 0th k-block. The stride-0 broadcasting mode still allows us to treat this data as a predicate tensor for each and every element of the tile to be loaded.
We can then use the predicate tensors in `copy_if`
to copy only the elements for which the corresponding
predicate tensor elements are nonzero.
Finally, we can use the predicate tensors in `copy_if` to copy only the elements for which the corresponding predicate tensor elements are `true`.
```c++
// Prefetch k_tile=0, gate these on k_residue as well
CUTE_UNROLL
for (int k = 0; k < size<1>(tAsA); ++k) {
if (get<1>(tAcA(0,k)) >= -k_residue) { // some other condition on the column index
copy_if(tApA, tAgA(_,k,0), tAsA(_,k,0));
}
}
CUTE_UNROLL
for (int k = 0; k < size<1>(tBsB); ++k) {
if (get<1>(tBcB(0,k)) >= -k_residue) { // some other condition on the column index
copy_if(tBpB, tBgB(_,k,0), tBsB(_,k,0));
}
}
```
Here are some advantages of this "reference tensor" approach.
1. It doesn't depend on the layout/strides of the tensor
being predicated, just the logical bounds being imposed.
2. The partitioning stage can be anything.
3. It naturally extends to any-dimensional predication.
4. It's a natural generalization of a typical CUDA 1-D
parallel vector access pattern,
which computes an access index `k`
(e.g., as `blockDim.x * blockIdx.x + threadIdx.x`)
and then predicates access to the vector's `k`-th element
on whether `k` is in bounds.
As an example of (3), the epilogue predication does exactly the same thing,
```c++
// Repeat with a tensor of coordinates for predication
Tensor cC = make_identity_tensor(make_shape(size<0>(gC), size<1>(gC)));
Tensor tCcC = thr_mma.partition_C(cC);
const bool isBetaZero = (beta == 0);
CUTE_UNROLL
for (int i = 0; i < size(tCrC); ++i) {
if (elem_less(tCcC(i), make_coord(m_max_coord,n_max_coord))) {
tCgC(i) = isBetaZero ? alpha * tCrC(i) : alpha * tCrC(i) + beta * tCgC(i);
}
}
```
but with the mma responsible for the tiling/partitioning `tCcC`
so that the reference subtensor matches the accumulator's subtensor.
Then, the reference subtensor is predicated against the `if` bounds
(in both m- and n-coordinates) inside the `for` loop.
Another way to explain this is that we don't modify the tiles
to give you the "right" extents so that you never overrun.
Instead, we let you query the original coordinate
to see if that coordinate overruns.
This avoids all branching and variable/dynamic loop bounds
(thus maintaining load balance and synchronicity,
both very important in-kernel) in favor of predication.
It's also general enough to extend to all ranks,
all layouts of threads and data,
and all tiling/partitioning patterns.
## Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
// Copy a k_tile from global memory to shared memory
copy_if(tApA, tAgA(_,_,k_tile), tAsA);
copy_if(tBpB, tBgB(_,_,k_tile), tBsB);
```

View File

@ -6,12 +6,12 @@ CuTe DSL
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Introduction <cute_dsl_general/dsl_introduction.rst>
Code Generation <cute_dsl_general/dsl_code_generation.rst>
Control Flow <cute_dsl_general/dsl_control_flow.rst>
JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>

View File

@ -3,10 +3,6 @@
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are offered within our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate

View File

@ -3,10 +3,6 @@
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.

View File

@ -6,10 +6,6 @@
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------

View File

@ -4,11 +4,8 @@
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
Control Flow
==================
.. contents::
:depth: 2
:local:
Overview

View File

@ -3,10 +3,6 @@
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================

View File

@ -4,12 +4,9 @@
.. |DSL| replace:: CuTe DSL
|DSL|
Introduction
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------

View File

@ -2,12 +2,9 @@
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
@ -39,7 +36,7 @@ By default, |DSL| assumes dynamic arguments and tries to infer the argument type
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
def foo(x: cutlass.Int32, y: cutlass.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2

View File

@ -3,11 +3,9 @@
.. _JIT_Caching:
|DSL| JIT Caching
JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------

View File

@ -4,10 +4,6 @@
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
@ -257,8 +253,7 @@ layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their stride are canonicalized to 0.
updated accordingly. For modes with a shape of size 1, the stride is canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
@ -322,10 +317,6 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1, its stride is degenerated to 1,
@ -337,14 +328,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
@ -353,21 +342,18 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)

View File

@ -124,7 +124,8 @@ Technical
License
---------------------
**Q:What is the license for CuTe DSL and the associated GitHub samples?**
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,

View File

@ -3,9 +3,6 @@
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------

View File

@ -42,7 +42,7 @@ Core CuTe DSL Abstractions
- **Atoms**: Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations**: Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.
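A minimal illustration using the CuTe C++ API referenced above (the layout sizes here are arbitrary and chosen only for the example):

.. code-block:: cpp

   #include <cute/tensor.hpp>
   using namespace cute;

   // A Copy_Atom wraps a single hardware copy instruction for a given element type;
   // make_tiled_copy spreads that atom over a 32x8 arrangement of threads,
   // each thread moving one value per instruction.
   TiledCopy tiled_copy = make_tiled_copy(
       Copy_Atom<UniversalCopy<float>, float>{},   // per-thread copy instruction
       Layout<Shape<_32,_8>>{},                    // thread layout within the tile
       Layout<Shape< _1,_1>>{});                   // value layout per thread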
**Pythonic Kernel Expression**

View File

@ -29,3 +29,12 @@ To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter
Recommended Python environment variables for jupyter notebooks
--------------------------------------------------------------
We recommend setting the following environment variable when running jupyter notebooks.
.. code-block:: bash
export PYTHONUNBUFFERED=1