Release v4.0.0 (#2294)

This commit is contained in:
Kihiro Bando
2025-05-13 15:55:29 -04:00
committed by GitHub
parent ad7b2f5e84
commit f115c3f854
299 changed files with 51495 additions and 4413 deletions

View File

@ -0,0 +1,10 @@
.. _blackwell:
Blackwell Specific
==================
.. toctree::
:maxdepth: 2
Blackwell SM100/SM120 GEMMs<blackwell_functionality.md>
Blackwell Cluster Launch Control<blackwell_cluster_launch_control.md>

View File

@ -6,7 +6,7 @@ A GEMM workload usually consists of three phases: prologue, mainloop and epilogu
Consider a GEMM that has `20x20x1` output tiles, running on a GPU with `100` SMs. Another kernel occupies all the resources of `20` SMs, so only `80` SMs can be used. Assume the cluster shape is `1x1x1`. The following diagram shows how the schedule would look for such a kernel.
<p align="center"><img src=../../images/non_persistent.png alt="GEMM tiles are evenly divided among available SMs" title="GEMM Scheduling with Limited SM Resources"></p>
<p align="center"><img src=../images/non_persistent.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
### Static Scheduler
@ -14,7 +14,7 @@ CUTLASS has adopted a software technique named **persistent kernels**. Persisten
However, the static scheduler is susceptible to workload imbalance if the resources of some SMs are unavailable. The following diagram illustrates this issue.
<p align="center"><img src=../../images/persistent_static.png alt="GEMM tiles are unevenly divided among available SMs, leading to workload imbalance" title="Imbalanced Workload Scheduling due to Static Scheduler"></p>
<p align="center"><img src=../images/persistent_static.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
### Dynamic Scheduler with Cluster Launch Control
A fundamental limitation of persistent scheduling is that the number of SMs the kernel can actually utilize is not known ahead of time. Some SMs might be occupied by another kernel, making their resources unavailable. This makes it challenging to load-balance work across SMs.
@ -32,7 +32,7 @@ Cluster launch control follows the below rules:
The following diagram shows how the schedule would look with cluster launch control.
<p align="center"><img src=../../images/persistent_clc.png alt="GEMM tiles are dynamically allocated among available SMs, leading to a balanced workload" title="Dynamic Scheduler with Cluster Launch Control"></p>
<p align="center"><img src=../images/persistent_clc.png alt="A beautiful sunset" title="Sunset over the mountains"></p>
## Programming Model
### Pseudo Code
@ -120,7 +120,7 @@ The CLC pipeline has a depth of 3 to overlap the CLC operations of multiple wave
# Copyright
### Copyright
Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -723,7 +723,7 @@ Specialized policies must be used to generate mixed-input-datatype `mx_float4_t`
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
# Copyright
### Copyright
Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@ Users and developers may build either
in Visual Studio's graphical integrated development environment,
or on the command line with `cmake --build`.
# Software prerequisites
## Software prerequisites
1. Windows 10 or 11
@ -22,7 +22,7 @@ or on the command line with `cmake --build`.
Visual Studio must be installed *before* the CUDA Toolkit.
Otherwise, Visual Studio's build system won't know about CUDA.
# Operating system settings
## Operating system settings
By default, Windows restricts the maximum file path length (`MAX_PATH`) to 260 characters.
CUTLASS has many files and directory paths that challenge this requirement.
@ -48,7 +48,7 @@ before attempting to clone or build CUTLASS.
[This Microsoft help article](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry)
explains different ways to change the registry setting.
# Set up build environment
## Set up build environment
1. Run "git bash" to get a familiar command-line interface
@ -62,7 +62,7 @@ explains different ways to change the registry setting.
Alternate approaches may rely on the CMake GUI and/or Windows' native command line.
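For reference, one possible flow from the "git bash" shell is sketched below; the generator name, architecture flag, and `CUTLASS_NVCC_ARCHS` value are placeholders to adapt to your Visual Studio version and target GPU.

```bash
# A sketch only -- adjust generator, paths, and architectures for your setup.
mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUTLASS_NVCC_ARCHS=90a
cmake --build . --config Release -j 8
```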
# Building
## Building
A successful CMake run will create a `CUTLASS.sln` Visual Studio "solution" file in the build directory.
One can open this in Visual Studio and build the entire solution or any subset of projects as desired.
@ -77,7 +77,7 @@ Unlike with CMake's Makefile or Ninja generators,
`CMAKE_BUILD_TYPE` has no effect on the Visual Studio generator,
because the Visual Studio generator creates all build configurations.
# Tips
## Tips
With Windows builds, one may find that CMake reruns unnecessarily.
For example, cancelling a build and starting it again may rerun CMake.
@ -86,7 +86,7 @@ One work-around is to set the CMake option `CMAKE_SUPPRESS_REGENERATION=ON`.
However, this turns off CMake's ability to detect on its own when it needs to rerun.
As a result, one will need to know when to rerun CMake by hand.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@ Clang as host compiler, and NVCC as device compiler.
This is NOT the same as building with
Clang as both host and device compiler ("CUDA Clang").
# Software prerequisites
## Software prerequisites
1. Clang (regularly tested with Clang 17;
occasionally tested with Clang 10 and greater)
@ -29,9 +29,9 @@ A symptom of not installing all needed dependencies
is the following error when attempting to use clang:
`"/usr/bin/ld: cannot find -lstdc++: No such file or directory"`.
# Running CMake
## Running CMake
## Required CMake options
### Required CMake options
The Clang build requires specifying the following CMake options.
Replace `<path-to-clang++>` with the path to your `clang++` executable.
@ -55,7 +55,7 @@ then one can set `CMAKE_CUDA_COMPILER` as follows.
* `CMAKE_CUDA_COMPILER=${PATH_TO_CUDA_TOOLKIT}/bin/nvcc`
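Putting these together, a configure invocation might look like the sketch below; `CMAKE_CXX_COMPILER` and `CMAKE_CUDA_HOST_COMPILER` are standard CMake variables, and the paths and `CUTLASS_NVCC_ARCHS` value are placeholders to adapt.

```bash
# A sketch only -- substitute your clang++ path and CUDA Toolkit location.
cmake .. \
  -DCMAKE_CXX_COMPILER=<path-to-clang++> \
  -DCMAKE_CUDA_HOST_COMPILER=<path-to-clang++> \
  -DCMAKE_CUDA_COMPILER=${PATH_TO_CUDA_TOOLKIT}/bin/nvcc \
  -DCUTLASS_NVCC_ARCHS=90a
```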
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,10 @@
.. _cpp_build:
Build
=====
.. toctree::
:maxdepth: 1
Building on Windows with Visual Studio<building_in_windows_with_visual_studio.md>
Building with Clang as host compiler<building_with_clang_as_host_compiler.md>

View File

@ -1,6 +1,6 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Code Organization")
# CUTLASS Code Organization
# Code Organization
This document describes the layout of the CUTLASS repository. The main components are:

View File

@ -249,9 +249,7 @@ auto same_r = make_layout(composition(layout<0>(a), get<0>(tiler)),
We often use the `<LayoutA, LayoutB, ...>` notation to distinguish `Tiler`s from the concatenation-of-sublayouts notation `(LayoutA, LayoutB, ...)` that we used previously.
The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
<img src="../../../images/cute/composition1.png" alt="composition1.png" height="250"/>
</p>
![composition1.png](../../../images/cute/composition1.png)
For convenience, CuTe also interprets `Shape`s as tilers. A `Shape` is interpreted as a tuple-of-layouts-with-stride-1:
```cpp
@ -268,9 +266,7 @@ auto tiler = make_shape(Int<3>{}, Int<8>{});
auto result = composition(a, tiler);
```
where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
<img src="../../../images/cute/composition2.png" alt="composition2.png" height="250"/>
</p>
![composition2.png](../../../images/cute/composition2.png)
## Composition Tilers
@ -323,9 +319,7 @@ The `cotarget` parameter above is most commonly an integer -- you can see we onl
* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices.
<p align="center">
<img src="../../../images/cute/complement1.png" alt="complement1.png" height="75"/>
</p>
![complement1.png](../../../images/cute/complement1.png)
As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."
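As a quick check of that example, the sketch below builds `(2,2):(1,6)`, computes its complement with respect to a cosize of `24`, and concatenates the two; the `main` wrapper and printing are only for illustration.

```cpp
#include <cstdio>
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto layout = make_layout(make_shape (Int<2>{}, Int<2>{}),
                            make_stride(Int<1>{}, Int<6>{}));   // (2,2):(1,6)
  auto comp   = complement(layout, Int<24>{});                  // expect (3,2):(2,12)
  auto full   = make_layout(layout, comp);  // ((2,2),(3,2)):((1,6),(2,12)), cosize 24
  print(layout); printf("\n");
  print(comp);   printf("\n");
  print(full);   printf("\n");
}
```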
## Division (Tiling)
@ -371,9 +365,7 @@ This is computed in the three steps described in the implementation above.
* Concatenation of `(B,B*) = (4,(2,3)):(2,(1,8))`.
* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
<img src="../../../images/cute/divide1.png" alt="divide1.png" height="150"/>
</p>
![divide1.png](../../../images/cute/divide1.png)
The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A` shown by each of the colors. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
@ -383,9 +375,7 @@ Using the `Tiler` concept defined above, this immediately generalizes to multidi
Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` to which we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.
<p align="center">
<img src="../../../images/cute/divide2.png" alt="divide2.png" height="450"/>
</p>
![divide2.png](../../../images/cute/divide2.png)
The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A` shown by each of the colors. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or as simply a permutation on the rows and cols.
@ -429,9 +419,7 @@ We note that `logical_divide` preserves the *semantics* of the modes while permu
This is not the case with `zipped_divide`. The mode-0 in the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we still can plot the resulting layout as 2-D as shown below.
<p align="center">
<img src="../../../images/cute/divide3.png" alt="divide3.png" height="450"/>
</p>
![divide3.png](../../../images/cute/divide3.png)
We've kept each tile in the same color as in the previous images for clarity. Clearly, iterating across tiles is now equivalent to iterating across a row of this layout, and iterating over elements within a tile is equivalent to iterating down a column of this layout. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data.
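Both forms of the 2-D example are easy to reproduce; the sketch below builds `A = (9,(4,8)):(59,(13,1))` and the tiler `B = <3:3, (2,4):(1,8)>`, then applies `logical_divide` and `zipped_divide`.

```cpp
#include <cstdio>
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto A = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
                       make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
  auto B = make_tile(make_layout(Int<3>{}, Int<3>{}),              // 3:3 applied to mode-0
                     make_layout(make_shape (Int<2>{}, Int<4>{}),  // (2,4):(1,8) applied to mode-1
                                 make_stride(Int<1>{}, Int<8>{})));
  auto ld = logical_divide(A, B);   // ((TileM,RestM),(TileN,RestN))
  auto zd = zipped_divide (A, B);   // ((TileM,TileN),(RestM,RestN))
  print(ld); printf("\n");
  print(zd); printf("\n");
}
```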
@ -476,9 +464,7 @@ This is computed in the three steps described in the implementation above.
* Composition of `A* = (2,3):(2,8)` with `B = 6:1` is then `(2,3):(2,8)`.
* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
<img src="../../../images/cute/product1.png" alt="product1.png" height="175"/>
</p>
![product1.png](../../../images/cute/product1.png)
The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A`, and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
@ -486,9 +472,7 @@ Note that the result is identical to the result of the 1-D Logical Divide exampl
Of course, we can change the number and order of the tiles in the product by changing `B`.
<p align="center">
<img src="../../../images/cute/product2.png" alt="product2.png" height="175"/>
</p>
![product2.png](../../../images/cute/product2.png)
For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order.
@ -496,9 +480,7 @@ For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated til
We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well.
<p align="center">
<img src="../../../images/cute/product2d.png" alt="product2d.png" height="250"/>
</p>
![product2d.png](../../../images/cute/product2d.png)
The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block tiled across a 3x4 column-major arrangement.
@ -519,17 +501,13 @@ Because `A` is always compatible with mode-0 of the result and `B` is always com
This is exactly what `blocked_product` and `raked_product` do and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.
<p align="center">
<img src="../../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
</p>
![productblocked2d.png](../../../images/cute/productblocked2d.png)
The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 column-major arrangement. Also note that `blocked_product` went ahead and `coalesced` mode-0 for us.
Similarly, `raked_product` combines the modes slightly differently. Instead of the resulting "column" mode being constructed from the `A` "column" mode then the `B` "column" mode, the resulting "column" mode is constructed from the `B` "column" mode then the `A` "column" mode.
<p align="center">
<img src="../../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
</p>
![productraked2d.png](../../../images/cute/productraked2d.png)
This results in the "tile" `A` now being interleaved or "raked" with the "layout-of-tiles" `B` instead of appearing as blocks. Other references call this a "cyclic distribution."

View File

@ -269,9 +269,7 @@ Tensor E = A(make_coord(_,1),make_coord(0,_,1));
Tensor F = A(make_coord(2,_),make_coord(_,3,_));
```
<p align="center">
<img src="../../../images/cute/slice.png" alt="slice.png" height="300"/>
</p>
![slice.png](../../../images/cute/slice.png)
In the image above, a `Tensor` is sliced in various ways and the subtensors generated by those slices are highlighted within the original tensor. Note that tensors `C` and `D` contain the same elements, but have different ranks and shapes due to the use of `_` versus `make_coord(_,_)`. In each case, the rank of the result is equal to the number of `Underscore`s in the slicing coordinate.
@ -327,9 +325,7 @@ Tensor tv = composition(A, tv_layout); // (8,4)
Tensor v = tv(threadIdx.x, _); // (4)
```
<p align="center">
<img src="../../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
</p>
![tv_layout.png](../../../images/cute/tv_layout.png)
The above image is a visual representation of the above code. An arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. The result of the composition is on the right, where each thread's values are arranged across each row. The bottom layout depicts the inverse TV-layout, which maps each 4x8 logical coordinate to the thread id and value id that own it.
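The exact TV-layout from the figure is not reproduced here, but the mechanics are easy to experiment with. The sketch below assumes a much simpler partitioning -- each of 8 threads owns one 4-element column of a 4x8 row-major tile -- purely to show how a data layout is composed with a TV-layout and how the TV-layout can be inverted.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 4x8 row-major data layout (an assumption for this sketch)
  auto A  = make_layout(make_shape(Int<4>{}, Int<8>{}), make_stride(Int<8>{}, Int<1>{}));
  // TV-layout: (thread, value) -> flat 4x8 coordinate; thread t owns column t
  auto tv = make_layout(make_shape(Int<8>{}, Int<4>{}), make_stride(Int<4>{}, Int<1>{}));
  auto thr_view = composition(A, tv);   // (8,4): entry (t,v) is the data offset of thread t's value v
  auto inv      = right_inverse(tv);    // flat 4x8 coordinate -> linearized (thread, value)
  print_layout(thr_view);
  print(inv);
}
```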

View File

@ -208,9 +208,7 @@ Volta architecture implements an HMMA instruction where a group of 8 threads cal
We first take a look at how we would take the ISA semantics of thread and data partitioning for the HMMA instruction, and encode it in a Traits struct. The HMMA NT instruction has the thread-data layout:
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
</p>
![HMMA.8x8x4.NT.png](../../../images/cute/HMMA.8x8x4.NT.png)
### Types
@ -250,9 +248,7 @@ Again, this layout function maps the logical thread id [0,8) of the MMA operatio
Let us look at exactly how the 8 threads within a QP are mapped to the A, B and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is shown the whole QP level view, and on the right is shown the values owned by just thread 0.
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
</p>
![HMMA.8x8x4.quadpair.C.png](../../../images/cute/HMMA.8x8x4.quadpair.C.png)
The metainformation of this single instruction level view is what we want to encode in CuTe. Specifically, the QP level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and `(m, n)` coordinate in the C matrix which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.
@ -320,9 +316,7 @@ In the case of F16 accumulators, the layout is way less complex. Each row of acc
A and B matrix layouts depend on whether the sources are transposed or not. The diagram below shows the thread ID to data ownership map for A and B matrices in the case of NT and TN transposes.
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
</p>
![HMMA.8x8x4.quadpair.AB.png](../../../images/cute/HMMA.8x8x4.quadpair.AB.png)
Let's look at the TN layout for the A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each thread owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping, `(m, k) -> m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)`, which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is:
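In CuTe types, that reasoning corresponds to a layout along the lines of the sketch below (the alias name is illustrative; the exact typedef in the source may differ):

```cpp
// Sketch: 8 threads x 4 values; stride 1 along M (threads), stride 8 along K (values)
using ALayout = Layout<Shape<_8,_4>, Stride<_1,_8>>;
```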
@ -375,17 +369,13 @@ using ThrID = Layout<_128, _1>;
Accumulators are mapped hierarchically in GMMA, starting from the concept of a core matrix and building up to a layout for the whole C matrix tile. Let's look at this core matrix first. We only consider fp16 accumulators here, but the extension to fp32 accumulators is trivial, as we will see later.
Each core matrix has the layout as shown in the diagram below.
<p align="center">
<img src="../../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
</p>
![gmma_coremat_cd_fp16.png](../../../images/cute/gmma_coremat_cd_fp16.png)
As in the Volta examples, the thread IDs are logical only, and which of the four warps they belong to in the warpgroup is not important.
Then GMMA tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below.
<p align="center">
<img src="../../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
</p>
![gmma_wg_n_slice.png](../../../images/cute/gmma_wg_n_slice.png)
With this image, we are again ready to start building the `CLayout` for `SM90_64x128x16_F16F16F16F16_TN` atom. Same as before, we are constructing a mapping between the `(logical_thr_id, logical_val_id) -> (m, n)` coordinate spaces.
@ -452,9 +442,7 @@ Let's start with `SM70_8x8x4_F32F16F16F32_NT`.
MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
</p>
![HMMA.8x8x4.NT_Atom.png](../../../images/cute/HMMA.8x8x4.NT_Atom.png)
The above is equivalent to
```cpp
@ -472,9 +460,7 @@ We can create an object akin to a WMMA by using four of these quadpair MMAs:
Stride<_2,_1>>{}); // 2x2 n-major layout of Atoms
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2.png](../../../images/cute/HMMA.8x8x4.NT_2x2.png)
This `TiledMMA` replicates the `MMA_Atom` across threads as we can see the `T4` and `T8` and `T12` threads in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the atom's partitioning pattern for a new quadpair and this replication follows a `(2,2):(2,1)` layout.
The above represents a 16x16x4 MMA now, but we can immediately expand this "tile size" up to 32x32x4 instead:
@ -485,9 +471,7 @@ The above represents a 16x16x4 MMA now, but we can immediately expand this "tile
Tile<_32,_32,_4>{}); // 32x32x4 tiler
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2_32x32x4.png](../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png)
This `TiledMMA` replicates the previous `TiledMMA` across values instead of threads. We can see the `T0V8` and `T16V8` and `T8V8` values in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the previous `TiledMMA`'s partitioning pattern for a new set of values.
Continuing, we see that there are eight values that `T0` receives from the `A`-matrix. Those reads occur at coordinates
@ -513,9 +497,7 @@ which are separate, but we might prefer them to be next to each other. That is w
_4>{}); // Permutation on K, size 4 identity
print_latex(mma);
```
<p align="center">
<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
</p>
![HMMA.8x8x4.NT_2x2_32Mx32x4.png](../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png)
That layout `(4,4,2):(1,8,4)` is read like a scatter permutation, telling the m-coords of the original image where to go in the new image.
```

View File

@ -334,9 +334,7 @@ These thread layouts are then used to partition the tiles of data in global memo
```
where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`.
<p align="center">
<img src="../../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
</p>
![tC_partitioning.png](../../../images/cute/tC_partitioning.png)
This diagram shows a `tC` layout, highlights two threads in green and blue, shows the projections of the `tC` layout, and finally highlights the subtensors within `sA`, `sB`, and `gC` that `tCsA`, `tCsB`, and `tCgC` represent.
With the data partitioned across the threads, *every thread* can now participate in the compute step by writing
@ -390,9 +388,7 @@ As a first example, lets look at the `TiledCopy` that `gemm_nt` generates.
print_latex(copyA);
```
The easiest way to see what this `TiledCopy` does is to look at the partition pattern in LaTeX.
<p align="center">
<img src="../../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
</p>
![TiledCopyA.png](../../../images/cute/TiledCopyA.png)
On the left is the source-tensor partitioning and on the right is the destination-tensor partitioning. The partition patterns are the same for this case, but there exist PTX instructions which require different patterns in the source and destination. The diagram shows that each thread reads 4x1 `TA` elements and there are 32x8 threads. The `UniversalCopy<uint128_t>` forces the instruction to use a 128-bit copy instruction. If the partition (of `sA` or `gA` in this case) does not result in 4 `TA` elements that can be vectorized to a 128-bit load/store, then CuTe will statically fail with an error message to that effect.
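For context, a `TiledCopy` with this partitioning pattern could be assembled roughly as below; the element type `TA` and the 32x8 thread / 4x1 value layouts follow the description above, but treat this as a sketch rather than the exact code in `gemm_nt`.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

using TA = float;  // assumed element type

// Sketch: 128-bit vectorized copy atom tiled over 32x8 threads with 4x1 values per thread
inline auto make_copyA() {
  return make_tiled_copy(Copy_Atom<UniversalCopy<uint128_t>, TA>{},
                         Layout<Shape<_32,_8>>{},   // thread layout (32x8, column-major)
                         Layout<Shape< _4,_1>>{});  // value layout  (4 elements per thread)
}
```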
To use the `TiledCopy`, the kernel writes
@ -421,9 +417,7 @@ As a first example, lets look at the `TiledMMA` that `gemm_nt` generates.
print_latex(mmaC);
```
The easiest way to see what this `TiledMMA` does is to look at the partition pattern in LaTeX.
<p align="center">
<img src="../../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
</p>
![TiledMmaC.png](../../../images/cute/TiledMmaC.png)
On the left is the A-tensor partitioning, on the top is the B-tensor partitioning, and in the middle is the C-tensor partitioning. Because the `UniversalFMA` is a 1x1x1 MMA instruction, a 16x16x1 tiling of them results in a 16x16x1 `TiledMMA`. Other MMA instructions will have different threads involved and have different instruction sizes. In this case, all threads will read a single element from `A`, `B`, and `C` each.
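Similarly, a `TiledMMA` with this shape could be built along the lines of the sketch below, tiling the 1x1x1 `UniversalFMA` atom over a 16x16x1 thread layout; the element types are assumptions.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

using TA = float; using TB = float; using TC = float;  // assumed element types

// Sketch: a 16x16x1 tiling of the 1x1x1 UniversalFMA atom -> 16x16x1 TiledMMA across 256 threads
inline auto make_mmaC() {
  return make_tiled_mma(UniversalFMA<TC,TA,TB>{},
                        Layout<Shape<_16,_16,_1>>{});  // thread layout across (M,N,K)
}
```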
To use the `TiledMMA`, the kernel writes

View File

@ -8,7 +8,7 @@ What is an `ArithTuple`? Are those tensor strides? What do those mean? What is t
This documentation intends to answer those questions and introduce some of the more advanced features of CuTe.
# Introduction to TMA instructions
## Introduction to TMA instructions
The Tensor Memory Accelerator (TMA) is a set of instructions for copying possibly multidimensional arrays between global and shared memory. TMA was introduced in the Hopper architecture. A single TMA instruction can copy an entire tile of data all at once. As a result, the hardware no longer needs to compute individual memory addresses and issue a separate copy instruction for each element of the tile.
@ -53,9 +53,9 @@ That means that an ordinary CuTe Tensor that stores a GMEM pointer and computes
What do we do?
# Building a TMA Tensor
## Building a TMA Tensor
## Implicit CuTe Tensors
### Implicit CuTe Tensors
All CuTe Tensors are compositions of Layouts and Iterators. An ordinary global memory tensor's iterator is its global memory pointer. However, a CuTe Tensor's iterator doesn't have to be a pointer; it can be any random-access iterator.
@ -83,7 +83,7 @@ This tensor maps logical coordinates to on-the-fly computed integers. Because it
But the TMA doesn't consume pointers or integers; it consumes coordinates. Can we make a tensor of implicit TMA
coordinates for the TMA instruction to consume? If so, then we could presumably also tile, partition, and slice that tensor of coordinates so that we would always have the right TMA coordinate to give to the instruction.
## ArithTupleIterators and ArithTuples
### ArithTupleIterators and ArithTuples
First, we build a `counting_iterator` equivalent for TMA coordinates. It should support
@ -110,7 +110,7 @@ In summary, one creates a TMA descriptor for the *whole global memory tensor*. T
We can now track and offset TMA coordinates with this iterator, but how do we get CuTe Layouts to generate non-integer offsets?
## Strides aren't just integers
### Strides aren't just integers
Ordinary tensors have a layout that maps
a logical coordinate `(i,j)` into a 1-D linear index `k`.
@ -122,7 +122,7 @@ to a TMA coordinate, rather than to a 1-D linear index.
To do this, we can abstract what a stride is. Strides need not be integers, but rather any algebraic object that supports inner-product with the integers (the logical coordinate). The obvious choice is the `ArithmeticTuple` we used earlier since they can be added to each other, but this time additionally equipped with an `operator*` so it can also be scaled by an integer.
### Aside: Integer-module strides
#### Aside: Integer-module strides
A group of objects that support addition between elements and product between elements and integers is called an integer-module.
@ -133,7 +133,7 @@ Rank-R tuples of integers are an integer-module.
In principle, layout strides may be any integer-module.
### Basis elements
#### Basis elements
CuTe's basis elements live in the header file `cute/numeric/arithmetic_tuple.hpp`.
To make it easy to create `ArithmeticTuple`s that can be used as strides, CuTe defines normalized basis elements using the `E` type alias. "Normalized" means that the scaling factor of the basis element is the compile-time integer 1.
@ -172,7 +172,7 @@ Intuitively, "compatible" means that
the nested structure of the two basis elements
matches well enough to add the two elements together.
### Linear combinations of strides
#### Linear combinations of strides
Layouts work by taking the inner product
of the natural coordinate with their strides.
@ -200,7 +200,7 @@ and can be interpreted as the coordinate `((7,4),23)`.
Thus, linear combinations of these strides can be used to generate TMA coordinates.
These coordinates, in turn, can be used to offset TMA coordinate iterators.
## Application to TMA Tensors
### Application to TMA Tensors
Now we can build CuTe Tensors like the one seen in the introduction.
@ -230,7 +230,7 @@ ArithTuple(0,0) o (4,5):(_1@1,_1@0):
(0,3) (1,3) (2,3) (3,3) (4,3)
```
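A tensor like the one printed above could be assembled roughly as follows; the sketch assumes `make_inttuple_iter` is available as the coordinate-counting iterator and uses the basis-element strides described earlier.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Sketch: iterator over TMA coordinates starting at (0,0), with basis-element strides
  auto coords = make_tensor(make_inttuple_iter(0, 0),
                            make_layout(make_shape (Int<4>{}, Int<5>{}),
                                        make_stride(E<1>{},   E<0>{})));  // (4,5):(_1@1,_1@0)
  print_tensor(coords);
}
```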
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -4,7 +4,7 @@ CuTe
====================
.. toctree::
:maxdepth: 2
:maxdepth: 1
00_quickstart<00_quickstart.md>
01_layout<01_layout.md>

View File

@ -0,0 +1,12 @@
.. _cutlass_2_x:
CUTLASS 2.x
==================
.. toctree::
:maxdepth: 2
Layouts and Tensors<layout.md>
GEMM API<gemm_api.md>
Tile Iterator Concepts<tile_iterator_concept.md>
Utilities<utilities.md>

View File

@ -0,0 +1,11 @@
.. _cutlass_3_x:
CUTLASS 3.x
==================
.. toctree::
:maxdepth: 2
Design <cutlass_3x_design.md>
GEMM Backwards Compatibility <cutlass_3x_backwards_compatibility.md>
GEMM API <gemm_api_3x.md>

View File

@ -438,7 +438,7 @@ obtain the kernel's configuration parameters. Users can use these to approximate
for 3.0 API kernels. However, the reflective interfaces cannot always match the types exactly,
as the mappings are not always bijective.
# Copyright
### Copyright
Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -114,7 +114,7 @@ In this way, CuTe reifies the thread-to-data-layout mapping,
making it easier to write code that is "correct by construction".
If the code compiles, it's probably correct.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -277,7 +277,7 @@ CUDA exposes warp-level matrix operations in the CUDA C++ WMMA API. The CUDA C++
| **B** | `RowMajor`, `ColumnMajor` | `RowMajor`, `ColumnMajor` |
| **C** | `RowMajor`, `ColumnMajor` | `RowMajor`, `ColumnMajor` |
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -355,7 +355,7 @@ support on current and future NVIDIA GPUs.
```
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -5,7 +5,7 @@
CUTLASS presents a uniform programming model for matrix multiply-accumulate operations at each level of the hierarchy. This document
focuses on device-level GEMMs, threadblock-level GEMMs, warp-level GEMMs, thread-level GEMMs, and instruction-level GEMMs.
# CUTLASS GEMM Model
## CUTLASS GEMM Model
CUTLASS implements the basic GEMM triple loop nest with a tiled structure mirroring the execution model hierarchy.
@ -62,7 +62,7 @@ warp-synchronous matrix multiply instructions targeting Tensor Cores.
Alternatively, GEMMs targeting single-thread instructions may have an additional series of nested loops corresponding to
thread-level concurrency.
# CUTLASS GEMM Components
## CUTLASS GEMM Components
This loop nest is expressed in CUTLASS via the following components which are specialized for data type, layout, and
math instruction.
@ -71,7 +71,7 @@ math instruction.
These components are described in the following sections.
## Device-wide GEMM API
### Device-wide GEMM API
The device-level GEMM API is intended to streamline instantiation and execution of the standard
GEMM computation across the GPU. This operator is intended to be used in host-side .cu code and
@ -119,7 +119,7 @@ The device-wide GEMM API is embodied by the following operators:
```
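As a concrete illustration, a single-precision device-level GEMM can be instantiated and launched roughly as in the sketch below (column-major operands and default tile sizes are assumptions; error handling is left minimal).

```cpp
#include "cutlass/gemm/device/gemm.h"

// Sketch: SGEMM with all operands column-major, using the device-level operator
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C / D

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const* A, int lda,
                          float const* B, int ldb,
                          float beta, float* C, int ldc) {
  Gemm::Arguments args({M, N, K},       // GEMM problem size
                       {A, lda},        // TensorRef to A
                       {B, ldb},        // TensorRef to B
                       {C, ldc},        // TensorRef to C (source)
                       {C, ldc},        // TensorRef to D (destination)
                       {alpha, beta});  // epilogue scalars
  Gemm gemm_op;
  return gemm_op(args);                 // launches the kernel on the default stream
}
```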
## Threadblock-level GEMM API
### Threadblock-level GEMM API
GEMMs at this scope are expected to efficiently load tiles of data from global memory into internal storage and then compute matrix
products with warp-level GEMM operators.
@ -196,7 +196,7 @@ struct Mma {
};
```
## Warp-level Matrix Multiply API
### Warp-level Matrix Multiply API
Warp-level GEMM operators load tiles from shared memory into registers and then compute matrix multiplies using either
Tensor Cores or CUDA Cores. The result is accumulated in a register tile. Iterators are defined for each
@ -416,7 +416,7 @@ class MmaSimt;
```
## Thread-level GEMM API
### Thread-level GEMM API
Thread-level GEMM operations perform matrix multiply-accumulate on data held in registers. These target CUDA Cores exclusively.
@ -502,7 +502,7 @@ struct Mma;
} // namespace cutlass
```
## Efficient Epilogue
### Efficient Epilogue
CUTLASS GEMM operators perform the MMA followed by an epilogue operation, similar
to cuBLAS. CUTLASS implements an efficient row-major epilogue. Thus, to achieve
@ -529,7 +529,7 @@ of input layouts. Thus, CUTLASS supports the following layout combinations for i
- `{N,T} x {N,T} => {N,T}` - NN, NT, TN, TT GEMM for both row-major and column-major output
## Instruction-level operations
### Instruction-level operations
CUTLASS defines a template-based interface to Tensor Core operations to avoid resorting
to inline PTX.
@ -538,7 +538,7 @@ to inline PTX.
- [mma_sm75.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/mma_sm75.h) - Turing TensorCore operations
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -19,7 +19,7 @@ Device, Kernel, and Collective.
It also briefly discusses the Tiled MMA/Copy and Atom level,
and then refers readers to CuTe's tutorial for more information.
# CUTLASS GEMM Model
## CUTLASS GEMM Model
CUTLASS implements algorithms that express
the classical "triply nested loop" GEMM algorithm
@ -80,7 +80,7 @@ and computes MMAs.
These tiled copy and tiled mma iterations are generally
fully static and get fully unrolled.
# CUTLASS GEMM Components
## CUTLASS GEMM Components
CUTLASS expresses the above loop nest
with the following components which are specialized for
@ -146,7 +146,7 @@ using GemmHandle = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
Towards the end, we also briefly cover CuTe's tiled mma and copy as well as the atom layer APIs,
before redirecting users to CuTe-specific documentation for further details.
## Collective API
### Collective API
A Collective is "the largest collection of threads
onto which mma atoms and copy atoms are tiled."
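For orientation, a collective mainloop is typically obtained through the `CollectiveBuilder`; the sketch below shows one plausible Hopper FP16 configuration -- the tile shape, cluster shape, and alignments are assumptions, and the `Auto` policies let the builder pick a schedule and stage count.

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

// Sketch: FP16 inputs, FP32 accumulation; the builder selects a TMA-based mainloop
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,    // A: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,    // B: type, layout, alignment
    float,                                               // accumulator type
    cute::Shape<cute::_128, cute::_128, cute::_64>,      // CTA tile shape (M,N,K)
    cute::Shape<cute::_1, cute::_1, cute::_1>,           // cluster shape
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```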
@ -670,7 +670,7 @@ please refer to CuTe's tutorial, e.g., the sections on
* [a GEMM example](./cute/0x_gemm_tutorial.md).
# Copyright
### Copyright
Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,16 @@
.. _getting_started:
Getting Started
==================
.. toctree::
:maxdepth: 2
Quickstart<quickstart.md>
IDE Setup<ide_setup.md>
Build<build/index>
Functionality<functionality.md>
Terminology<terminology.md>
Fundamental Types<fundamental_types.md>
Programming Guidelines<programming_guidelines.md>

View File

@ -1,6 +1,6 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Grouped Kernel Schedulers")
# CUTLASS Grouped Kernel Schedulers
# Grouped Kernel Schedulers
CUTLASS's grouped kernel is a persistent kernel which launches multiple problems (e.g., GEMMs, SYR2Ks) within a
single CUDA kernel launch.

View File

@ -118,7 +118,7 @@ This is usually a convenient way to configure projects, but it's not as simple f
clang doesn't understand many of the compiler flags used by nvcc. Hence, for now, we don't recommend using
`compile_commands.json` to configure your CUDA project.
## Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -217,7 +217,7 @@ and `TensorRef` objects for each of the operands whose extents are implied as a
redundant storage of extent quantities, CUTLASS minimizes capacity utilization of precious resources such as constant memory.
This is consistent with BLAS conventions.
# Summary:
## Summary:
The design patterns described in this document form a hierarchy:
* `T *ptr;` is a pointer to a contiguous sequence of elements of type `T`
@ -225,7 +225,7 @@ The design patterns described in this document form a hierarchy:
* `TensorRef<T, Layout> ref(ptr, layout);` is an object pointing to an _unbounded_ tensor containing elements of type `T` and a layout of type `Layout`
* `TensorView<T, Layout> view(ref, extent);` is an object pointing to a _bounded_ tensor containing elements of type `T` and a layout of type `Layout`
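As a small sketch of this hierarchy (a 4x8 row-major `float` matrix is an assumption for illustration):

```cpp
#include "cutlass/layout/matrix.h"
#include "cutlass/matrix_coord.h"
#include "cutlass/tensor_ref.h"
#include "cutlass/tensor_view.h"

// Sketch: wrap a raw pointer in progressively richer views
void describe(float* ptr) {
  cutlass::layout::RowMajor layout(8);                                      // maps (row, col) -> offset with ldm = 8
  cutlass::TensorRef<float, cutlass::layout::RowMajor> ref(ptr, layout);    // unbounded tensor
  cutlass::TensorView<float, cutlass::layout::RowMajor> view(ref, {4, 8});  // bounded by extent 4x8
  view.at({2, 3}) = 1.0f;                                                   // element access through the layout mapping
}
```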
# Appendix: Existing Layouts
### Appendix: Existing Layouts
This section enumerates several existing Layout types defined in CUTLASS.
@ -268,7 +268,7 @@ Permuted Shared Memory Layouts:
- `TensorOpCrosswise<ElementSize>`
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -1,619 +0,0 @@
![ALT](../../images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# Overview
# CUTLASS 3.9.0
_CUTLASS 3.9.0 - March 2025_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes
these "moving parts" into reusable, modular software components abstracted by C++ template
classes. Primitives for different levels of a conceptual parallelization hierarchy
can be specialized and tuned via custom tiling sizes, data types,
and other algorithmic policy. The resulting flexibility simplifies their use
as building blocks within custom kernels and applications.
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
[FP32 emulation via tensor core instruction](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
8b floating point types (e5m2 and e4m3),
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
narrow integer types (4 and 8b signed and unsigned integers),
and binary 1b data types (where architectures allow for the
native support of such data types).
CUTLASS demonstrates optimal matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via
the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
See the [Quick Start Guide](quickstart.md) to get started quickly.
See the [functionality docs](functionality.md) for a more comprehensive
list of kernel-level features, data types, instructions, and minimum CUDA toolkit versions supported by CUTLASS on each GPU
architecture.
# What's New in CUTLASS 3.9
* Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
- Collective mainloops that target:
* [Blockscaled datatypes with support for dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_mma_tma.hpp)
* [Blockscaled datatypes with support for sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_sparse_mma_tma.hpp)
- New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell SM120 epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp) and [full set of EVT fusions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
- [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
- [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
* Set of unit tests that demonstrate the usage of both [sparse](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
* Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
- Enhancement of [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.
- Enhancement of [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.
- Support for [grouped GEMM with blockwise and groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
- Support for [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture.
- Support for [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture.
- Support for [grouped GEMM with blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.
* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
- More detailed introductions and examples to leverage this feature can be found in [profiler.md](./profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
**See the [CHANGELOG](../release_notes.md) for details of all past releases and updates.**
# Performance
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit nearly optimal utilization of peak theoretical throughput. The figure below
shows CUTLASS 3.8's performance as a % of theoretical peak utilization
on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
![ALT](../../images/cutlass-3.8-blackwell-gemm-peak-performance.svg "")
The two figures below show the continual CUTLASS performance improvements
on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since
CUTLASS 3.1.
CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads).
Tensor Core operations are implemented using CUDA's
[mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and
[wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions.
![ALT](../../images/cutlass-3.5.1-gemm-peak-performance.png "")
![ALT](../../images/cutlass-3.5.1-gemm-peak-performance-fp8.png "")
# CuTe
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts
which can be composed with data arrays to represent tensors.
The representation of layouts is powerful enough to represent nearly
everything we need to implement efficient dense linear algebra.
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
[dedicated documentation directory](cute/00_quickstart.md).
# Compatibility
Minimum requirements:
- Architecture: Volta (compute capability 7.0)
- Compiler: Must support at least C++17
- CUDA Toolkit version: 11.4
CUTLASS requires a C++17 host compiler and
performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads).
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.
## Operating Systems
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.3.0 |
| Ubuntu 22.04 | GCC 11.2.0 |
Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
## Hardware
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
|---|---|---|
|NVIDIA V100 Tensor Core GPU |7.0|11.4|
|NVIDIA TitanV |7.0|11.4|
|NVIDIA GeForce RTX 20x0 series |7.5|11.4|
|NVIDIA T4 |7.5|11.4|
|NVIDIA A100 Tensor Core GPU |8.0|11.4|
|NVIDIA A10 |8.6|11.4|
|NVIDIA GeForce RTX 30x0 series |8.6|11.4|
|NVIDIA GeForce RTX 40x0 series |8.9|11.8|
|NVIDIA L40 |8.9|11.8|
|NVIDIA H100 Tensor Core GPU |9.0|11.8|
|NVIDIA H200 Tensor Core GPU |9.0|11.8|
|NVIDIA B200 Tensor Core GPU |10.0|12.8|
|NVIDIA GeForce RTX 50x0 series |10.0|12.8|
## Target Architecture
In general, PTX code generated for one target architecture can be run on future architectures
(i.e., it is forward compatible).
However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose
PTX does not have forward compatibility guarantees.
Several Hopper and Blackwell PTX instructions fall under this category of
architecture-accelerated features, and thus require an `sm_90a` or `sm_100a` target architecture
(note the "a" appended). For more details on this and other architecture-accelerated instructions,
please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
The target architecture information is passed on to CUTLASS via the cmake flag
`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,
users are required to build CUTLASS with `90a` as the target architecture.
If a user accidentally builds a kernel which uses SM90a features
(e.g. Hopper Tensor Core Instructions), using the SM90 target
(note the lack of "a"), with either CUDA Toolkit 12 or 11.8,
the kernel is expected to fail with a runtime error.
```
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
```
Or
```
cmake .. -DCUTLASS_NVCC_ARCHS="100a"
```
Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
products has a different compute capability than the one underpinning
NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
compiled for Blackwell SM100 architecture with arch conditional features
(using `sm100a`) are not compatible with RTX 50 series GPUs.
Please refer to the [functionality documentation](functionality.md)
for details on which kernels require which target architectures.
# Documentation
CUTLASS is described in the following documents and the accompanying
[Doxygen documentation](https://nvidia.github.io/cutlass).
- [Quick Start Guide](quickstart.md) - basics of building and running CUTLASS
- [Functionality](functionality.md) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
- [CUTLASS 3.x Design](cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
- [GEMM API 3.x](gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
- [GEMM API 2.x](gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
- [Implicit GEMM Convolution](implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
- [Code Organization](code_organization.md) - describes the organization and contents of the CUTLASS project
- [Terminology](terminology.md) - describes terms used in the code
- [Programming Guidelines](programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
- [Fundamental types](fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- [Layouts](layout.md) - describes layouts of matrices and tensors in memory
- [Tile Iterators](tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Profiler](profiler.md) - command-line driven profiling application
- [CUTLASS Utilities](utilities.md) - additional templates used to facilitate rapid development
- [Dependent kernel launch](dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
kernels in the same stream, and how it is used in CUTLASS.
# Resources
We have also described the structure of an efficient GEMM in our talk at the
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
- [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)
# Building CUTLASS
CUTLASS is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS's `include/` directory in their include
paths.
CUTLASS unit tests, examples, and utilities can be built with CMake.
The minimum version of CMake is given in the [Quickstart guide](quickstart.md).
Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
on your system.
```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
```
Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
To reduce compile time you can specify
the architectures to build CUTLASS for by changing the CMake configuration setting
`CUTLASS_NVCC_ARCHS`.
```bash
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 # compiles for NVIDIA's Ampere Architecture
```
From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
and they may be executed in parallel via make's `-j` command line argument.
```bash
$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[ PASSED ] 946 tests.
```
All tests should pass on supported platforms, though the exact number of tests may vary over time.
# Project Structure
CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes,
and template concepts defined in the CUTLASS project.
A detailed explanation of the source code organization may be found in the
[CUTLASS documentation](code_organization.md), but several main components are summarized below.
## CUTLASS Template Library
```
include/ # client applications should target this directory in their build's include paths
cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
arch/ # direct exposure of architecture features (including instruction-level GEMMs)
conv/ # code specialized for convolution
epilogue/ # code specialized for the epilogue of gemm/convolution
gemm/ # code specialized for general matrix product computations
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
platform/ # CUDA-capable Standard Library components
reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" model
thread/ # simt code that can be performed within a CUDA thread
transform/ # code specialized for layout, type, and domain transformations
* # core vocabulary types, containers, and basic numeric operations
cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples
arch/ # Bare bones PTX wrapper structs for copy and math instructions
atom/ # Meta-information either link to or built from arch/ operators
mma_atom.hpp # cute::Mma_Atom and cute::TiledMma
copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy
*sm*.hpp # Arch specific meta-information for copy and math operations
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
```
### CUTLASS SDK Examples
[CUTLASS SDK examples](https://github.com/NVIDIA/cutlass/tree/main/examples) apply CUTLASS templates to implement basic computations.
### Tools
```
tools/
library/ # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
include/
cutlass/
library/
profiler/ # CUTLASS Profiler - command-line utility for executing operations in the
# CUTLASS Library
util/ # CUTLASS Utilities - contains numerous helper classes for
include/ # managing tensors in device memory, reference
cutlass/ # implementations for GEMM, random initialization
util/ # of tensors, and I/O.
```
### Test
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
Instructions for building and running the Unit tests are described in the [Quickstart guide](quickstart.md).
# Performance Profiling
The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
It can be built as follows:
```bash
$ make cutlass_profiler -j16
```
## Building all GEMM and Convolution kernels (_long_ build times)
By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all, set the following environment variable when running CMake from an empty `build/` directory.
Beware, this results in *tens of thousands* of kernels and long build times.
This would also result in a large binary size and, on some platforms, cause the linker to fail when building the library.
Therefore, it's highly recommended to generate only a subset of kernels as demonstrated in the sub-section below.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
...
$ make cutlass_profiler -j16
```
## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
To compile exactly one kernel or a small set of kernels, pass a comma-delimited list of kernel names,
optionally containing wildcard characters, to reduce the set of kernels. The following examples show how to build exactly one
kernel or a subset of kernels for the NVIDIA Ampere and Turing architectures:
### Building a subset of Tensor Core GEMM kernels
To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
```bash
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 118489088 bytes
FLOPs: 115992428544 flops
Runtime: 1.55948 ms
Memory: 70.7616 GiB/s
Math: 74378.8 GFLOP/s
=============================
...
```
### Building one CUDA Core GEMM kernel
To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16
```
An example command line for profiling a single SGEMM CUDA kernel is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
Status: Success
Verification: ON
Disposition: Passed
cuBLAS: Passed
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 180355072 bytes
FLOPs: 115992428544 flops
Runtime: 6.73655 ms
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
=============================
```
### Building a subset of Tensor Core Convolution kernels
To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
Bytes: 1130659840 bytes
FLOPs: 118482796544 flops
Runtime: 0.711496 ms
Memory: 1479.99 GiB/s
Math: 166526 GFLOP/s
=============================
...
```
### Building one Convolution CUDA kernel
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the following cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16
```
Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 2055798784 bytes
FLOPs: 118482796544 flops
Runtime: 7.34266 ms
Memory: 260.752 GiB/s
Math: 16136.2 GFLOP/s
=============================
```
## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
- [GEMM CMake Examples](quickstart.md#gemm-cmake-examples)
- [Implicit GEMM convolution CMake Examples](quickstart.md#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](profiler.md)
# About
CUTLASS is released by NVIDIA Corporation as Open Source software under the
[3-clause "New" BSD license](LICENSE.txt).
# Contributors
The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).
# Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

View File

@ -45,7 +45,7 @@ compile or fail to launch at runtime.
```bash
$ cmake .. \
-DCUTLASS_NVCC_ARCHS="90a" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_s64x64x16gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
-DCUTLASS_UNITY_BUILD_ENABLED=ON
```
@ -525,7 +525,7 @@ To best illustrate this naming convention, we will walk through the meaning of e
in a GEMM kernel used by the profiler:
```
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f32_{optional-mixed-dtype-config}_128x128x64_2x1x1_0_ntn_align8
cutlass3x_sm90_tensorop_gemm_f16_f16_f32_f16_f32_{optional-mixed-dtype-config}_128x128x64_2x1x1_0_ntn_align8
```
The components within this name are as follows:
@ -553,7 +553,7 @@ Note that in some special cases where the input A/B types do not match that of t
instruction's, the MMA facing input type is added to the instruction string as well.
```
cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
cutlass3x_sm90_tensorop_tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
```
* `s64x128x8tf32gemm`: indicates that the MMA consumes inputs in `tf32` format, and therefore
@ -563,7 +563,7 @@ For custom mainloop or epilogue schedules, details of the opted-in schedule are
kernel name. For example,
```
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma
cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma
```
* `warpspecialized_cooperative`: Mainloop employs a persistent warp-specialized mainloop and kernel schedule.

View File

@ -1157,7 +1157,7 @@ has shape `((X, Y), K)` and stride `((1, X), X*Y)`.
`get<0>(stride)` is the tuple `(1, X)`, not a single integer.
However, A is certainly M major if interpreted as a matrix.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -462,7 +462,7 @@ int main(int argc, char const **args) {
}
```
# CUTLASS Library
## CUTLASS Library
The [CUTLASS Library](https://github.com/NVIDIA/cutlass/tree/main/tools/library) defines an API for managing and executing collections of compiled
kernel instances and launching them from host code without template instantiations in client code.
@ -585,7 +585,7 @@ int main() {
}
```
# Example CMake Commands
## Example CMake Commands
To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
@ -750,7 +750,7 @@ are needed in the mainloop builder:
We encourage a user to refer to Sm100 unit tests and the generated profiler-based kernels as more comprehensive samples.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -78,7 +78,10 @@ replaced by [MMA and Copy atoms from CuTe](cute/0t_mma_atom.md).
**Thread Map**: abstraction for defining how threads are mapped to a given tile. Deprecated starting CUTLASS 3.0.
Replaced by `cute::Layout` in equivalent usage scenarios to represent thread tensors.
# Copyright
[comment]: <> (Don't remove this. This "##" is to prevent Sphinx from throwing build WARNING.)
##
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -469,7 +469,7 @@ struct WriteableReadableRandomAccessContiguousTileIteratorConcept {
};
```
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -431,7 +431,7 @@ Additional information may appear at the end of each line, such as shared memory
Please note that `synclog` is an experimental feature, and its functionality is not always guaranteed. We encourage its use in custom kernels and CUTLASS examples, though it is known to be incompatible with profiler kernels.
# Copyright
### Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

View File

@ -0,0 +1,18 @@
.. _cute_dsl:
CuTe DSL
========
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>
Educational Notebooks <cute_dsl_general/notebooks.rst>

View File

@ -0,0 +1,12 @@
.. _cute_dsl_api:
CuTe DSL API
============
.. toctree::
:maxdepth: 1
cute <cute_dsl_api/cute.rst>
cute_arch <cute_dsl_api/cute_arch.rst>
cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>
utils <cute_dsl_api/utils.rst>

View File

@ -0,0 +1,11 @@
.. _cute:
cutlass.cute
============
.. automodule:: cutlass.cute
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,24 @@
.. _cute_arch:
cutlass.cute.arch
=================
The ``cute.arch`` module contains wrappers around NVVM-level MLIR Op builders that seamlessly
inter-operate with the Python types used in CUTLASS Python. Another benefit of wrapping these Op
builders is that the source location can be tracked with the ``@dsl_user_op`` decorator. Available
functions include
- basic API like ``thr_idx``;
- functions related to the direct management of mbarriers;
- low-level SMEM management (prefer using the ``SmemAllocator`` class);
- TMEM management.
API documentation
-----------------
.. automodule:: cutlass.cute.arch
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,18 @@
.. _cute_nvgpu:
cutlass.cute.nvgpu
==================
The ``cute.nvgpu`` module contains MMA and Copy Operations as well as Operation-specific helper
functions. The arch-agnostic Operations are exposed at the top-level while arch-specific Operations
are grouped into submodules like ``tcgen05``.
.. toctree::
:maxdepth: 2
:hidden:
cute_nvgpu_common
cute_nvgpu_warp
cute_nvgpu_warpgroup
cute_nvgpu_cpasync
cute_nvgpu_tcgen05

View File

@ -0,0 +1,9 @@
.. _cute_nvgpu_common:
Common
======
.. automodule:: cutlass.cute.nvgpu
:members:
:undoc-members:
:show-inheritance:

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_cpasync:
cpasync submodule
=================
.. automodule:: cutlass.cute.nvgpu.cpasync
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_tcgen05:
tcgen05 submodule
=================
.. automodule:: cutlass.cute.nvgpu.tcgen05
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_warp:
warp submodule
==============
.. automodule:: cutlass.cute.nvgpu.warp
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,10 @@
.. _cute_nvgpu_warpgroup:
warpgroup submodule
===================
.. automodule:: cutlass.cute.nvgpu.warpgroup
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__

View File

@ -0,0 +1,9 @@
cutlass.utils
=============
.. automodule:: cutlass.utils
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:

View File

@ -0,0 +1,154 @@
.. _autotuning_gemm:
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are offered within our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate
kernel parameters based on the inputs of real applications.
Next, we'll briefly introduce some tips on how to perform auto-tuning.
The auto-tuning process typically involves the following steps:
1. Define search space
2. Benchmark each configuration and select the kernel with the best performance
3. Enable caching to reduce the tuning cost
The search space defines the valid combinations of kernel parameters that can be used to run the kernels.
Different inputs (shapes, data types, etc.) typically require different kernel parameters to achieve optimal performance.
The search space depends on the kernel. Taking the Blackwell persistent GEMM kernel as an example, the search space
consists of the following parameters (a small sketch in plain Python follows the list):
- ``mma_tiler_mn``: Defines the dimensions of the matrix tile that each Matrix Multiply-Accumulate (MMA) instruction processes in a single operation.
- ``cluster_shape_mn``: Specifies the number of CTAs along each dimension within a cluster. Refer to the `Parallel Thread Execution ISA documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tensorcore-5th-generation-family-instructions>`_ for the possible MMA tiler sizes and cluster shapes for different tensor data types.
- ``use_2cta_instrs``: Whether to utilize Blackwell's 2 CTA instructions for MMA/Copy.
- ``use_tma_store``: Whether to use Tensor Memory Access (TMA) instructions to store the result back to global memory.
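For illustration, such a search space can be written down as plain Python lists that feed the tuning loop shown further below. The concrete values here are only a sketch, not a validated set for any particular GPU or data type.

.. code-block:: python

    # Illustrative search space for the Blackwell persistent dense GEMM kernel.
    # Which values are legal depends on the data types and the PTX ISA rules
    # referenced above, so treat this as a placeholder to be adapted.
    search_space = {
        "use_2cta_instrs_list": [False, True],
        "use_tma_store_list": [True],
        "mma_tiler_m_list": [128, 256],
        "mma_tiler_n_list": [128, 256],
        "cluster_shape_m_list": [1, 2],
        "cluster_shape_n_list": [1, 2],
    }

The dictionary can then be expanded directly into the ``autotune_gemm`` call shown below, e.g. ``autotune_gemm(a, b, c, stream, **search_space)``.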
After defining the search space, we could traverse all parameter combinations to find the optimal kernel.
The ``autotune_gemm`` function below demonstrates a simple exhaustive search approach - it iterates
through configurations, compiles and benchmarks each kernel, and returns the best performing one.
Since kernel compilation incurs overhead, it's important to cache and reuse compiled kernels
to minimize host launch latency. CuTe DSL facilitates this through its separate compilation
and execution workflow. More details can be found in :ref:`JIT_Caching`.
As demonstrated in the ``autotune_gemm`` function
(between the ``begin of cache the compiled GEMM kernel`` and ``end of cache the compiled GEMM kernel`` comments),
we can use ``cute.compile()`` to compile a kernel once, cache the compiled result, and reuse the cached JIT executor for multiple kernel
executions. We could maintain a global configuration-to-kernel dictionary (``config_kernel_dict``) to cache the compiled GEMM kernels,
where each key (``kernel_cache_key``) uniquely identifies a kernel based on its characteristics.
Usually we could use the {dtype + kernel configs} combination as the cache key for GEMM compilation. For example,
.. code-block:: python
kernel_cache_key = f"{ab_dtype}x{c_dtype}x{acc_dtype}x{use_2cta_instrs}x{mma_tiler}x{cluster_shape_mn}x{use_tma_store}"
If the input tensor's layout is static, we should add the shape to the cache key as well.
Users can customize the ``benchmark`` function to measure kernel execution time; a minimal sketch is shown after the list below.
For stable and reliable performance measurements:
1. Run a few warmup iterations (e.g., 5-10) to stabilize GPU temperature
2. Execute multiple timed iterations (e.g., 100-1000) for statistical significance
3. Use CUDA events and synchronization for precise timing
4. Lock GPU frequencies (SM and memory frequencies) with nvidia-smi
5. Process results by removing outliers and using min/avg statistics as measurements.
This ensures reliable kernel selection through proper benchmarking.
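A minimal sketch of such a ``benchmark`` helper is shown below. It assumes PyTorch is available and uses ``torch.cuda.Event`` for device-side timing; the warmup and iteration counts are illustrative, and outlier filtering is omitted for brevity.

.. code-block:: python

    import torch

    def benchmark(launch_fn, warmup_iterations: int = 10, iterations: int = 100) -> float:
        """Return the average execution time of ``launch_fn()`` in milliseconds."""
        # Warmup iterations to stabilize clocks and populate caches
        for _ in range(warmup_iterations):
            launch_fn()
        torch.cuda.synchronize()

        # Time a batch of iterations with CUDA events and report the average
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iterations):
            launch_fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iterations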
.. code-block:: python
# get the best GEMM kernel for given input tensors
def autotune_gemm(
a: cute.Tensor,
b: cute.Tensor,
c: cute.Tensor,
stream: cuda.CUstream,
use_2cta_instrs_list: List[bool] = [True],
use_tma_store_list: List[bool] = [True],
mma_tiler_m_list: List[int] = [256],
mma_tiler_n_list: List[int] = [256],
cluster_shape_m_list: List[int] = [2],
cluster_shape_n_list: List[int] = [1],
):
best_kernel = None
min_time = float("inf")
# traverse the search space
for use_2cta_instrs in use_2cta_instrs_list:
for use_tma_store in use_tma_store_list:
for mma_tiler_mn in product(mma_tiler_m_list, mma_tiler_n_list):
for cluster_shape_mn in product(cluster_shape_m_list, cluster_shape_n_list):
acc_dtype = cutlass.Float32
hardware_info = cutlass.utils.HardwareInfo()
max_active_clusters = hardware_info.get_max_active_clusters(
cluster_shape_mn[0] * cluster_shape_mn[1]
)
# instance a GEMM kernel
gemm = PersistentDenseGemmKernel(
acc_dtype,
use_2cta_instrs,
mma_tiler_mn,
cluster_shape_mn,
use_tma_store,
)
# begin of cache the compiled GEMM kernel
if kernel_cache_key not in config_kernel_dict:
# compile gemm kernel
compiled_gemm = cute.compile(
gemm,
a,
b,
c,
max_active_clusters,
stream,
)
config_kernel_dict[kernel_cache_key] = compiled_gemm
else:
compiled_gemm = config_kernel_dict[kernel_cache_key]
# end of cache the compiled GEMM kernel
try:
# define a benchmark function to measure the execution time of the compiled GEMM kernel
cur_time = benchmark(
partial(compiled_gemm, a, b, c, stream),
)
except Exception as e:
print(f"Execution error: {e}")
cur_time = float("inf")
if cur_time < min_time:
min_time = cur_time
best_kernel = compiled_gemm
if best_kernel is None:
raise ValueError("No best kernel found")
return best_kernel
This brute-force approach ensures that we find the optimal parameters, though at the cost of trying every possibility.
For more advanced use cases, users can explore sophisticated optimization
techniques like search space pruning and genetic algorithms to reduce tuning overhead and discover better
configurations more efficiently.
To further optimize tuning performance, we can utilize caching mechanisms to avoid redundant computations.
We could cache the tuning results in an input-to-kernel dictionary (e.g., ``input_kernel_dict``).
When processing inputs with matching ``config_key`` values, the cached kernel can be reused directly without re-tuning.
The ``config_key`` is derived from the input tensor's characteristics, such as its shape and data type.
The choice of ``config_key`` is flexible; users can customize it based on their own application.
For instance, if the data type is fixed in the application, we could use the input tensor's shape as the key, i.e., ``(m, n, k)``.
To further reduce tuning overhead, we could consider using a simplified key like ``config_key = (power_of_2(m), power_of_2(n), power_of_2(k))``,
where ``m``, ``n``, and ``k`` are rounded up to the nearest power of 2. This simplification can significantly reduce the number
of unique keys while still maintaining good performance in most cases. However, it's important to validate that this
approximation doesn't negatively impact performance for your specific use case.
.. code-block:: python
config_key = (m, n, k)
if config_key in input_kernel_dict:
compiled_gemm = input_kernel_dict[config_key]
else:
compiled_gemm = autotune_gemm(...)
input_kernel_dict[config_key] = compiled_gemm
# launch gemm kernel
compiled_gemm(a_tensor, b_tensor, c_tensor, stream)
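A minimal sketch of the ``power_of_2`` rounding helper mentioned above (the helper name and the exact rounding rule are illustrative):

.. code-block:: python

    def power_of_2(x: int) -> int:
        """Round a positive extent up to the nearest power of two."""
        return 1 << (x - 1).bit_length()

    # Problem shapes that differ only slightly now share one tuning result,
    # e.g. (1000, 4096, 511) and (900, 4096, 500) both map to (1024, 4096, 512).
    config_key = (power_of_2(1000), power_of_2(4096), power_of_2(511))
    assert config_key == (1024, 4096, 512)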
By following the methods above, you can customize your own auto-tuner to find the optimal GEMM kernel configuration
for specific matrix dimensions and data types, significantly improving computational performance for models.

View File

@ -0,0 +1,133 @@
.. _debugging:
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.
Getting Familiar with the Limitations
-------------------------------------
Before diving into comprehensive debugging capabilities, it's important to understand the limitations of CuTe DSL.
Understanding these limitations will help you avoid potential pitfalls from the start.
Please refer to :doc:`../limitations` for more details.
DSL Debugging
-------------
CuTe DSL provides built-in logging mechanisms to help you understand the code execution flow and
some of the internal state.
Enabling Logging
~~~~~~~~~~~~~~~~
CuTe DSL provides environment variables to control the logging level:
.. code:: bash
# Enable console logging (default: False)
export CUTE_DSL_LOG_TO_CONSOLE=1
# Log to file instead of console (default: False)
export CUTE_DSL_LOG_TO_FILE=my_log.txt
# Control log verbosity (0, 10, 20, 30, 40, 50, default: 10)
export CUTE_DSL_LOG_LEVEL=20
Log Categories and Levels
~~~~~~~~~~~~~~~~~~~~~~~~~
Similar to standard Python logging, different log levels provide varying degrees of detail:
+--------+-------------+
| Level | Description |
+========+=============+
| 0 | Disabled |
+--------+-------------+
| 10 | Debug |
+--------+-------------+
| 20 | Info |
+--------+-------------+
| 30 | Warning |
+--------+-------------+
| 40 | Error |
+--------+-------------+
| 50 | Critical |
+--------+-------------+
Dump the generated IR
~~~~~~~~~~~~~~~~~~~~~
For users familiar with MLIR and compilers, CuTe DSL supports dumping the Intermediate Representation (IR).
This helps you verify whether the IR is generated as expected.
.. code:: bash
# Dump Generated CuTe IR (default: False)
export CUTE_DSL_PRINT_IR=1
# Keep Generated CuTe IR in a file (default: False)
export CUTE_DSL_KEEP_IR=1
Kernel Functional Debugging
----------------------------
Using Python's ``print`` and CuTe's ``cute.printf``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CuTe DSL programs can use both Python's native ``print()`` as well as our own ``cute.printf()`` to
print debug information during kernel generation and execution. They differ in a few key ways:
- Python's ``print()`` executes at compile time only (it has no effect on the generated kernel) and is
typically used for printing static values (e.g., a fully static layout).
- ``cute.printf()`` executes at runtime on the GPU itself and changes the PTX being generated. This
can be used for printing values of tensors at runtime for diagnostics, but comes at a performance
overhead similar to that of `printf()` in CUDA C.
For detailed examples of using these functions for debugging, please refer to the associated
notebook referenced in :doc:`notebooks`.
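As a minimal sketch of the difference (the tensor values are arbitrary; see the notebook above for complete examples):

.. code:: python

    import torch
    import cutlass.cute as cute
    from cutlass.cute.runtime import from_dlpack

    @cute.jit
    def debug_print(tensor: cute.Tensor):
        # Compile-time print: runs while the code is being generated
        print("layout:", tensor.layout)
        # Runtime print: emitted into the generated code, similar to printf() in CUDA C
        cute.printf("first element: {}", tensor[0])

    a = torch.tensor([1, 2, 3], dtype=torch.int32)
    debug_print(from_dlpack(a))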
Handling Unresponsive/Hung Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a kernel becomes unresponsive and ``SIGINT`` (``CTRL+C``) fails to terminate it,
you can follow these steps to forcefully terminate the process:
1. Use ``CTRL+Z`` to suspend the unresponsive kernel
2. Execute the following command to terminate the suspended process:
.. code:: bash
# Terminate the most recently suspended process
kill -9 $(jobs -p | tail -1)
CuTe DSL can also be debugged using standard NVIDIA CUDA tools.
Using Compute-Sanitizer
~~~~~~~~~~~~~~~~~~~~~~~
For detecting memory errors and race conditions:
.. code:: bash
compute-sanitizer --some_options python your_dsl_code.py
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.
Conclusion
----------
This page covered several key methods for debugging CuTe DSL programs. Effective debugging typically requires a combination of these approaches.
If you encounter issues with DSL, you can enable logging and share the logs with the CUTLASS team as a GitHub issue to report a bug.

View File

@ -0,0 +1,90 @@
.. _dsl_code_generation:
.. |DC| replace:: dynamic compilation
.. |DSL| replace:: CuTe DSL
.. |IR| replace:: intermediate representation (IR)
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------
1.1 AST rewrite
^^^^^^^^^^^^^^^^
The function's abstract-syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs. Computation inside each region is left untouched at this stage.
*Advantages*
* Sees the entire program, so every branch and loop is preserved.
* Keeps loop structure intact for optimisations such as tiling, vectorisation, or GPU thread mapping.
*Disadvantages*
* Requires a well-defined Python subset that the rewriter understands.
1.2 Tracing
^^^^^^^^^^^
The decorated function is executed once with *proxy* arguments; overloaded
operators record every tensor operation that actually runs and produce a flat
trace that is lowered to |IR|.
*Advantages*
* Near-zero compile latency, ideal for straight-line arithmetic.
* No need to parse Python source, so it supports many of Python's dynamic features.
*Disadvantages*
* Untaken branches vanish, so the generated kernel may be wrong for other
inputs.
* Loops are flattened to the iteration count observed during tracing.
* Data-dependent control-flow freezes to a single execution path.
2. |DSL| Code-Generation Modes
------------------------------
CuTes Python front-end combines the techniques above into **two mutually
exclusive modes**, selectable with the ``preprocessor`` flag of the
``@jit`` decorator:
1. Tracing mode (``@jit(preprocess=False)``): tracing only.
This results in the fastest compilation path and is recommended only for kernels that are guaranteed to be
straight-line arithmetic. It suffers from all tracing limitations listed in the previous section.
2. Preprocessor mode (**default**, ``@jit(preprocess=True)``): **AST rewrite + tracing**.
The AST pass captures every loop and branch, eliminating the correctness and
optimisation problems of pure tracing; tracing then fills in the arithmetic.
This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
specifically to overcome the disadvantages identified above.
.. figure:: dsl_modes.png
:width: 400
:align: center
*Left*: tracing mode records only the path that executed.
*Right*: preprocessor mode emits structured |IR| for every branch and loop
before tracing the arithmetic.
Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **Branch loss**: The untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling**: Loops are flattened to the iteration count observed, destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths**: Control-flow that depends on tensor values freezes to a single execution path at trace time.
The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.

View File

@ -0,0 +1,140 @@
.. _dsl_control_flow:
.. |DC| replace:: dynamic compilation
.. |IR| replace:: intermediate representation (IR)
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
==================
.. contents::
:depth: 2
:local:
Overview
--------
|DSL| walks Pythons AST and converts each control-flow construct it finds into
structured |IR|. You can therefore write ordinary Python loops and branches
while the compiler decides—statement by statement—whether to
* **evaluate at compile time** if the controlling value is a |Constexpr|, or
* **emit intermediate representation (IR)** when the value is dynamic.
For a high-level discussion of the overall pipeline, see
:doc:`the code-generation overview <dsl_code_generation>`.
For Loops
---------
|DSL| recognises three kinds of ranges for ``for`` loops:
* ``range``: the Python built-in
* ``cutlass.range_dynamic``: always lowers to |IR|
* ``cutlass.range_constexpr``: always unrolls at compile time
range(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The AST rewriter inserts a small helper stub. At runtime the loop bounds are
inspected:
* **Constant bounds** → the loop is unrolled at compile time.
* **Dynamic bounds** → the loop is emitted as structured |IR|.
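A small sketch contrasting the two cases:

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    @cute.jit
    def range_example(n: cutlass.Int32):
        # Constant bounds: unrolled at compile time
        for i in range(4):
            cute.printf("static {}", i)
        # Dynamic bound: emitted as a structured loop in the IR
        for j in range(n):
            cute.printf("dynamic {}", j)

    range_example(8)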
cutlass.range_dynamic(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use when you *always* want a loop in the generated |IR|, even if the bounds
look constant.
cutlass.range_constexpr(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Runs in the Python interpreter and is fully unrolled before code generation.
All loop indices must be |Constexpr|.
Limitations of Dynamic For Loops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Early exit via ``break``, ``continue``, or raising an exception is not yet supported.
* Operations in the loop body are traced only when tracing is active in that
region.
**Example:**
.. code-block:: python
@cute.jit
def loop_example():
n = 10
# ❌ This loop is dynamic, early-exit isn't allowed.
for i in cutlass.range_dynamic(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
# ✅ This loop is constexpr, early-exit is allowed.
for i in cutlass.range_constexpr(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
If-Else Statements
------------------
Standard Python ``if``/``else`` is supported.
* **Predicate is Constexpr (compile-time Python value)** → evaluated at compile time.
* **Predicate is dynamic** → lowered to |IR|.
**Example:**
.. code-block:: python
@cute.jit
def main(const_var: cutlass.Constexpr, dynamic_var: cutlass.Int32):
if const_var: # compile-time branch
cute.printf("Const branch\\n")
else:
cute.printf("Const else\\n")
if dynamic_var == 10: # dynamic branch
cute.printf("Dynamic True\\n")
else:
cute.printf("Dynamic False\\n")
Similarly to for-loops, the ``if cutlass.const_expr`` and ``if cutlass.dynamic_expr`` constructs can
be used to force the evaluation at compile-time or the generation of IR, respectively. Unstructured
control flow is only supported when using ``if cutlass.const_expr``.
While Loops
-----------
Python ``while`` loops are always treated as **dynamic** because the loop condition may become
dynamic after the first iteration. Similarly to for-loops and ``if``/``else``, the
``while cutlass.const_expr`` and ``while cutlass.dynamic_expr`` constructs are available.
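A minimal sketch of a dynamic ``while`` loop (assuming that a mutable ``cutlass.Int32`` counter can be initialized and updated as shown):

.. code-block:: python

    import cutlass
    import cutlass.cute as cute

    @cute.jit
    def while_example(n: cutlass.Int32):
        i = cutlass.Int32(0)
        # The condition depends on a runtime value, so the loop is lowered to IR
        while i < n:
            cute.printf("{}", i)
            i += 1

    while_example(3)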
Compile-Time Metaprogramming
----------------------------
Mix compile-time constructs with normal |DSL| code to generate specialised
kernels without runtime overhead. A compile-time flag can, for example, toggle
an optional **ReLU** epilogue:
.. code-block:: python
@cute.kernel
def gemm(..., do_relu: cutlass.Constexpr):
# main GEMM work
...
if const_expr(do_relu): # compile-time guard
# ReLU code is emitted only when do_relu is True
...
.. code-block:: text
gemm(..., False) # ReLU is omitted from the generated |IR|
gemm(..., True) # ReLU is included

View File

@ -0,0 +1,198 @@
.. _dsl_dynamic_layout:
.. |DSL| replace:: CuTe DSL
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================
Static Layout
-------------
When integrating with popular deep learning frameworks, one question is how to deal with the layout of the converted ``cute.Tensor``.
For example, when converting a ``torch.Tensor`` to a ``cute.Tensor``, the shape of the ``torch.Tensor`` is honored for the layout of
``cute.Tensor``.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor):
print(f"tensor.layout: {tensor.layout}") # Prints tensor layout at compile time
cute.printf("tensor: {}", tensor) # Prints tensor values at runtime
In this example, we define a JIT function ``foo`` that takes a ``cute.Tensor`` as input and prints its layout. Note
that Python print is used to print the layout at compile time. This works fine for |SLAY| whose value is known at
compile time.
Now let's try to run the JIT function ``foo`` with different shapes of the input ``torch.Tensor``.
.. code-block:: python
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
a_pack = from_dlpack(a)
compiled_func = cute.compile(foo, a_pack)
compiled_func(a_pack)
Here we first convert a 1D ``torch.Tensor`` with 3 elements to a ``cute.Tensor`` using ``from_dlpack``. Then we compile
the JIT function ``foo`` with the converted ``cute.Tensor`` and call the compiled function.
::
tensor.layout: (3):(1)
tensor: raw_ptr(0x00000000079e5100: i16, generic, align<2>) o (3):(1) =
( 1, 2, 3 )
It prints ``(3):(1)`` for the layout because the converted ``cute.Tensor`` has a |SLAY| with shape ``(3)`` which
is the shape of the ``a``.
Now if we call the compiled function with an input ``torch.Tensor`` of a different shape, we get an unexpected result
at runtime due to the type mismatch: ``compiled_func`` expects a ``cute.Tensor`` with layout ``(3):(1)``,
while ``b`` has shape ``(5)``.
.. code-block:: python
b = torch.tensor([11, 12, 13, 14, 15], dtype=torch.uint16)
b_pack = from_dlpack(b)
compiled_func(b_pack) # ❌ This results in an unexpected result at runtime due to type mismatch
The following output is unexpected due to the type mismatch.
::
tensor: raw_ptr(0x00000000344804c0: i16, generic, align<2>) o (3):(1) =
( 11, 12, 13 )
To fix that, we would have to trigger another code generation and compilation for the new shape for ``b``.
.. code-block:: python
compiled_func_2 = cute.compile(foo, b_pack) # This would trigger another compilation
compiled_func_2(b_pack) # ✅ Now this works fine
As shown in the example above, with the newly compiled ``compiled_func_2``, we can pass in ``b_pack`` to the compiled
JIT function ``compiled_func_2``.
::
tensor.layout: (5):(1)
tensor: raw_ptr(0x0000000034bb2840: i16, generic, align<2>) o (5):(1) =
( 11, 12, 13, 14, 15 )
Now it recompiles and prints the values of ``b`` correctly.
It's obvious that distinct code needs to be generated and compiled for each static layout: in this case, one for layout
``(3):(1)`` and another for layout ``(5):(1)``.
Dynamic Layout
--------------
In order to avoid generating and compiling multiple times for different shapes of the input ``torch.Tensor``, |DSL| provides a way to
generate and compile JIT function with |DLAY|.
To get a dynamic layout for the ``cute.Tensor``, a ``torch.Tensor`` object can be passed into the JIT function directly, which instructs
|DSL| to call ``cute.mark_layout_dynamic`` automatically on the converted ``cute.Tensor`` based on the leading dimension of the layout.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor):
print(tensor.layout) # Prints (?,?):(?,1) for dynamic layout
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.uint16)
compiled_func = cute.compile(foo, a)
compiled_func(a)
b = torch.tensor([[11, 12], [13, 14], [15, 16]], dtype=torch.uint16)
compiled_func(b) # Reuse the same compiled function for different shape
In the example above, a single compilation of the JIT function ``foo`` is reused for different shapes of the input ``torch.Tensor``.
This is possible because the converted ``cute.Tensor`` has a |DLAY| ``(?,?):(?,1)`` which is compatible with the shape of the
input ``torch.Tensor`` of both calls.
Alternatively, for compact layouts, ``cute.mark_compact_shape_dynamic`` can be called for finer-grained control, specifying which mode
of the layout is dynamic and the divisibility constraint for that dynamic dimension.
Refer to :doc:`framework_integration` for more details on ``from_dlpack``, ``mark_layout_dynamic``,
and ``mark_compact_shape_dynamic``.
Static Layout vs. Dynamic Layout
--------------------------------
Per the previous sections, we have seen that |SLAY| leads to distinct JIT code generations while |DLAY| leads to a single
compilation for different shapes.
That said, creating a JIT function with a |SLAY| is useful when the use case targets input data with fixed shapes.
Since more information is available at compile time, the compiler can apply optimizations that would not be possible
for code generated for a |DLAY|.
On the other hand, a |DLAY| is more flexible when the input data has varying shapes, allowing a single piece of
generated code to handle inputs of different shapes.
Programming with Static and Dynamic Layout
------------------------------------------
|DSL| provides an intuitive way to program with both static and dynamic layouts in your code.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
@cute.jit
def foo(tensor, x: cutlass.Constexpr[int]):
print(cute.size(tensor)) # Prints 3 for the 1st call
# Prints ? for the 2nd call
if cute.size(tensor) > x:
cute.printf("tensor[2]: {}", tensor[2])
else:
cute.printf("tensor size <= {}", x)
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
foo(from_dlpack(a), 3) # First call with static layout
b = torch.tensor([1, 2, 3, 4, 5], dtype=torch.uint16)
foo(b, 3) # Second call with dynamic layout
In this example, the JIT function ``foo`` is compiled with a |SLAY| ``(3):(1)`` for the first call, which means the
size of the tensor is known at compile time. |DSL| takes advantage of this and resolves the if condition at
compile time, so the generated code is efficient and contains no branch at all.
For the second call, the JIT function ``foo`` is compiled with a |DLAY| ``(?):(1)``, so the tensor size is only
known at runtime. |DSL| automatically generates code that handles the |DLAY| and evaluates the if condition at runtime.
The same applies to loop as well:
.. code-block:: python
@cute.jit
def foo(tensor, x: cutlass.Constexpr[int]):
for i in range(cute.size(tensor)):
cute.printf("tensor[{}]: {}", i, tensor[i])
a = torch.tensor([1, 2, 3], dtype=torch.uint16)
foo(from_dlpack(a), 3) # First call with static layout
b = torch.tensor([1, 2, 3, 4, 5], dtype=torch.uint16)
foo(b, 3) # Second call with dynamic layout
With the static layout in the first call, |DSL| is able to fully unroll the loop at compile time, while in the second call
the generated code executes the loop at runtime based on the |DLAY|.
With a single JIT function implementation, |DSL| handles control-flow constructs and automatically generates
optimized code for each case. This is possible because |DSL| walks the Python AST and converts
each control-flow construct it finds accordingly.
Please refer to :doc:`dsl_control_flow` for more details.

View File

@ -0,0 +1,128 @@
.. _dsl_introduction:
.. |DC| replace:: dynamic compilation
.. |IR| replace:: IR
.. |DSL| replace:: CuTe DSL
|DSL|
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------
|DSL| is a Python-based domain-specific language (DSL) designed for |DC| of numeric and GPU-oriented code. Its primary goals are:
- **Consistent with CuTe C++**, allowing users to express GPU kernels with full control of the hardware.
- **JIT compilation** for both host and GPU execution.
- `DLPack <https://github.com/dmlc/dlpack>`_ **integration**, enabling seamless interop with frameworks (e.g., PyTorch, JAX).
- **JIT caching**, so that repeated calls to the same function benefit from cached |IR| modules.
- **Native types and type inference** to reduce boilerplate and improve performance.
- **Optional lower-level control**, offering direct access to GPU backends or specialized |IR| dialects.
Decorators
----------
|DSL| provides two main Python decorators for generating optimized code via |DC|:
1. ``@jit`` — Host-side JIT-compiled functions
2. ``@kernel`` — GPU kernel functions
Both decorators can optionally use a **preprocessor** that automatically expands Python control flow (loops, conditionals) into operations consumable by the underlying |IR|.
``@jit``
~~~~~~~~~~~~~
Declares JIT-compiled functions that can be invoked from Python or from other |DSL| functions.
**Decorator Parameters**:
* ``preprocessor``:
* ``True`` (default) — Automatically translate Python flow control (e.g., loops, if-statements) into |IR| operations.
* ``False`` — No automatic expansion; Python flow control must be handled manually or avoided.
**Call-site Parameters**:
- ``no_cache``:
- ``True`` — Disables JIT caching, forcing a fresh compilation each call.
- ``False`` (default) — Enables caching for faster subsequent calls.
``@kernel``
~~~~~~~~~~~~~~~~
Defines GPU kernel functions, compiled as specialized GPU symbols through |DC|.
**Decorator Parameters**:
- ``preprocessor``:
- ``True`` (default) — Automatically expands Python loops/ifs into GPU-compatible |IR| operations.
- ``False`` — Expects manual or simplified kernel implementations.
**Kernel Launch Parameters**:
- ``grid``
Specifies the grid size as a list of integers.
- ``block``
Specifies the block size as a list of integers.
- ``cluster``
Specifies the cluster size as a list of integers.
- ``smem``
Specifies the size of shared memory in bytes (integer).
Calling Conventions
-------------------
.. list-table::
   :header-rows: 1
   :widths: 20 20 15 25

   * - **Caller**
     - **Callee**
     - **Allowed**
     - **Compilation/Runtime**
   * - Python function
     - ``@jit``
     - Yes
     - DSL runtime
   * - Python function
     - ``@kernel``
     - No
     - N/A (error raised)
   * - ``@jit``
     - ``@jit``
     - Yes
     - Compile-time call, inlined
   * - ``@jit``
     - Python function
     - Yes
     - Compile-time call, inlined
   * - ``@jit``
     - ``@kernel``
     - Yes
     - Dynamic call via GPU driver or runtime
   * - ``@kernel``
     - ``@jit``
     - Yes
     - Compile-time call, inlined
   * - ``@kernel``
     - Python function
     - Yes
     - Compile-time call, inlined
   * - ``@kernel``
     - ``@kernel``
     - No
     - N/A (error raised)

View File

@ -0,0 +1,196 @@
.. _dsl_jit_arg_generation:
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
When using the ``@jit`` or ``@kernel`` decorators to define a JIT-compiled function, the arguments to the function are traced to determine the JIT function's signature.
|DSL| provides a Pythonic way to write the arguments of a JIT function as one normally would in Python, and |DSL| takes care of the rest for you.
Specifically, |DSL| honors following when generating the JIT function's arguments:
- JIT function arguments are assumed to be **dynamic arguments** by default.
- If an argument is explicitly type annotated with ``cutlass.Constexpr``, it is treated as a **compile-time constant**.
- If type annotation is provided, |DSL| validates the argument type at compile time for **type safety**.
- |DSL| provides **runtime checkable protocols** (``JitArgument`` and ``DynamicExpression``) for generating JIT function arguments for |CUSTOM_TYPES|.
More details below for each of the above.
Static argument vs. Dynamic argument
------------------------------------
|DSL| supports both static and dynamic arguments for JIT functions.
1. **Static arguments** hold values that are known at compile time. They are not included in the generated JIT function signature.
2. **Dynamic arguments** hold values that are only known at runtime.
By default, |DSL| assumes dynamic arguments and tries to infer the argument types from the call-site argument types. An explicit type annotation ``cutlass.Constexpr`` can be used to specify a static argument.
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2
cute.printf("y: {}", y) # Prints y: 2
foo(2, 2)
In the example above, ``x`` is a dynamic argument with type ``cutlass.Int32`` and ``y`` is a static argument.
With the ``cutlass.Constexpr`` annotation, a more sophisticated use case of a static argument in a JIT function looks like the following:
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.kernel
def kernel(
self,
tiled_mma: cute.TiledMma,
tma_atom_a: cute.CopyAtom,
mA_mkl: cute.Tensor,
tma_atom_b: cute.CopyAtom,
mB_nkl: cute.Tensor,
tma_atom_c: Optional[cute.CopyAtom],
mC_mnl: cute.Tensor,
cluster_layout_vmnk: cute.Layout,
a_smem_layout_staged: cute.ComposedLayout,
b_smem_layout_staged: cute.ComposedLayout,
c_smem_layout_staged: Union[cute.Layout, cute.ComposedLayout, None],
epi_tile: cute.Tile,
epilogue_op: cutlass.Constexpr,
):
...
# Perform epilogue op on accumulator and convert to C type
acc_vec = tTR_rAcc.load()
acc_vec = epilogue_op(acc_vec.to(self.c_dtype))
tTR_rC.store(acc_vec)
In this example, ``epilogue_op`` is a static argument of the JIT kernel that is used for epilogue fusion. When calling the kernel,
an elementwise lambda function can be passed in as the ``epilogue_op`` argument. For example, a ReLU can be applied as the epilogue fusion by simply setting
``epilogue_op`` to ``lambda x: cute.where(x > 0, x, cute.full_like(x, 0))``.
Refer to the `Blackwell dense GEMM example <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py>`__ for a complete example.
Type safety
-----------
|DSL| makes good use of type annotation in JIT function signature and validates the JIT function argument types at compile time for **type safety**.
.. code-block:: python
import cutlass
import cutlass.cute as cute
import numpy as np
@cute.jit
def foo(x: cute.Tensor, y: cutlass.Float16):
...
a = np.random.randn(10, 10).astype(np.float16)
b = 32
foo(a, b)
foo(b, a) # This will fail at compile time due to type mismatch
The type safety check catches type mismatches early at compile time with a clear error message, avoiding tricky runtime errors that are usually much more expensive to debug.
In the example above, the second call to ``foo`` fails at compile time with a clear error message:
::
cutlass.base_dsl.common.DSLRuntimeError: DSLRuntimeError: expects argument #1 (a) to be <class 'cutlass.cute.typing.Tensor'>, but got <class 'int'>
JIT function arguments with |CUSTOM_TYPES|
--------------------------------------------
|DSL| supports |CUSTOM_TYPES| for JIT function arguments by providing two runtime checkable protocols:
* ``JitArgument`` which is used for host JIT functions to be called from Python.
- ``__c_pointers__``: Generate a list of ctypes pointers for the current object.
- ``__get_mlir_types__``: Generate a list of MLIR types for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
* ``DynamicExpression`` which is used for device JIT functions to be called from the host JIT functions.
- ``__extract_mlir_values__``: Generate a dynamic expression for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
Refer to `typing.py <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL/base_dsl/typing.py>`__ for more details on these protocol APIs.
Depending on the nature of the |CUSTOM_TYPES|, |DSL| provides easy ways to adopt them for JIT function arguments.
1. Direct protocol implementation in |CUSTOM_TYPES|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One way is to implement the protocol methods directly in the |CUSTOM_TYPES| to enable the protocol based JIT function argument generation.
.. code-block:: python
import cutlass
import cutlass.cute as cute
# Customized type that implements the DynamicExpression protocol
class MyDynamicExpression:
def __init__(self, tensor, offset):
self._tensor = tensor # Dynamic argument
self._offset = offset # Dynamic argument
def __extract_mlir_values__(self):
return [self._tensor.__extract_mlir_values__(), self._offset.__extract_mlir_values__()]
def __new_from_mlir_values__(self, values):
return MyDynamicExpression(values[0], values[1])
@cute.kernel
def my_kernel(x: MyDynamicExpression):
...
In the example above, the ``MyDynamicExpression`` implements the ``DynamicExpression`` protocol and |DSL| will generate the JIT function arguments for the JIT kernel ``my_kernel`` based on the protocol methods.
2. Adaptor based protocol implementation for |CUSTOM_TYPES|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For cases where directly changing the |CUSTOM_TYPES| to implement the protocol is not feasible, |DSL| provides an adaptor-based approach to adapt the |CUSTOM_TYPES| for JIT function argument generation.
The JIT function argument adaptor is a callable object that implements the desired protocol methods for the registered |CUSTOM_TYPES|. This way, |DSL| automatically queries the JIT argument adaptor registry
to generate the JIT function arguments for the given |CUSTOM_TYPES|.
.. code-block:: python
@cutlass.register_jit_arg_adapter(MyFrameworkObject)
class MyFrameworkObjectAdapter:
"""
Convert a 3rd party framework object to a JIT function argument with JitArgument protocol
"""
def __init__(self, arg):
self._arg = arg
def __c_pointers__(self):
# Convert the framework object to a C-ABI compatible object
# thru its C-ABI interface
return [self._arg.get_cabi_pointer()]
def __get_mlir_types__(self):
# Return the list of MLIR types the framework object represents
return [self._arg.get_data().mlir_type]
def __new_from_mlir_values__(self, values):
# Convert the MLIR values back to the framework object
return MyFrameworkObject(values[0])
In this example, ``MyFrameworkObjectAdapter`` is an adaptor class that bridges |DSL| and the third-party framework type ``MyFrameworkObject``.
Registration is done by simply decorating the adaptor with ``cutlass.register_jit_arg_adapter`` for the customized type. With the adaptor registered,
|DSL| automatically uses it to generate the JIT function arguments for ``MyFrameworkObject``-typed arguments.

View File

@ -0,0 +1,152 @@
.. _dsl_jit_caching:
.. |DSL| replace:: CuTe DSL
.. _JIT_Caching:
|DSL| JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------
Zero Compile is a feature that enables explicit kernel compilation on demand through ``cute.compile``.
When ``cute.compile`` is called, it compiles the kernel and returns a JIT Executor instance.
This JIT Executor instance can be cached and reused directly for subsequent executions without compiling the kernel again.
The JIT Executor is a component that independently executes compiled code.
It can be created either through ``cute.compile`` or implicit compilation.
The JIT Executor instance behaves like a callable object to execute the compiled code.
Each JIT Executor instance maintains a single compiled host function.
It encompasses all necessary execution components:
* Host function pointer and its MLIR execution engine
* CUDA modules (optional)
* Argument specifications defining how Python arguments are converted to C ABI-compatible types. Note that arguments with the ``cutlass.Constexpr`` hint are excluded from argument specifications since they are evaluated at compile time rather than runtime.
For example, in the following code, ``print_result`` is a ``cutlass.Constexpr`` value that is **NOT** evaluated at runtime:
.. code-block:: python
import cutlass.cute as cute
@cute.jit
def add(a, b, print_result: cutlass.Constexpr):
if print_result:
cute.printf("Result: %d\n", a + b)
return a + b
jit_executor = cute.compile(add, 1, 2, True)
jit_executor(1, 2) # output: ``Result: 3``
The JIT Executor ensures all components are properly initialized and loaded after compilation.
For example, all CUDA modules are loaded (via ``cuModuleLoad``) and kernel function pointers are extracted (via ``cuModuleGetFunction``).
When calling a JIT Executor instance, it:
* Parses Python runtime arguments and converts them to C ABI-compatible types according to argument specifications
* Invokes the host function with the converted arguments
Custom Caching with ``cute.compile``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``cute.compile`` bypasses caching in |DSL| and always performs compilation, returning a fixed JIT Executor instance.
This allows implementing custom caching strategies as shown below:
.. code-block:: python
@cute.jit
def add(b):
return a + b
# Define a custom cache
custom_cache = {}
a = 1
compiled_add_1 = cute.compile(add, 2)
custom_cache[1] = compiled_add_1
compiled_add_1(2) # result = 3
a = 2
compiled_add_2 = cute.compile(add, 2)
custom_cache[2] = compiled_add_2
compiled_add_2(2) # result = 4
# Use the custom cache
custom_cache[1](2) # result = 3
custom_cache[2](2) # result = 4
Cache in |DSL|
-----------------
By default, caching in |DSL| is implicitly enabled to avoid recompilation when kernels are called repeatedly without changes.
The cache is implemented as a map storing compiled JIT Executor instances within |DSL|.
The cache key combines hashes of:
* MLIR bytecode of the MLIR program generated by |DSL|
* All |DSL| Python source files
* All |DSL| shared libraries
* All |DSL| environment variables
The cache value is a compiled JIT Executor instance.
On a cache hit, compilation is skipped and the cached JIT Executor instance is reused.
On a cache miss, the kernel is compiled and the new JIT Executor instance is stored in the cache.
Here is an example demonstrating automatic caching of the ``add`` kernel:
.. code-block:: python
# Global variable
a = 1
@cute.jit
def add(b):
return a + b
# Cache is empty at beginning
# First call: cache miss triggers compilation
result = add(2) # result = 3
# Cache now has one instance
# Second call: cache hit reuses cached JIT Executor
result = add(2) # result = 3
a = 2
# Third call: cache miss due to changed IR code triggers recompilation
result = add(2) # result = 4
# Cache now has two instances
The cache can be serialized to files for subsequent runs.
After serialization, the compiled MLIR bytecode is stored in files.
The cache directory is ``/tmp/{current_user}/cutlass_python_cache``.
The cache loads from files into memory during |DSL| initialization and saves back to files when the process exits.
The following environment variables control file caching:
.. code-block:: bash
# Disable file caching while keeping in-memory cache available, defaults to False.
export CUTE_DSL_DISABLE_FILE_CACHING=True
# Maximum number of cache files allowed, defaults to 1000.
export CUTE_DSL_FILE_CACHING_CAPACITY=1000
Limitations
~~~~~~~~~~~~~~~~~~~~~
The intention of caching is to reduce the host launch overhead before each execution. As the above example shows,
the consistency between the original Python code and the MLIR program is hard to maintain because of the impact of dynamic factors such as global variables.
Therefore, the MLIR program **MUST** always be generated to verify that the kernel content matches what was previously built.
For optimal host launch latency, we recommend using the custom caching method described above with ``cute.compile``.

View File

@ -0,0 +1,412 @@
.. _framework_integration:
.. |DSL| replace:: CuTe DSL
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
user, and provides example code snippets for common usage patterns.
Implicit Conversion
-------------------
Tensors originating from frameworks supporting the DLPack protocol can be directly provided to a
JIT function as a regular parameter. |DSL|'s runtime implicitly converts the original tensor to a
CuTe tensor with a fully dynamic layout except for the stride element corresponding to the leading
dimension. The example below demonstrates this use case.
.. code-block:: python
import torch
import cutlass.cute as cute
@cute.jit
def foo(src):
"""
The following lines print
ptr<f32, generic> o (?,?,?):(?,?,1)
<class 'cutlass.cute.core._Tensor'>
"""
print(src)
print(type(src))
a = torch.randn(30, 20, 32, device="cpu")
foo(a)
Explicit conversion using ``from_dlpack``
------------------------------------------
|DSL|'s runtime provides an interface for converting DLPack-compatible tensors to CuTe tensors,
.. code-block:: python
b = cute.runtime.from_dlpack(a)
where ``a`` is a tensor supporting the DLPack protocol with the ``__dlpack__``
and ``__dlpack_device__`` methods. The resulting CuTe tensor ``b`` has a fully static layout. This
conversion is performed without copying any tensor data, enabling seamless integration with major
frameworks. Users can create tensors using NumPy, PyTorch, etc. and directly feed them into JIT
functions written using |DSL|.
The resulting CuTe tensor shares the same underlying memory buffer as the original tensor. This
zero-copy approach maximizes performance by eliminating unnecessary data duplication. However, it is
important to note that the CuTe tensor's validity is tied to the lifetime of the original tensor. If
the source tensor is destroyed or goes out of scope, the corresponding CuTe tensor becomes invalid
since it references the original memory location.
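A minimal sketch of this lifetime caveat (illustrative only; the helper name is made up):
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
def make_cute_tensor():
    a = torch.randn(16, 16)
    return from_dlpack(a)  # `a` is the sole owner of the underlying storage
t = make_cute_tensor()
# `a` has gone out of scope, so `t` may now reference freed memory;
# keep the source tensor alive for as long as the CuTe tensor is used.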
The full signature of ``from_dlpack`` is as follows:
.. code-block:: python
def from_dlpack(tensor, assumed_align=None):
The ``assumed_align`` integer parameter specifies the alignment of the tensor in units of bytes.
The tensor's base address must be divisible by ``assumed_align``. When not provided explicitly,
the alignment is set to the natural alignment of the tensor's element type. Note that the alignment
information is part of the pointer type in the generated IR. Therefore, programs with different
alignments have different IR, and identical IR is required to hit the kernel caching
mechanism of |DSL|.
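For example, assuming the buffer allocated by PyTorch happens to be 16-byte aligned, a larger alignment than the natural one can be requested as follows (illustrative sketch):
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
# Assume the base address is 16-byte aligned instead of the natural 4-byte
# alignment of Float32; the base address must actually satisfy this.
y = from_dlpack(x, assumed_align=16)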
Code Example
~~~~~~~~~~~~
The following code demonstrates how to convert a PyTorch tensor to a CuTe tensor using the
``from_dlpack`` function with default parameters.
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
y = from_dlpack(x)
Once converted, we can access the tensor's information through various
attributes. The following list shows the attributes of the converted tensor:
- ``tensor.shape``: the tensor's shape
- ``tensor.stride``: the tensor's stride
- ``tensor.memspace``: the tensor's memory space
- ``tensor.element_type``: the tensor's element data type
.. code-block:: python
import torch
import cutlass
from cutlass.cute.runtime import from_dlpack
x = torch.randn(30, 20, device="cpu")
y = from_dlpack(x)
print(y.shape) # (30, 20)
print(y.stride) # (20, 1)
print(y.memspace) # generic (if the torch tensor is on device memory, memspace will be gmem)
print(y.element_type) # Float32
print(y) # Tensor<0x000000000875f580@generic o (30, 20):(20, 1)>
The string format of the resulting CuTe tensor is
.. code-block::
Tensor<0x{tensor.data_ptr:016x}@{tensor.memspace} o {tensor.shape}:{tensor.stride}>
As can be seen in the example above, ``from_dlpack`` first results in a tensor with a static layout.
To obtain dynamic or mixed static/dynamic layouts after calling ``from_dlpack``, the
``mark_layout_dynamic`` and ``mark_compact_shape_dynamic`` functions are used and described in
the following sections.
When to Use Explicit Conversion?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The DLPack protocol is a widely used protocol for interoperability between different frameworks.
However, there is some associated overhead. Based on our benchmark, it usually takes between 2 and 3
us per call to ``from_dlpack``.
Explicit conversion allows for caching the converted CuTe tensors in order to avoid the overhead of
repeated calls to ``from_dlpack``.
.. code-block:: python
cached_tensors = {}  # user-managed cache of converted CuTe tensors
key = "x"            # any hashable key identifying this tensor
x = torch.randn(30, 20, device="cpu")
if key not in cached_tensors:
# Do the conversion only for cache misses
cached_tensors[key] = cute.runtime.from_dlpack(x)
foo(cached_tensors[key])
Another use case for explicit conversion is to gain fine-grain control over which modes of a tensor
are considered dynamic from the perspective of the generated program.
Mark the Tensor's Layout as Dynamic with ``mark_layout_dynamic``
----------------------------------------------------------------
After calling this function, all shape modes become dynamic. The stride modes also become dynamic
with the following two exceptions:
1. the leading dimension's stride remains fixed at 1;
2. stride elements equal to 0 (which indicates broadcasting) are retained.
The full signature of ``mark_layout_dynamic`` is as follows:
.. code-block:: python
def mark_layout_dynamic(self, leading_dim: int|None = None):
The ``leading_dim`` parameter specifies the leading dimension of the tensor. The leading dimension's
stride is set to 1 unless inconsistent with the layout of the DLPack tensor. For example,
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, if ``leading_dim`` is specified to be 1,
the layout will be marked as ``(?,?,?,?):(?,1,?,?)``.
- If ``leading_dim`` is specified to be 0, an error is raised because the stride of
dimension 0 is 2 (not 1).
The default value for ``leading_dim`` is ``None``. In such case, the system
automatically deduces it from the tensor's layout using the following logic:
1. If a dimension's stride is 1, that dimension is marked as the leading dimension.
2. If multiple dimensions satisfy condition 1, an error is thrown indicating deduction failure.
Note that after converting a **PyTorch** tensor to the DLPack format, the strides of dimensions
with size 1 are canonicalized to 1. This canonicalization can increase the likelihood of
deduction failures. This behavior is specific to PyTorch and does not occur with NumPy, for
example.
3. If no dimension satisfies condition 1, all strides are marked as dynamic.
For example:
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, the leading dimension is 1.
The layout will be marked as ``(?,?,?,?):(?,1,?,?)``.
- For a tensor with layout ``(1,5,1):(1,1,1)``, if ``leading_dim`` is not specified,
a deduction failure error is raised.
- For a tensor with layout ``(2,2):(8,2)``, since no dimension has stride 1,
all dimensions are marked as dynamic: ``(?,?):(?,?)``.
Code Example
~~~~~~~~~~~~
The following example demonstrates how to use ``mark_layout_dynamic`` to specify dynamic tensor layouts.
* ``t0`` shows the usage of ``mark_layout_dynamic`` with an unspecified ``leading_dim`` and the automatic deduction of the leading dimension.
* ``t1`` & ``t2`` show the usage of ``mark_layout_dynamic`` with a specified ``leading_dim``.
* ``t3`` shows the usage of ``mark_layout_dynamic`` with no leading dimension.
* ``t4`` shows the usage of ``mark_layout_dynamic`` with broadcasted dimensions.
* ``t5`` demonstrates the deduction failure when more than one dimension has a stride equal to 1.
* ``t6`` & ``t7`` demonstrate incorrect settings for ``leading_dim`` and the expected errors.
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1 its stride is canonicalized to 1,
# resulting in (1,4,1,32,1):(1,1,1,4,1)
b = torch.empty(32, 1, 1, 1, 4).permute(3, 4, 1, 0, 2)
# (2,2):(8,2)
c = torch.empty(3, 4)[::2, ::2]
# (3,1,1,5):(5,0,0,1)
d = torch.empty(3, 1, 1, 5).expand(3, 4, 2, 5)
# auto deduce the leading dimension to be 3
t0 = from_dlpack(a).mark_layout_dynamic()
print(t0)
# (?,?,?,?):(?,?,?,1)
t1 = from_dlpack(b).mark_layout_dynamic(leading_dim=0)
print(t1)
# (?,?,?,?,?):(1,?,?,?,?)
t2 = from_dlpack(b).mark_layout_dynamic(leading_dim=2)
print(t2)
# (?,?,?,?,?):(?,?,1,?,?)
t3 = from_dlpack(c).mark_layout_dynamic()
print(t3)
# (?,?):(?,?)
t4 = from_dlpack(d).mark_layout_dynamic()
print(t4)
# (?,?,?,?):(?,0,0,1)
t5 = from_dlpack(b).mark_layout_dynamic()
# Can't deduce the leading dimension from layout, please specify the leading_dim explicitly.
t6 = from_dlpack(a).mark_layout_dynamic(leading_dim=1)
# Expected strides[leading_dim] == 1, but got 16
t7 = from_dlpack(b).mark_layout_dynamic(leading_dim=3)
# Expected strides[leading_dim] == 1, but got 4
Mark the Tensor's Layout as Dynamic with ``mark_compact_shape_dynamic``
-----------------------------------------------------------------------
The ``mark_compact_shape_dynamic`` function provides fine-grain control over dynamic shapes for compact
layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
.. code-block:: python
def mark_compact_shape_dynamic(self, mode: int, stride_order: tuple[int, ...]|None = None, divisibility: int = 1):
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their strides are canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
modes (dimensions) if the current layout were to be converted to row-major order. It starts from the
outermost to the innermost dimension when reading it from left to right. This parameter must be
explicitly set when the stride order cannot be automatically deduced from the tensor's layout, such
as when multiple dimensions have a stride of 1.
For example:
- Layout ``(4,2):(1,4)`` has a ``stride_order`` of ``(1,0)``, indicating that the innermost dimension is
0 (``4:1``) and the outermost dimension is 1 (``2:4``).
- Layout ``(5,3,2,4):(3,1,15,30)`` has a ``stride_order`` of ``(3,2,0,1)``, indicating that the innermost
dimension is 1 (``3:1``) and the outermost dimension is 3 (``4:30``).
If ``stride_order`` is not specified, the system automatically deduces it from the tensor's layout
using the following logic:
1. Sort the strides in descending order.
2. If multiple dimensions have a stride of 1, a deduction failure error is raised.
For example:
- For a tensor with layout ``(2,2,3,4):(2,1,4,12)``, the deduced ``stride_order`` is ``[3,2,0,1]``.
- For a tensor with layout ``(1,5,1):(1,1,1)``, ``stride_order``'s deduction fails because
all dimensions have an identical stride of 1, making it impossible to determine the correct ordering.
If ``stride_order`` is specified, the system validates that the order is consistent with the
tensor's layout.
The ``divisibility`` parameter specifies the divisibility of the dynamic shape. It can be used to
represent the assumed alignment of the input. It defaults to 1.
Note that this API is only available for compact tensors. For non-compact tensors, we can use
``cute.assume`` to attach divisibility information to a specific shape mode in a host JIT function,
as demonstrated in the following example:
.. code-block:: python
@cute.jit
def foo(a: cute.Tensor):
new_shape = a.shape
# use cute.assume to set shape of mode=0 with divisibility=16
new_shape[0] = cute.assume(new_shape[0], 16)
new_layout = cute.make_layout(new_shape, stride=a.stride)
new_a = cute.make_tensor(a.iterator, new_layout)
Code Example
~~~~~~~~~~~~
The following example demonstrates how to use ``mark_compact_shape_dynamic`` to specify dynamic tensor layouts.
* ``t0`` & ``t1`` show the usage of ``mark_compact_shape_dynamic`` with unspecified ``stride_order`` and different ``mode`` and ``divisibility``.
* ``t2`` shows the usage of consecutive ``mark_compact_shape_dynamic`` with unspecified ``stride_order`` and different ``mode`` and ``divisibility``.
* ``t3`` & ``t4`` show the usage of ``mark_compact_shape_dynamic`` with different specified ``stride_order``.
* ``t5``, ``t6``, ``t7``, ``t8``, ``t9``, ``t10``, ``t11``, and ``t12`` demonstrate incorrect settings for parameters and expected errors.
.. code-block:: python
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => for a torch tensor, when a dimension has shape 1 its stride is canonicalized to 1,
# resulting in (1,4,1,32,1):(1,1,1,4,1)
# b.dim_order() is (3,2,4,0,1)
b = torch.empty(32, 1, 1, 1, 4).permute(3, 4, 1, 0, 2)
# auto deduce the stride order to be [2,1,0,3]
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
t2 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)
t5 = t2.mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3)
)
# The stride_order is not consistent with the last stride_order
t6 = from_dlpack(a).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3)
)
# The stride_order is not consistent with the deduced stride_order
t7 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=4
)
# The layout could not be deduced, please specify the stride_order explicitly
t8 = from_dlpack(b).mark_compact_shape_dynamic(
mode=30, divisibility=5, stride_order=(3, 0, 2, 4, 1)
)
# Expected mode value to be in range [0, 5), but got 30
t9 = from_dlpack(b).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(2, 1, 2, 3, 4)
)
# Expected stride_order to contain all the dimensions of the tensor, but it doesn't contain 0.
t10 = from_dlpack(b).mark_compact_shape_dynamic(
mode=3, divisibility=5, stride_order=(0, 1, 2, 3, 4, 5)
)
# Expected stride_order to have 5 elements, but got 6.
t11 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=4, stride_order=b.dim_order()
)
# The shape(1) of mode(0) is not divisible by the divisibility(4)
t12 = from_dlpack(b).mark_compact_shape_dynamic(
mode=0, divisibility=1, stride_order=(2, 1, 3, 0, 4)
)
# The stride_order is not consistent with the layout

View File

@ -0,0 +1,16 @@
.. _notebooks:
Educational Notebooks
=====================
A number of notebooks for educational purposes are provided in the `CUTLASS GitHub repository <https://github.com/NVIDIA/cutlass>`__.
A list of links is given below:
- `"Hello world" <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/hello_world.ipynb>`__
- `Printing <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/print.ipynb>`__
- `Data Types Basics <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/data_types.ipynb>`__
- `Tensors <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensor.ipynb>`__
- `The TensorSSA Abstraction <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensorssa.ipynb>`__
- `Layout Algebra <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/cute_layout_algebra.ipynb>`__
- `Element-wise Add Tutorial <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/elementwise_add.ipynb>`__
- `Using CUDA Graphs <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/cuda_graphs.ipynb>`__

View File

@ -0,0 +1,137 @@
.. _faqs:
FAQs
====
General
---------------------
**Are the DSLs replacing C++ templates?**
TL;DR: No - but also yes. The CUTLASS 4.0 release (CuTe DSL), along with all
future extensions to our Python-native programming models, does not come at the
expense of CUTLASS C++. CUTLASS 2.x and 3.x C++ APIs are both going to continue
receiving fixes and updates for the architectures we support them for. However,
CUTLASS 4.x CuTe DSL is fully isomorphic in its programming model and performance
with CuTe C++ for Blackwell, and it is our hope that the community embraces it
for much easier, yet equally performant, custom kernel development. This is
why we are releasing CuTe DSL with support for all architectures starting with the
NVIDIA Ampere Architecture.
**What is the difference between CuTe DSL, CUTLASS Python, and CUTLASS DSLs?**
CUTLASS Python was the Python interface for instantiating C++ kernels via a Python
frontend. This is now deprecated with the release of CUTLASS 4.0. CUTLASS DSLs are
a family of Python DSLs for native device programming in Python. Currently, this is
limited to our initial release of CuTe DSL, but future versions will include higher-level
abstractions that gradually trade off control for convenience.
**What should I learn, CUTLASS C++ or the Python DSLs?**
We believe the Python DSLs will significantly improve the learning curve and recommend starting
with them for all newcomers, as they eliminate the inherent complexity of learning C++
metaprogramming for GPU kernel programming. Since CuTe C++ and CuTe DSL share fully isomorphic
programming models and patterns, any knowledge gained can eventually be applied to C++.
**Where will the code live? PIP wheel or GitHub repo? Do I have to build it myself?**
This is a major change compared to CUTLASS C++ and Python DSLs. Going forward,
the GitHub code only exists as a way for users to file issues and pull requests against.
While it can be used with the pip wheel, we do not recommend most users do so unless they are
hacking on the DSL itself. For all other users, we recommend they
simply ``pip install nvidia-cutlass-dsl`` and use the pip wheel as the single source
of truth for the dialect compiler and DSL implementation. The CUTLASS GitHub repository will
contain a ``requirements.txt`` file pinning the version of the wheel consistent with the state
of the OSS repository (please see :doc:`quick_start`). This means getting started with
CUTLASS is easier than ever: no more CMake command lines to learn and no more builds to kick
off. Simply install the pip wheel and start running the examples.
Migration
---------------------
**Should I port my code from C++ templates to Python?**
Almost certainly not, unless you need extremely fast JIT times for your kernel and C++ compile times
are a blocker for you. The 2.x and 3.x APIs will continue to be supported, and the 3.x API for NVIDIA's
Hopper and Blackwell architectures will continue to improve in terms of features
and performance.
**Are portability promises different with Python?**
For the initial release while the DSL is still in beta, we do not promise any portability
as we may make changes to the DSL itself. While we do not expect any changes to the CuTe operations,
the DSL utilities, decorators, helper classes like pipelines and schedulers may change as we refine them
with community feedback. We encourage users to file issues and discussions on GitHub during this
beta period with their feedback!
In the long term, we plan to continue to treat the OSS community with care.
Just like the prior history of CUTLASS, we plan not to break users unless necessary,
but we reserve the right to make limited breaking changes in case we believe it is a
net benefit to the community and project. These will be announced ahead of time and/or
clearly highlighted in the CHANGELOG of each release.
Technical
---------------------
**What NVIDIA architectures will it support?**
CuTe DSL will support all NVIDIA GPU architectures starting with NVIDIA Ampere Architecture (SM80).
**Will it be compatible with DL frameworks (e.g., PyTorch, JAX)?**
Yes, we will provide utilities to convert from DLPack-supported tensor formats
to ``cute.Tensor``. This should allow a user to never have to leave Python
when writing model code in their framework of choice. Our JAX interoperability story is not
as strong as PyTorch's today; however, we are actively working on improving it
and welcome contributions in this space.
**Does it compile to PTX or SASS?**
CuTe DSL compiles the program down to PTX. After that, we currently use the PTX compiler that
ships with the CUDA toolkit to compile the PTX down to SASS. We plan to remove
this limitation in the future and allow the use of the PTX JIT that is included in the
CUDA driver in case a user does not have a CUDA toolkit installed.
**Do I need to use NVCC or NVRTC?**
No, the ``nvidia-cutlass-dsl`` wheel packages everything needed to generate GPU kernels. It
shares the driver requirements of the 12.9 toolkit which can be found
`here <https://developer.nvidia.com/cuda-toolkit-archive>`__.
**How would one debug the code?**
Since CuTe DSL is an embedded DSL rather than native Python, tools like ``pdb``
cannot be used. However, if you have experience with GPU kernel programming, the debugging
techniques will be nearly identical. Typically, compile-time and runtime printing
of types and values is the most expedient approach. Please see `documentation on printing <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/print.ipynb>`__
to learn how to print types and values at both compile time and runtime.
You can also use ``cuda-gdb`` to set breakpoints in the program and step through the execution
or use tools such as ``compute-sanitizer`` to detect and triage bugs in your program. As the DSL
matures, our source location tracking from Python user programs will also improve to provide
more helpful source-level mapping when setting breakpoints and using other tools such as nsight.
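As a minimal sketch of the two styles of printing mentioned above (following the conventions used in the printing notebook):
.. code-block:: python
import cutlass
import cutlass.cute as cute
@cute.jit
def debug_example(a: cutlass.Int32):
    print(a)                    # compile-time print: shows the traced value/type information
    cute.printf("a = %d\n", a)  # runtime print: shows the actual value during execution
debug_example(42)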
**How would one implement warp specialization in CuTe DSL?**
Exactly the same way you would in C++ but in a Python-native syntax instead.
Consult our :doc:`cute_dsl_general/dsl_control_flow` and
`"Blackwell kernel example" <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py>`__
for a detailed how-to guide.
**Can I call functions from other functions or use OOP?**
Yes. We frequently call functions from one another and set up class
hierarchies to organize and modularize our code for pipelines and schedulers.
Consult the :doc:`cute_dsl_general/dsl_introduction` documentation or our examples for more details.
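As a small illustrative sketch (the helper and function names here are made up), a plain Python helper can be called from a JIT function and is traced inline:
.. code-block:: python
import cutlass
import cutlass.cute as cute
def scale_and_add(a, b, factor):
    # Ordinary Python function, inlined when traced from JIT code
    return a * factor + b
@cute.jit
def compute(a: cutlass.Int32, b: cutlass.Int32):
    cute.printf("result = %d\n", scale_and_add(a, b, 2))
compute(3, 4)  # prints: result = 10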
License
---------------------
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,
it is subject to usage terms and restrictions similar to those of the CUDA SDK. Please refer to the EULA for specific terms of use.
CuTe DSL samples and Jupyter notebooks, released `on GitHub <https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL>`__, are provided under
the BSD 3-Clause License and may be used and redistributed under those terms. This distinction ensures that developers have flexibility
when using or modifying the code samples, independent of the compiler and runtime components governed by the EULA.
If you have any questions or need clarification, feel free to contact us.

View File

@ -0,0 +1,34 @@
.. _functionality:
Functionality
====================
The CUTLASS DSL 4.0 release supports **Python 3.12** only. It shares the same driver requirements
as the `CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`__.
Specifically, the driver version must be 575.51.03 or later.
Currently, only Linux x86_64 is supported. Additional platform support will be added in future releases.
Supported MMA Operations
---------------------------------
**NVIDIA Ampere Architecture:**
- FP16 / BF16 tensor core instructions
**NVIDIA Hopper Architecture:**
- FP16 / BF16
- FP8
**NVIDIA Blackwell Architecture:**
- FP16 / BF16
- TF32
- I8
- F8
Notable Limitations
------------------------------
For current constraints and unsupported features, refer to the :doc:`limitations` section.

View File

@ -0,0 +1,279 @@
.. _limitations:
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------
CuTe DSL is an embedded domain-specific language within Python. It utilizes a subset of Python's
syntax to provide a streamlined programming experience. It is important to understand that CuTe DSL
does NOT implement the complete Python language semantics in its JIT compilation process.
This section documents the current limitations of the CuTe DSL. While some of these limitations
may be addressed in future releases, developers should be aware of them when building applications with
the DSL.
Notable unsupported features
----------------------------
- GeForce RTX 50 Series support
- RS WGMMA (the input matrix A comes from registers and the input matrix B comes from shared memory)
- Programmatic Dependent Launch (PDL)
- narrow-precision data type support, including related tensor core instructions
- convolutions
- full support for ahead of time compilation
- preferred clusters
- CLC-based tile schedulers
- EVT support
- Windows support
Programming Model
---------------------
**Python Native Data Types**
CuTe DSL supports Python data structures when used for "meta-programming,"
but these structures cannot be treated as dynamic values modifiable at runtime.
For instance, lists and dictionaries can be used to configure kernel parameters
during compilation or serve as containers for dynamic values,
but their structure and organization cannot be altered during kernel execution.
- **Static Values:**
- Evaluated during JIT compilation phase
- Immutable after compilation completes
- Most Python native types (lists, tuples, dictionaries) are processed as static values
- Primarily utilized for "meta-programming" and configuration purposes
- Example: Lists can contain dynamic values but their structure cannot
be modified during kernel execution
- **Dynamic Values:**
- Evaluated during runtime execution
- Modifiable during execution of JIT-compiled functions
- Only a specific subset of Python types are supported as dynamic values
- Primitive types are automatically converted when passed as function arguments (see the sketch after this list):
- ``int`` → ``Int32`` (may be updated to ``Int64`` in future releases)
- ``bool`` → ``Bool``
- ``float`` → ``Float32`` (may be updated to ``Float64`` in future releases)
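The following sketch illustrates these automatic conversions; it relies on compile-time printing as shown in the framework integration examples and is for illustration only.
.. code:: python
@cute.jit
def show_converted_types(i, b, f):
    # A Python int, bool and float arrive as Int32, Bool and Float32 respectively
    print(type(i), type(b), type(f))
show_converted_types(1, True, 2.0)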
The JIT compiler processes Python native types analogously to C++ template parameters.
The compiled code cannot manipulate dynamic values of composite types
such as lists, tuples, or dictionaries.
For example, the following code doesn't work as a traditional Python program inside a JIT function.
.. code:: python
@cute.jit
def foo(a: Float32, b: Float32, i: Int32, res: cute.Tensor):
xs = [a, b]
# indexing list with dynamic index is not supported in CuTe DSL:
res[0] = xs[i]
if i == 0:
# This will always append Float32(3.0) to the list regardless
# of the runtime value of `i`
xs.append(Float32(3.0))
for i in range_dynamic(10):
# This only appends one element to the list at compile time
# as the loop doesn't unroll at compile time
xs.append(Float32(1.0))
**Python Function**
The DSL currently does not implement support for return values from Python functions,
although this capability is planned for future releases.
Example:
.. code:: python
@cute.jit
def foo():
return 1 # Currently unsupported in CuTe DSL
**Expression or Statement with Dependent Type**
CuTe DSL implements static typing and does not support dependent types.
The type of each expression must be determinable during compile time,
in contrast to standard Python which implements dynamic typing.
Example illustrating functionality in Python that is not supported in the DSL:
.. code:: python
# Valid in standard Python, but unsupported in CuTe DSL
max(int(1), float(2.0)) # => 2.0 : float
max(int(3), float(2.0)) # => 3 : int
In CuTe DSL, types are promoted. For example:
.. code:: python
@cute.jit
def foo(a: Int32, b: Float32, res: cute.Tensor):
res[0] = max(a, b) # Type is automatically promoted to Float32
Following code using inlined if-else expression with dependent types
is not supported in CuTe DSL:
.. code:: python
@cute.jit
def foo(cond: Boolean, a: Int32, b: Float32, res: cute.Tensor):
res[0] = a if cond else b
**Control Flow**
The DSL transforms Python control flow statements (``if``, ``for``, ``while``)
during Abstract Syntax Tree (AST) processing into structured control flow in MLIR
which has the same constraints as dependent types. For instance,
changing the type of a variable in a loop body is not allowed.
- Variables must be defined prior to the control flow statement
- Type consistency must be maintained throughout the control flow statement
- Early exit or return from if-else statements is not supported
Example illustrating functionality in Python that is not supported in the DSL:
.. code:: python
@cute.jit
def foo():
a = Int32(1)
for i in range_dynamic(10):
a = Float32(2) # Changing type inside loop-body is not allowed in the DSL
**Built-in Operators**
The DSL transforms built-in operators like ``and``, ``or``, ``max``, ``min``, etc.
into MLIR operations. They also follow the same constraints of dependent types.
For instance, ``a and b`` requires ``a`` and ``b`` to be of the same type.
Comparisons like ``==`` on a sequence of dynamic values are known to not produce
the expected result at runtime.
**Object Oriented Programming**
The DSL is implemented on top of Python and supports Python's object-oriented programming (OOP) features
for meta-programming at compile-time.
However, similar to other composite data types, the DSL provides limited support for OOP when objects
contain dynamic values. It is strongly recommended to avoid passing dynamic values between member methods
through class state in your code.
The following example illustrates functionality in Python that is not supported in the DSL
without implementing the ``DynamicExpression`` protocol:
.. code:: python
class Foo:
def __init__(self, a: Int32):
self.a = a
def set_a(self, i: Int32):
self.a = i
def get_a(self):
return self.a
@cute.jit
def foo(a: Int32, res: cute.Tensor):
foo = Foo(a)
for i in cutlass.range_dynamic(10):
foo.set_a(i)
# This fails to compile because `a` is assigned a local value defined within the for-loop body
# and is not visible outside of the loop body
res[0] = foo.get_a()
The example above fails to compile because ``Foo.a`` is assigned a local value defined within the for-loop body,
which is not visible outside the loop body.
The CuTe DSL implements an internal mechanism that provides limited support for OOP patterns via protocol.
As the DSL continues to evolve to support additional features, this mechanism is subject to change
and is not recommended for direct use in users' code for better portability.
**CuTe Layout algebra in native Python**
The entirety of CuTe layout algebra operations and APIs requires JIT compilation. These
functionalities are exclusively available within JIT-compiled functions and cannot be
accessed in standard Python execution environments.
Additionally, there exists a restricted set of data types that can be passed as arguments
to JIT-compiled functions, which further constrains their usage in native Python contexts.
Only following CuTe algebra types are supported as JIT function arguments: ``Tensor``, ``Pointer``,
``Shape``, ``Stride``, ``Coord`` and ``IntTuple``. For ``Stride``, we don't support ``ScaledBasis``
from a native Python context. Unfortunately, in the first release, we don't support
passing ``Layout`` under a native Python context.
Suggestions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For reliable and predictable results:
- Avoid dependent types in your code
- Implement explicit type conversion for dynamic values
- Clearly distinguish between static (compile-time) and dynamic (runtime) values
- Use type annotations as much as possible to help JIT compiler
to identify type to avoid ambiguity
.. code:: python
# Example demonstrating explicit typing
alpha = 1.0 # Explicitly defined as float using `1.0` instead of `1`
# or `float(1)`
beta = 2.0 # Explicitly defined as float
result = max(alpha, beta) # Will correctly perform float comparison
**Debugging Capabilities**
Debugging tools and facilities for the Python DSL are currently more limited in comparison to the C++
API. For instance, we don't support single-stepping through the JIT-compiled code, and the lack of exception
handling in JIT-compiled code makes it hard to debug in some cases.
**Integration with Frameworks**
Integration with certain deep learning frameworks is in early development stages and may have
limitations. For instance, converting a framework tensor to ``cute.Tensor`` is known to incur an overhead
of 2 to 3 us per tensor, as the conversion goes through the general DLPack protocol, which offers compatibility with
all frameworks.
**Hashing DSL APIs and Objects**
DSL APIs and objects are sensitive to the MLIR context, region, or other contextual information, which has no meaning across
different contexts. Any stateful design relying on ``__hash__`` is likely to misbehave with unexpected results. An example is
``functools.lru_cache``: combined with ``@cute.jit``, it may cache an MLIR object from one context and use it in another.
Future Improvements
---------------------
The CuTe DSL development team is actively addressing these limitations.
Upcoming releases will aim to:
- Implement support for return values from JIT compiled functions
- Improve support for built-in operators to handle more cases without dependent types
- Enhance debugging capabilities and tools
- Improve error messages with precise diagnostic information
- Extend support for additional numeric data types
- Improve performance of converting framework tensor to ``cute.Tensor`` with native support
for different frameworks
- Offer more user friendly benchmarking methodology
Design Limitations Likely to Remain
--------------------------------------------
The primary objective of CuTe DSL is to provide a domain-specific language for expressing
complex CUDA kernels with optimal GPU performance, not to execute arbitrary Python code on GPU hardware.
The following limitations will likely remain by design:
- **Complex Data Structures as Dynamic Values**: Lists, tuples, and dictionaries will continue to function
as static containers. While they can store dynamic values, their structure (adding/removing elements)
cannot be modified during execution of JIT-compiled functions.
- **Dependent Types**: Supporting dependent types would introduce substantial complexity and
adversely affect the performance characteristics of generated code.
- **CuTe Layout Algebra**: We don't have plans to extend the support of CuTe layout algebra
under a native Python context. We are planning to extend support for data types and allow
JIT functions to interoperate with native Python code.

View File

@ -0,0 +1,108 @@
.. _overview:
Overview
===========================
CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development.
By providing Python-based DSLs to the powerful CUTLASS C++ template library, it enables
faster iteration, easier prototyping, and a gentler learning curve for high-performance linear
algebra on NVIDIA GPUs.
Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs).
With the release of 4.0, we are releasing the first of these in CuTe DSL.
This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing
core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
Why CUTLASS DSLs?
============================
While CUTLASS offers exceptional performance through its C++ template abstractions, the complexity
can present challenges for many developers. CUTLASS 4.x addresses this by:
- **Simplifying metaprogramming**: Metaprogramming in Python is a lot more intuitive than with C++
- **Accelerating Iteration**: Rapid prototyping with familiar Python syntax and blazing fast compile times
- **Lowering Barriers**: Reduced learning curve for GPU programming concepts and consistency between CuTe C++ and DSL
- **Maintaining Performance**: Generated code leverages optimized CUTLASS primitives
Students can learn GPU programming concepts without the complexity of C++ templates.
Researchers and performance engineers can rapidly explore algorithms, prototype, and tune
kernels before moving to production implementations.
Key Concepts and Approach
================================
CUTLASS DSLs translate Python code into a custom intermediate representation (IR),
which is then Just-In-Time (JIT) compiled into optimized CUDA kernels using MLIR and `ptxas`.
Core CuTe DSL Abstractions
-----------------------------------
- **Layouts**: Describe how data is organized in memory and across threads.
- **Tensors**: Combine data pointers or iterators with layout metadata.
- **Atoms**: Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations**: Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
**Pythonic Kernel Expression**
Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.
The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
**JIT Compilation**
Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA's ``ptxas`` toolchain,
enabling rapid iteration and interactive debugging.
Relationship to CUTLASS C++
=================================
CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, it aims to be a high-productivity kernel
authoring framework that shares all concepts with CUTLASS 3.x C++ API such as CuTe, pipelines, schedulers etc.
- **Performance**: Generated kernels aim to match CUTLASS C++ kernels in performance; however, some performance gaps
may exist due to missing optimizations that have been added over the years to CUTLASS C++ and may be missing in the DSLs examples.
- **Library**: The CUTLASS DSLs do not currently ship with a full GEMM/Conv autotuning profiler or library interface
akin to CUTLASS C++. Instead, it focuses on generating and autotuning individual kernel instances (for example, via tile size exploration) and on native integration with DL frameworks that support auto-tuning.
Getting Started
================================
- :doc:`quick_start`: Initial setup and installation.
- :doc:`cute_dsl`: Overview of the typical development workflow using CuTe DSL.
- :doc:`cute_dsl_api`: Refer to the full API documentation.
- :doc:`limitations`: Understand current CuTe DSL constraints and differences from C++.
- :doc:`faqs`: Common questions and known issues.
Current Status & Roadmap
=================================
CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to
change as we improve the system.
Upcoming Milestones
----------------------------------
- Public release targeted for **Summer 2025**
- Expanded support for additional data types and kernel types
- Usability improvements: better error messages, debugging tools, and streamlined APIs
- Broader integration of CUTLASS primitives and features
For known issues and workarounds, please consult the :doc:`limitations` and :doc:`faqs`.
Community & Feedback
==================================
We welcome contributions and feedback from the developer community!
You can:
- Submit bug reports or feature requests via our `GitHub Issues page <https://github.com/NVIDIA/cutlass/issues>`__
- Join the CUTLASS community on `Discord <https://discord.com/channels/1019361803752456192/1150868614921064590>`__ to ask questions and share ideas
- Contribute examples, tutorials, or enhancements to the DSLs
- Report unclear or missing documentation
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues
Thank you for helping shape the future of CUTLASS DSLs!

View File

@ -0,0 +1,31 @@
.. _quick_start:
Quick Start Guide
=======================
The CUTLASS DSL 4.0 release currently supports **Linux** and **Python 3.12** only. To install CUTLASS DSLs (limited to CuTe DSL for now), use the command below.
Installation
-----------------------
To install the CUTLASS DSL, run:
.. code-block:: bash
pip install nvidia-cutlass-dsl
The ``nvidia-cutlass-dsl`` wheel includes everything needed to generate GPU kernels. It requires
the same NVIDIA driver version as the
`CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_.
To ensure compatibility with the examples and code on `GitHub <https://github.com/NVIDIA/cutlass/tree/main/python>`_,
use the ``requirements.txt`` file from the corresponding commit in the repository.
Recommended Dependencies
---------------------------------
To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter