v4.3 update. (#2709)

* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
This commit is contained in:
Junkai-Wu
2025-10-22 02:26:30 +08:00
committed by GitHub
parent e6e2cc29f5
commit b1d6e2c9b3
244 changed files with 59272 additions and 10455 deletions


@ -151,7 +151,7 @@ For example,
* `(3,6,2,8) / 9 => (1,2,2,8)`
* `(3,6,2,8) / 72 => (1,1,1,4)`
To compute the strides of the strided layout, the residues of the above operation are used to scale the strides of `A`. For instance, the last example `(3,6,2,8):(w,x,y,z) / 72` with strides `(w,x,y,z)` produces `(72*w,24*x,4*y,2*z)` as the strides of the strided layout.
As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **stride divisibility condition** and is statically checked in CuTe when possible.
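The division rule above can be sketched in a few lines of Python (an illustrative model, not the CuTe implementation): the divisor is carried across the modes as a residue, each residue scales the corresponding stride, and the assertions mirror the stride divisibility condition.

```python
def shape_div(shape, strides, n):
    """Divide a layout's shape by n, mode by mode, carrying the residue.

    Returns the divided shape and the residue-scaled strides, following the
    rule described above. Illustrative only -- not the CuTe implementation.
    """
    out_shape, out_strides = [], []
    residue = n
    for s, d in zip(shape, strides):
        out_strides.append(residue * d)   # the residue scales the stride
        if residue >= s:
            assert residue % s == 0       # stride divisibility condition
            out_shape.append(1)
            residue //= s
        else:
            assert s % residue == 0
            out_shape.append(s // residue)
            residue = 1
    return tuple(out_shape), tuple(out_strides)

# (3,6,2,8) / 9 => (1,2,2,8)  and  (3,6,2,8) / 72 => (1,1,1,4)
print(shape_div((3, 6, 2, 8), (1, 3, 18, 36), 9)[0])   # (1, 2, 2, 8)
print(shape_div((3, 6, 2, 8), (1, 1, 1, 1), 72)[1])    # (72, 24, 4, 2)
```

With unit strides, the scaled strides `(72, 24, 4, 2)` match the symbolic result `(72*w,24*x,4*y,2*z)` above.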
@ -388,7 +388,7 @@ Informally, `logical_divide(A, B)` splits a layout `A` into two modes -- in the
Formally, this can be written as
$$A \oslash B := A \circ (B,B^*)$$
and implemented as
```cpp


@ -8,6 +8,6 @@ CuTe DSL API
changelog <cute_dsl_api/changelog.rst>
cute <cute_dsl_api/cute.rst>
cute_arch <cute_dsl_api/cute_arch.rst>
cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>
pipeline <cute_dsl_api/pipeline.rst>
utils <cute_dsl_api/utils.rst>


@ -2,7 +2,32 @@
Changelog for CuTe DSL API changes
======================================
`4.3.0 <https://github.com/NVIDIA/cutlass/releases/tree/main>`_ (2025-10-07)
==============================================================================
* Debuggability improvements:
- Supported source location tracking for DSL APIs
- Supported dumping PTX and SASS code
* Removed deprecated ``cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` and ``cutlass.utils.ampere_helpers``
* Added support for calling nested functions without capturing variables inside dynamic control flow
* Replaced usage of ``cute.arch.barrier`` in examples with corresponding APIs in ``pipeline``
- Use ``pipeline.sync`` for simple cases like synchronizing the whole CTA
- Use ``pipeline.NamedBarrier`` to customize barriers with different participating threads and barrier id
* Added new APIs ``repeat`` and ``repeat_as_tuple``
* Added new API ``make_rmem_tensor`` to replace ``make_fragment`` with better naming
* Added new API ``make_rmem_tensor_like``, which creates an rmem tensor from an existing tensor using the same shape with compact col-major strides
* Added ``TmemAllocator`` for allocating tensor memory
* Updated ``SmemAllocator.allocate`` to support allocation of a single scalar value
* Fixed ``TensorSSA.reduce`` to support static value as initial value
* Updated docstrings for the following APIs to be more concise and easier to understand:
- ``make_layout_tv``
- ``is_static``
- ``PipelineAsync``
- ``SmemAllocator``
* Fixed documentation for ``pipeline``, ``utils`` and ``cute.math``
`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-10)
==============================================================================
* Added back ``cute.make_tiled_copy`` per the request from community
@ -40,7 +65,7 @@ Changelog for CuTe DSL API changes
- Introduce S2T CopyOps in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_.
- Introduce BlockScaled layout utilities in `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory.
* ``cutlass.cute.compile`` now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.
* ``cutlass.cute.testing.assert_`` now works for device JIT function. Specify ``--enable-device-assertions`` as compilation option to enable.
* ``cutlass.cute.make_tiled_copy`` is now deprecated. Please use ``cutlass.cute.make_tiled_copy_tv`` instead.
* Shared memory capacity query


@ -9,3 +9,9 @@ cutlass.cute
:show-inheritance:
:special-members: __init__
:private-members:
.. toctree::
:maxdepth: 2
:hidden:
cute_arch


@ -1,17 +1,18 @@
.. _cute_arch:
arch
====

The ``cute.arch`` module provides lightweight wrappers for NVVM Operation builders which implement CUDA built-in
device functions such as ``thread_idx``. It integrates seamlessly with CuTe DSL types.

These wrappers enable source location tracking through the ``@dsl_user_op``
decorator. The module includes the following functionality:

- Core CUDA built-in functions such as ``thread_idx``, ``warp_idx``, ``block_dim``, ``grid_dim``, ``cluster_dim``, and related functions
- Memory barrier management functions including ``mbarrier_init``, ``mbarrier_arrive``, ``mbarrier_wait``, and associated operations
- Low-level shared memory (SMEM) management capabilities, with ``SmemAllocator`` as the recommended interface
- Low-level tensor memory (TMEM) management capabilities, with ``TmemAllocator`` as the recommended interface
API documentation
-----------------


@ -0,0 +1,9 @@
cutlass.pipeline
================
.. automodule:: cutlass.pipeline
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:


@ -1,9 +1,19 @@
cutlass.utils
=============
The ``cutlass.utils`` module contains utilities for developing kernels with CuTe DSL.
.. automodule:: cutlass.utils
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:
:exclude-members: sm90_make_smem_layout_a, sm90_make_smem_layout_b, sm90_make_smem_layout_epi
.. toctree::
:maxdepth: 2
:hidden:
utils_sm90
utils_sm100


@ -0,0 +1,10 @@
.. _utils_sm100:
Utilities for SM100
===================
.. automodule:: cutlass.utils.sm100
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__


@ -0,0 +1,10 @@
.. _utils_sm90:
Utilities for SM90
==================
.. automodule:: cutlass.utils.sm90
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__


@ -15,6 +15,14 @@ Understanding these limitations will help you avoid potential pitfalls from the
Please refer to :doc:`../limitations` for more details.
Source Code Correlation
-----------------------
CuTe DSL provides Python-to-PTX/SASS source correlation, which enables profiling and debugging of generated kernels with debug symbols by emitting line info when compiling the kernel.

You can enable this globally via the environment variable ``CUTE_DSL_LINEINFO=1``. Alternatively, you can use compilation options to enable it per kernel. Please refer to :doc:`./dsl_jit_compilation_options` for more details.
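For example, to enable line info for every kernel compiled in the current shell session:

```shell
# Generate line info for all subsequently compiled kernels (default: off)
export CUTE_DSL_LINEINFO=1
```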
DSL Debugging
-------------
@ -75,6 +83,48 @@ This helps you verify whether the IR is generated as expected.
export CUTE_DSL_KEEP_IR=1
Dump the generated PTX & CUBIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For users familiar with PTX and SASS, CuTe DSL supports dumping the generated PTX and CUBIN.
.. code:: bash
# Dump generated PTX in a .ptx file (default: False)
export CUTE_DSL_KEEP_PTX=1
# Dump generated cubin in a .cubin file (default: False)
export CUTE_DSL_KEEP_CUBIN=1
To obtain SASS from the cubin, users can use ``nvdisasm`` (usually installed with the CUDA Toolkit) to disassemble it.
.. code:: bash
nvdisasm your_dsl_code.cubin > your_dsl_code.sass
Access the dumped contents programmatically
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For compiled kernels, the generated PTX/CUBIN/IR can also be accessed programmatically through the following attributes:
- ``__ptx__``: The generated PTX code of the compiled kernel.
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.
.. code:: python
compiled_foo = cute.compile(foo, ...)
print(f"PTX: {compiled_foo.__ptx__}")
with open("foo.cubin", "wb") as f:
f.write(compiled_foo.__cubin__)
Change the dump directory
~~~~~~~~~~~~~~~~~~~~~~~~~
By default, all dumped files are saved in the current working directory. To specify a different directory for the dumped files, set the environment variable ``CUTE_DSL_DUMP_DIR`` accordingly.
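For example, to collect all dumped IR/PTX/CUBIN files under a single directory (the path below is only an illustration):

```shell
# Redirect all CuTe DSL dump files away from the current working directory
export CUTE_DSL_DUMP_DIR=/tmp/dsl_dumps
```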
Kernel Functional Debugging
----------------------------
@ -122,6 +172,7 @@ For detecting memory errors and race conditions:
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.
Conclusion
----------


@ -124,7 +124,7 @@ JIT function arguments with |CUSTOM_TYPES|
- ``__extract_mlir_values__``: Generate a dynamic expression for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
Refer to `typing.py <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL/base_dsl/typing.py>`__ for more details on these protocol APIs.
Depending on different cases of the |CUSTOM_TYPES|, |DSL| provides easy ways to adopt |CUSTOM_TYPES| for JIT function arguments.


@ -18,9 +18,11 @@ Compilation options allow you to customize how your JIT-compiled functions are b
These options can be passed as keyword arguments to ``cute.compile`` or set globally for all JIT compilations. The available options and their effects are described in the following sections, along with usage examples to help you get started.
The |DSL| provides multiple ways to specify compilation options - either by specifying additional arguments to ``cute.compile`` or by using a more Pythonic approach with separate Python types for ``cute.compile``.
``cute.compile`` Compilation Options as strings
-----------------------------------------------
You can provide additional compilation options as a string when calling ``cute.compile``. The |DSL| uses ``argparse`` to parse these options and will raise an error if any invalid options are specified.
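This parsing behavior can be illustrated in plain Python (a sketch with a hypothetical subset of the options, not the DSL's actual parser): ``argparse`` accepts known flags and rejects misspelled ones rather than silently ignoring them.

```python
import argparse

# Hypothetical subset of the compilation options described below
parser = argparse.ArgumentParser()
parser.add_argument("--opt-level", type=int, default=3)
parser.add_argument("--enable-assertions", action="store_true")

ns = parser.parse_args("--opt-level 2".split())
assert ns.opt_level == 2 and ns.enable_assertions is False

# A misspelled option is rejected with an error
try:
    parser.parse_args(["--opt-levle", "1"])
except SystemExit:
    print("invalid option rejected")
```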
@ -36,10 +38,30 @@ You can provide additional compilation options as a string when calling ``cute.c
- Optimization level of compilation. The higher the level, the more optimizations are applied. The valid value range is [0, 3].
- 3 (highest level of optimization)
- int
* - ``enable-assertions``
- Enable host and device code assertions.
- False
- bool
* - ``keep-cubin``
- Keep the generated CUBIN file.
- False
- bool
* - ``keep-ptx``
- Keep the generated PTX file.
- False
- bool
* - ``ptxas-options``
- The options to pass to the PTX Compiler library.
- ""
- str
* - ``generate-line-info``
- Generate line information for debugging.
- False
- bool
* - ``gpu-arch``
- The GPU architecture to compile for.
- ""
- str
You can use the following code to specify compilation options:
@ -47,4 +69,34 @@ You can use the following code to specify compilation options:
jit_executor_with_opt_level_2 = cute.compile(add, 1, 2, options="--opt-level 2")
jit_executor_with_opt_level_1 = cute.compile(add, 1, 2, options="--opt-level 1")
jit_executor_with_enable_assertions = cute.compile(add, 1, 2, options="--enable-assertions")
jit_executor_with_keep_cubin = cute.compile(add, 1, 2, options="--keep-cubin")
jit_executor_with_keep_ptx = cute.compile(add, 1, 2, options="--keep-ptx")
jit_executor_with_ptxas_options = cute.compile(add, 1, 2, options="--ptxas-options '--opt-level=2'")
``cute.compile`` Compilation Options as separate Python types
-------------------------------------------------------------
Alternatively, you can use a more Pythonic way to specify compilation options with separate Python types.
Compilation options can be composed programmatically as a tuple and passed to ``cute.compile`` separately.
.. code-block:: python
from cutlass.cute import OptLevel, EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX
my_debugging_options = (OptLevel(1), EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX)
compiled_kernel_1 = cute.compile[my_debugging_options](my_kernel_1, ...)
compiled_kernel_2 = cute.compile[my_debugging_options](my_kernel_2, ...)
This approach causes invalid options to raise errors immediately, making it much easier to detect typos when specifying multiple options.
Notably, boolean options are automatically converted to ``True``-valued instances of the option type for convenience.
.. code-block:: python
jit_executor_with_opt_level_2 = cute.compile[OptLevel(2)](add, 1, 2)
jit_executor_with_opt_level_1 = cute.compile[OptLevel(1)](add, 1, 2)
jit_executor_with_enable_assertions = cute.compile[EnableAssertions](add, 1, 2)
jit_executor_with_keep_cubin = cute.compile[KeepCUBIN](add, 1, 2)
jit_executor_with_keep_ptx = cute.compile[KeepPTX](add, 1, 2)
jit_executor_with_ptxas_options = cute.compile[PtxasOptions("--opt-level=2")](add, 1, 2)


@ -63,7 +63,7 @@ The full signature of from_dlpack is as follows:
.. code-block:: python
def from_dlpack(tensor, assumed_align=None, use_32bit_stride=False):
The ``assumed_align`` integer parameter specifies the alignment of the tensor in unit of bytes.
The tensor's base address must be divisible by ``assumed_align``. When not provided explicitly,
@ -72,6 +72,13 @@ information is part of the pointer type in the generated IR. Therefore, programs
alignments have a different IR and identical IRs are required for hitting the kernel caching
mechanism of |DSL|.
The ``use_32bit_stride`` parameter determines whether to use 32-bit strides for the tensor's dynamic stride values.
By default, it is set to False (64-bit) to ensure that address calculations do not risk overflow. For smaller
problem sizes (where ``cosize(layout_of_tensor) <= Int32_MAX``), users may set it to True (32-bit) to improve performance
by reducing register usage and the number of address-calculation instructions. When ``use_32bit_stride`` is set
to True, a runtime check is performed to ensure that the layout does not overflow. Note that this parameter
only has an effect when the tensor's layout is marked as dynamic.
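The overflow condition can be illustrated with a small Python model of ``cosize`` (the largest linear offset a layout can address, plus one); the helper names here are illustrative, not the DSL's API.

```python
INT32_MAX = 2**31 - 1  # upper bound for 32-bit stride arithmetic

def cosize(shape, strides):
    # Largest linear offset addressed by the layout, plus one
    return 1 + sum((s - 1) * d for s, d in zip(shape, strides))

def fits_32bit_stride(shape, strides):
    return cosize(shape, strides) <= INT32_MAX

# A compact 1000000000 x 1000000000 layout overflows int32 ...
print(fits_32bit_stride((1000000000, 1000000000), (1000000000, 1)))  # False
# ... while a typical tile-sized layout does not
print(fits_32bit_stride((1024, 1024), (1024, 1)))  # True
```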
Code Example
~~~~~~~~~~~~
@ -242,6 +249,10 @@ The following example demonstrates how to use ``mark_layout_dynamic`` to specify
t7 = from_dlpack(b).mark_layout_dynamic(leading_dim=3)
# Expected strides[leading_dim] == 1, but got 4
c = torch.empty(1000000000, 1000000000)
t8 = from_dlpack(c, use_32bit_stride=True).mark_layout_dynamic()
# Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.
Mark the Tensor's Layout as Dynamic with ``mark_compact_shape_dynamic``
-----------------------------------------------------------------------
@ -398,6 +409,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
)
# The stride_order is not consistent with the layout
c = torch.empty(1000000000, 1000000000)
t13 = from_dlpack(c, use_32bit_stride=True).mark_compact_shape_dynamic(
mode=0, divisibility=1
)
# Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.
Bypass the DLPack Protocol
--------------------------


@ -8,7 +8,15 @@ The CUTLASS DSL 4.0 release currently supports **Linux** and **Python 3.12** onl
Installation
-----------------------
To ensure compatibility with the examples and code on `GitHub <https://github.com/NVIDIA/cutlass/tree/main>`_,
use the `requirements.txt <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/requirements.txt>`_ file from the corresponding commit in the repository.
.. code-block:: bash
git clone https://github.com/NVIDIA/cutlass.git
pip install -r cutlass/python/CuTeDSL/requirements.txt
If you just want to try out the last known stable release of the CUTLASS DSL (which may not be compatible with the latest examples and code), run:
.. code-block:: bash
@ -18,9 +26,6 @@ The ``nvidia-cutlass-dsl`` wheel includes everything needed to generate GPU kern
the same NVIDIA driver version as the
`CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_.
Recommended Dependencies
---------------------------------