v4.3 update. (#2709)

* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
This commit is contained in:
Junkai-Wu
2025-10-22 02:26:30 +08:00
committed by GitHub
parent e6e2cc29f5
commit b1d6e2c9b3
244 changed files with 59272 additions and 10455 deletions


@ -151,7 +151,7 @@ For example,
* `(3,6,2,8) / 9 => (1,2,2,8)`
* `(3,6,2,8) / 72 => (1,1,1,4)`
To compute the strides of the strided layout, the residues of the above operation are used to scale the strides of `A`. For instance, the last example `(3,6,2,8):(w,x,y,z) / 72` with strides `(w,x,y,z)` produces `(72*w,24*x,4*y,2*z)` as the strides of the strided layout.
As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **stride divisibility condition** and is statically checked in CuTe when possible.
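The division rule above can be sketched in a few lines of Python (an illustrative model, not the CuTe implementation): the divisor is carried across the modes as a residue, each residue scales the corresponding stride, and the assertions mirror the stride divisibility condition.

```python
def shape_div(shape, strides, n):
    """Divide a layout's shape by n, mode by mode, carrying the residue.

    Returns the divided shape and the residue-scaled strides, following the
    rule described above. Illustrative only -- not the CuTe implementation.
    """
    out_shape, out_strides = [], []
    residue = n
    for s, d in zip(shape, strides):
        out_strides.append(residue * d)   # the residue scales the stride
        if residue >= s:
            assert residue % s == 0       # stride divisibility condition
            out_shape.append(1)
            residue //= s
        else:
            assert s % residue == 0
            out_shape.append(s // residue)
            residue = 1
    return tuple(out_shape), tuple(out_strides)

# (3,6,2,8) / 9 => (1,2,2,8)  and  (3,6,2,8) / 72 => (1,1,1,4)
print(shape_div((3, 6, 2, 8), (1, 3, 18, 36), 9)[0])   # (1, 2, 2, 8)
print(shape_div((3, 6, 2, 8), (1, 1, 1, 1), 72)[1])    # (72, 24, 4, 2)
```

With unit strides, the scaled strides `(72, 24, 4, 2)` match the symbolic result `(72*w,24*x,4*y,2*z)` above.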
@ -388,7 +388,7 @@ Informally, `logical_divide(A, B)` splits a layout `A` into two modes -- in the
Formally, this can be written as
$$A \oslash B := A \circ (B,B^*)$$
and implemented as
```cpp


@ -8,6 +8,6 @@ CuTe DSL API
changelog <cute_dsl_api/changelog.rst>
cute <cute_dsl_api/cute.rst>
cute_arch <cute_dsl_api/cute_arch.rst>
cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>
pipeline <cute_dsl_api/pipeline.rst>
utils <cute_dsl_api/utils.rst>


@ -2,7 +2,32 @@
Changelog for CuTe DSL API changes
======================================
`4.3.0 <https://github.com/NVIDIA/cutlass/releases/tree/main>`_ (2025-10-07)
==============================================================================
* Debuggability improvements:
- Supported source location tracking for DSL APIs
- Supported dumping PTX and SASS code
* Removed deprecated ``cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` and ``cutlass.utils.ampere_helpers``
* Added support for calling nested functions without capturing variables inside dynamic control flow
* Replaced usage of ``cute.arch.barrier`` in examples with corresponding APIs in ``pipeline``
- Use ``pipeline.sync`` for simple cases like synchronizing the whole CTA
- Use ``pipeline.NamedBarrier`` to customize barriers with different participating threads and barrier id
* Added new APIs ``repeat`` and ``repeat_as_tuple``
* Added new API ``make_rmem_tensor`` to replace ``make_fragment`` with better naming
* Added new API ``make_rmem_tensor_like``, which creates an rmem tensor from an existing tensor using the same shape with compact col-major strides
* Added ``TmemAllocator`` for allocating tensor memory
* Updated ``SmemAllocator.allocate`` to support allocation of a single scalar value
* Fixed ``TensorSSA.reduce`` to support static value as initial value
* Updated docstrings for the following APIs to be more concise and easier to understand:
- ``make_layout_tv``
- ``is_static``
- ``PipelineAsync``
- ``SmemAllocator``
* Fixed documentation for ``pipeline``, ``utils`` and ``cute.math``
`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-10)
==============================================================================
* Added back ``cute.make_tiled_copy`` per the request from community
@ -40,7 +65,7 @@ Changelog for CuTe DSL API changes
- Introduce S2T CopyOps in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_.
- Introduce BlockScaled layout utilities in `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory.
* ``cutlass.cute.compile`` now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.
* ``cutlass.cute.testing.assert_`` now works for device JIT function. Specify ``--enable-device-assertions`` as compilation option to enable.
* ``cutlass.cute.make_tiled_copy`` is now deprecated. Please use ``cutlass.cute.make_tiled_copy_tv`` instead.
* Shared memory capacity query


@ -9,3 +9,9 @@ cutlass.cute
:show-inheritance:
:special-members: __init__
:private-members:
.. toctree::
:maxdepth: 2
:hidden:
cute_arch


@ -1,17 +1,18 @@
.. _cute_arch:
arch
====

The ``cute.arch`` module provides lightweight wrappers for NVVM Operation builders which implement CUDA built-in
device functions such as ``thread_idx``. It integrates seamlessly with CuTe DSL types.

These wrappers enable source location tracking through the ``@dsl_user_op``
decorator. The module includes the following functionality:

- Core CUDA built-in functions such as ``thread_idx``, ``warp_idx``, ``block_dim``, ``grid_dim``, ``cluster_dim``, and related functions
- Memory barrier management functions including ``mbarrier_init``, ``mbarrier_arrive``, ``mbarrier_wait``, and associated operations
- Low-level shared memory (SMEM) management capabilities, with ``SmemAllocator`` as the recommended interface
- Low-level tensor memory (TMEM) management capabilities, with ``TmemAllocator`` as the recommended interface
API documentation
-----------------


@ -0,0 +1,9 @@
cutlass.pipeline
================
.. automodule:: cutlass.pipeline
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:


@ -1,9 +1,19 @@
cutlass.utils
=============
The ``cutlass.utils`` module contains utilities for developing kernels with CuTe DSL.
.. automodule:: cutlass.utils
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__
:private-members:
:exclude-members: sm90_make_smem_layout_a, sm90_make_smem_layout_b, sm90_make_smem_layout_epi
.. toctree::
:maxdepth: 2
:hidden:
utils_sm90
utils_sm100


@ -0,0 +1,10 @@
.. _utils_sm100:
Utilities for SM100
===================
.. automodule:: cutlass.utils.sm100
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__


@ -0,0 +1,10 @@
.. _utils_sm90:
Utilities for SM90
==================
.. automodule:: cutlass.utils.sm90
:members:
:undoc-members:
:show-inheritance:
:special-members: __init__


@ -15,6 +15,14 @@ Understanding these limitations will help you avoid potential pitfalls from the
Please refer to :doc:`../limitations` for more details.
Source Code Correlation
-----------------------
CuTe DSL provides Python-to-PTX/SASS source correlation, which enables profiling and debugging of generated kernels with debug symbols by emitting line info when compiling the kernel.

You can enable this globally via the environment variable ``CUTE_DSL_LINEINFO=1``. Alternatively, you can use compilation options to enable it per kernel. Please refer to :doc:`./dsl_jit_compilation_options` for more details.
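For example, to enable line info for every kernel compiled in the current shell session:

```shell
# Generate line info for all subsequently compiled kernels (default: off)
export CUTE_DSL_LINEINFO=1
```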
DSL Debugging
-------------
@ -75,6 +83,48 @@ This helps you verify whether the IR is generated as expected.
export CUTE_DSL_KEEP_IR=1
Dump the generated PTX & CUBIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For users familiar with PTX and SASS, CuTe DSL supports dumping the generated PTX and CUBIN.
.. code:: bash
# Dump generated PTX in a .ptx file (default: False)
export CUTE_DSL_KEEP_PTX=1
# Dump generated cubin in a .cubin file (default: False)
export CUTE_DSL_KEEP_CUBIN=1
To obtain SASS from the cubin, users can use ``nvdisasm`` (usually installed with the CUDA Toolkit) to disassemble it.
.. code:: bash
nvdisasm your_dsl_code.cubin > your_dsl_code.sass
Access the dumped contents programmatically
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For compiled kernels, the generated PTX/CUBIN/IR can also be accessed programmatically through the following attributes:
- ``__ptx__``: The generated PTX code of the compiled kernel.
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.
.. code:: python
compiled_foo = cute.compile(foo, ...)
print(f"PTX: {compiled_foo.__ptx__}")
with open("foo.cubin", "wb") as f:
f.write(compiled_foo.__cubin__)
Change the dump directory
~~~~~~~~~~~~~~~~~~~~~~~~~
By default, all dumped files are saved in the current working directory. To specify a different directory for the dumped files, set the environment variable ``CUTE_DSL_DUMP_DIR`` accordingly.
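For example, to collect all dumped IR/PTX/CUBIN files under a single directory (the path below is only an illustration):

```shell
# Redirect all CuTe DSL dump files away from the current working directory
export CUTE_DSL_DUMP_DIR=/tmp/dsl_dumps
```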
Kernel Functional Debugging
----------------------------
@ -122,6 +172,7 @@ For detecting memory errors and race conditions:
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.
Conclusion
----------


@ -124,7 +124,7 @@ JIT function arguments with |CUSTOM_TYPES|
- ``__extract_mlir_values__``: Generate a dynamic expression for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.
Refer to `typing.py <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL/base_dsl/typing.py>`__ for more details on these protocol APIs.
Depending on different cases of the |CUSTOM_TYPES|, |DSL| provides easy ways to adopt |CUSTOM_TYPES| for JIT function arguments.


@ -18,9 +18,11 @@ Compilation options allow you to customize how your JIT-compiled functions are b
These options can be passed as keyword arguments to ``cute.compile`` or set globally for all JIT compilations. The available options and their effects are described in the following sections, along with usage examples to help you get started.
The |DSL| provides multiple ways to specify compilation options - either by specifying additional arguments to ``cute.compile`` or by using a more Pythonic approach with separate Python types for ``cute.compile``.
``cute.compile`` Compilation Options as strings
-----------------------------------------------
You can provide additional compilation options as a string when calling ``cute.compile``. The |DSL| uses ``argparse`` to parse these options and will raise an error if any invalid options are specified.
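This parsing behavior can be illustrated in plain Python (a sketch with a hypothetical subset of the options, not the DSL's actual parser): ``argparse`` accepts known flags and rejects misspelled ones rather than silently ignoring them.

```python
import argparse

# Hypothetical subset of the compilation options described below
parser = argparse.ArgumentParser()
parser.add_argument("--opt-level", type=int, default=3)
parser.add_argument("--enable-assertions", action="store_true")

ns = parser.parse_args("--opt-level 2".split())
assert ns.opt_level == 2 and ns.enable_assertions is False

# A misspelled option is rejected with an error
try:
    parser.parse_args(["--opt-levle", "1"])
except SystemExit:
    print("invalid option rejected")
```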
@ -36,10 +38,30 @@ You can provide additional compilation options as a string when calling ``cute.c
- Optimization level of compilation. The higher the level, the more optimizations are applied. The valid value range is [0, 3].
- 3 (highest level of optimization)
- int
* - ``enable-assertions``
- Enable host and device code assertions.
- False
- bool
* - ``keep-cubin``
- Keep the generated CUBIN file.
- False
- bool
* - ``keep-ptx``
- Keep the generated PTX file.
- False
- bool
* - ``ptxas-options``
- The options to pass to the PTX Compiler library.
- ""
- str
* - ``generate-line-info``
- Generate line information for debugging.
- False
- bool
* - ``gpu-arch``
- The GPU architecture to compile for.
- ""
- str
You can use the following code to specify compilation options:
@ -47,4 +69,34 @@ You can use the following code to specify compilation options:
jit_executor_with_opt_level_2 = cute.compile(add, 1, 2, options="--opt-level 2")
jit_executor_with_opt_level_1 = cute.compile(add, 1, 2, options="--opt-level 1")
jit_executor_with_enable_assertions = cute.compile(add, 1, 2, options="--enable-assertions")
jit_executor_with_keep_cubin = cute.compile(add, 1, 2, options="--keep-cubin")
jit_executor_with_keep_ptx = cute.compile(add, 1, 2, options="--keep-ptx")
jit_executor_with_ptxas_options = cute.compile(add, 1, 2, options="--ptxas-options '--opt-level=2'")
``cute.compile`` Compilation Options as separate Python types
-------------------------------------------------------------
Alternatively, you can use a more Pythonic way to specify compilation options with separate Python types.
Compilation options can be composed programmatically as a tuple and passed to ``cute.compile`` separately.
.. code-block:: python
from cutlass.cute import OptLevel, EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX
my_debugging_options = (OptLevel(1), EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX)
compiled_kernel_1 = cute.compile[my_debugging_options](my_kernel_1, ...)
compiled_kernel_2 = cute.compile[my_debugging_options](my_kernel_2, ...)
This approach causes invalid options to raise errors immediately, making it much easier to detect typos when specifying multiple options.
Notably, boolean options are automatically converted to ``True``-valued instances of the option type for convenience.
.. code-block:: python
jit_executor_with_opt_level_2 = cute.compile[OptLevel(2)](add, 1, 2)
jit_executor_with_opt_level_1 = cute.compile[OptLevel(1)](add, 1, 2)
jit_executor_with_enable_assertions = cute.compile[EnableAssertions](add, 1, 2)
jit_executor_with_keep_cubin = cute.compile[KeepCUBIN](add, 1, 2)
jit_executor_with_keep_ptx = cute.compile[KeepPTX](add, 1, 2)
jit_executor_with_ptxas_options = cute.compile[PtxasOptions("--opt-level=2")](add, 1, 2)


@ -63,7 +63,7 @@ The full signature of from_dlpack is as follows:
.. code-block:: python
def from_dlpack(tensor, assumed_align=None, use_32bit_stride=False):
The ``assumed_align`` integer parameter specifies the alignment of the tensor in unit of bytes.
The tensor's base address must be divisible by ``assumed_align``. When not provided explicitly,
@ -72,6 +72,13 @@ information is part of the pointer type in the generated IR. Therefore, programs
alignments have a different IR and identical IRs are required for hitting the kernel caching
mechanism of |DSL|.
The ``use_32bit_stride`` parameter determines whether to use 32-bit strides for the tensor's dynamic stride values.
By default, it is set to False (64-bit) to ensure that address calculations do not risk overflow. For smaller
problem sizes (where ``cosize(layout_of_tensor) <= Int32_MAX``), users may set it to True (32-bit) to improve performance
by reducing register usage and the number of address-calculation instructions. When ``use_32bit_stride`` is set
to True, a runtime check is performed to ensure that the layout does not overflow. Note that this parameter
only has an effect when the tensor's layout is marked as dynamic.
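The overflow condition can be illustrated with a small Python model of ``cosize`` (the largest linear offset a layout can address, plus one); the helper names here are illustrative, not the DSL's API.

```python
INT32_MAX = 2**31 - 1  # upper bound for 32-bit stride arithmetic

def cosize(shape, strides):
    # Largest linear offset addressed by the layout, plus one
    return 1 + sum((s - 1) * d for s, d in zip(shape, strides))

def fits_32bit_stride(shape, strides):
    return cosize(shape, strides) <= INT32_MAX

# A compact 1000000000 x 1000000000 layout overflows int32 ...
print(fits_32bit_stride((1000000000, 1000000000), (1000000000, 1)))  # False
# ... while a typical tile-sized layout does not
print(fits_32bit_stride((1024, 1024), (1024, 1)))  # True
```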
Code Example
~~~~~~~~~~~~
@ -242,6 +249,10 @@ The following example demonstrates how to use ``mark_layout_dynamic`` to specify
t7 = from_dlpack(b).mark_layout_dynamic(leading_dim=3)
# Expected strides[leading_dim] == 1, but got 4
c = torch.empty(1000000000, 1000000000)
t8 = from_dlpack(c, use_32bit_stride=True).mark_layout_dynamic()
# Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.
Mark the Tensor's Layout as Dynamic with ``mark_compact_shape_dynamic``
-----------------------------------------------------------------------
@ -398,6 +409,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
)
# The stride_order is not consistent with the layout
c = torch.empty(1000000000, 1000000000)
t13 = from_dlpack(c, use_32bit_stride=True).mark_compact_shape_dynamic(
mode=0, divisibility=1
)
# Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.
Bypass the DLPack Protocol
--------------------------


@ -8,7 +8,15 @@ The CUTLASS DSL 4.0 release currently supports **Linux** and **Python 3.12** onl
Installation
-----------------------
To ensure compatibility with the examples and code on `GitHub <https://github.com/NVIDIA/cutlass/tree/main>`_,
use the `requirements.txt <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/requirements.txt>`_ file from the corresponding commit in the repository.
.. code-block:: bash
git clone https://github.com/NVIDIA/cutlass.git
pip install -r cutlass/python/CuTeDSL/requirements.txt
If you just want to try out the last known stable release of the CUTLASS DSL (which may not be compatible with the latest examples and code), run:
.. code-block:: bash
@ -18,9 +26,6 @@ The ``nvidia-cutlass-dsl`` wheel includes everything needed to generate GPU kern
the same NVIDIA driver version as the
`CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_.
Recommended Dependencies
---------------------------------