v4.3 update. (#2709)
* v4.3 update.
* Update the cute_dsl_api changelog's doc link.
* Update version to 4.3.0.
* Update the example link.
* Update the docs to encourage users to install the DSL from requirements.txt.

Co-authored-by: Larry Wu <larwu@nvidia.com>
@ -151,7 +151,7 @@ For example,
* `(3,6,2,8) / 9 => (1,2,2,8)`
* `(3,6,2,8) / 72 => (1,1,1,4)`

To compute the strides of the strided layout, the residues of the above operation are used to scale the strides of `A`. For instance, the last example `(3,6,2,8):(w,x,y,z) / 72` with strides `(w,x,y,z)` produces `(72*w,24*x,4*y,2*z)` as the strides of the strided layout.

As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **stride divisibility condition** and is statically checked in CuTe when possible.
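The two divisions above, together with the residue-based stride scaling, can be reproduced with a small standalone sketch (plain Python for illustration, not the CuTe implementation; `shape_div` is a hypothetical helper name):

```python
# Plain-Python sketch (not CuTe): divide a shape mode-by-mode, tracking
# the residue that scales each mode's stride.
def shape_div(shape, divisor):
    new_shape, scales = [], []
    rest = divisor
    for extent in shape:
        scales.append(rest)  # residue scaling this mode's stride
        if rest >= extent:
            # The divisor must consume this mode exactly: stride divisibility.
            assert rest % extent == 0, "stride divisibility condition violated"
            new_shape.append(1)
            rest //= extent
        else:
            assert extent % rest == 0, "stride divisibility condition violated"
            new_shape.append(extent // rest)
            rest = 1
    return tuple(new_shape), tuple(scales)

print(shape_div((3, 6, 2, 8), 9))   # ((1, 2, 2, 8), (9, 3, 1, 1))
print(shape_div((3, 6, 2, 8), 72))  # ((1, 1, 1, 4), (72, 24, 4, 2))
```

With strides `(w,x,y,z)`, the residue tuple for the `72` case reproduces the scaled strides `(72*w,24*x,4*y,2*z)` quoted above.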
@ -388,7 +388,7 @@ Informally, `logical_divide(A, B)` splits a layout `A` into two modes -- in the
Formally, this can be written as

$$A \oslash B := A \circ (B,B^*)$$

and implemented as

```cpp
@ -8,6 +8,6 @@ CuTe DSL API
   changelog <cute_dsl_api/changelog.rst>
   cute <cute_dsl_api/cute.rst>
   cute_arch <cute_dsl_api/cute_arch.rst>
   cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>
   pipeline <cute_dsl_api/pipeline.rst>
   utils <cute_dsl_api/utils.rst>
@ -2,7 +2,32 @@
Changelog for CuTe DSL API changes
======================================

`4.3.0 <https://github.com/NVIDIA/cutlass/releases/tree/main>`_ (2025-10-07)
==============================================================================

* Debuggability improvements:

  - Supported source location tracking for DSL APIs
  - Supported dumping PTX and SASS code

* Removed deprecated ``cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` and ``cutlass.utils.ampere_helpers``
* Supported calling nested functions without capturing variables inside dynamic control flow
* Replaced usage of ``cute.arch.barrier`` in examples with the corresponding APIs in ``pipeline``:

  - Use ``pipeline.sync`` for simple cases such as synchronizing the whole CTA
  - Use ``pipeline.NamedBarrier`` to customize barriers with different participating threads and barrier ids

* Added new APIs ``repeat`` and ``repeat_as_tuple``
* Added new API ``make_rmem_tensor`` to replace ``make_fragment`` with better naming
* Added new API ``make_rmem_tensor_like``, which creates an rmem tensor from another tensor using the same shape with compact col-major strides
* Added ``TmemAllocator`` for allocating tensor memory
* Updated ``SmemAllocator.allocate`` to support allocation of a single scalar value
* Fixed ``TensorSSA.reduce`` to support a static value as the initial value
* Updated docstrings for the following APIs to be more concise and easier to understand:

  - ``make_layout_tv``
  - ``is_static``
  - ``PipelineAsync``
  - ``SmemAllocator``

* Fixed documentation for ``pipeline``, ``utils`` and ``cute.math``


`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-10)
==============================================================================

* Added back ``cute.make_tiled_copy`` per the request from the community
@ -40,7 +65,7 @@ Changelog for CuTe DSL API changes
- Introduce S2T CopyOps in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_.
- Introduce BlockScaled layout utilities in `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory.

* ``cutlass.cute.compile`` now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.
* ``cutlass.cute.testing.assert_`` now works for device JIT functions. Specify ``--enable-device-assertions`` as a compilation option to enable it.
* ``cutlass.cute.make_tiled_copy`` is now deprecated. Please use ``cutlass.cute.make_tiled_copy_tv`` instead.
* Shared memory capacity query
@ -9,3 +9,9 @@ cutlass.cute
   :show-inheritance:
   :special-members: __init__
   :private-members:

.. toctree::
   :maxdepth: 2
   :hidden:

   cute_arch
@ -1,17 +1,18 @@
.. _cute_arch:

arch
====

The ``cute.arch`` module provides lightweight wrappers for NVVM Operation builders which implement CUDA built-in
device functions such as ``thread_idx``. It integrates seamlessly with CuTe DSL types.
These wrappers enable source location tracking through the ``@dsl_user_op``
decorator. The module includes the following functionality:

- Core CUDA built-in functions such as ``thread_idx``, ``warp_idx``, ``block_dim``, ``grid_dim``, ``cluster_dim``, and related functions
- Memory barrier management functions including ``mbarrier_init``, ``mbarrier_arrive``, ``mbarrier_wait``, and associated operations
- Low-level shared memory (SMEM) management capabilities, with ``SmemAllocator`` as the recommended interface
- Low-level tensor memory (TMEM) management capabilities, with ``TmemAllocator`` as the recommended interface

API documentation
-----------------
media/docs/pythonDSL/cute_dsl_api/pipeline.rst (new file)
@ -0,0 +1,9 @@

cutlass.pipeline
================

.. automodule:: cutlass.pipeline
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__
   :private-members:
@ -1,9 +1,19 @@
cutlass.utils
=============

The ``cutlass.utils`` module contains utilities for developing kernels with CuTe DSL.

.. automodule:: cutlass.utils
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__
   :private-members:
   :exclude-members: sm90_make_smem_layout_a, sm90_make_smem_layout_b, sm90_make_smem_layout_epi

.. toctree::
   :maxdepth: 2
   :hidden:

   utils_sm90
   utils_sm100
media/docs/pythonDSL/cute_dsl_api/utils_sm100.rst (new file)
@ -0,0 +1,10 @@

.. _utils_sm100:

Utilities for SM100
===================

.. automodule:: cutlass.utils.sm100
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__
media/docs/pythonDSL/cute_dsl_api/utils_sm90.rst (new file)
@ -0,0 +1,10 @@

.. _utils_sm90:

Utilities for SM90
==================

.. automodule:: cutlass.utils.sm90
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__
@ -15,6 +15,14 @@ Understanding these limitations will help you avoid potential pitfalls from the
Please refer to :doc:`../limitations` for more details.


Source Code Correlation
-----------------------

CuTe DSL provides Python-to-PTX/SASS code correlation, enabling profiling and debugging of generated kernels with debug symbols by generating line info when compiling the kernel.

You can enable this globally via the environment variable ``CUTE_DSL_LINEINFO=1``. Alternatively, you can use compilation options to enable it per kernel. Please refer to :doc:`./dsl_jit_compilation_options` for more details.


DSL Debugging
-------------
@ -75,6 +83,48 @@ This helps you verify whether the IR is generated as expected.
   export CUTE_DSL_KEEP_IR=1


Dump the generated PTX & CUBIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For users familiar with PTX and SASS, CuTe DSL supports dumping the generated PTX and CUBIN.

.. code:: bash

   # Dump generated PTX in a .ptx file (default: False)
   export CUTE_DSL_KEEP_PTX=1

   # Dump generated cubin in a .cubin file (default: False)
   export CUTE_DSL_KEEP_CUBIN=1

To further obtain SASS from a cubin, use ``nvdisasm`` (usually installed with the CUDA Toolkit) to disassemble it.

.. code:: bash

   nvdisasm your_dsl_code.cubin > your_dsl_code.sass


Access the dumped contents programmatically
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For compiled kernels, the generated PTX/CUBIN/IR can also be accessed programmatically through the following attributes:

- ``__ptx__``: The generated PTX code of the compiled kernel.
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.

.. code:: python

   compiled_foo = cute.compile(foo, ...)
   print(f"PTX: {compiled_foo.__ptx__}")
   with open("foo.cubin", "wb") as f:
       f.write(compiled_foo.__cubin__)


Change the dump directory
~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all dumped files are saved in the current working directory. To specify a different directory for the dumped files, set the environment variable ``CUTE_DSL_DUMP_DIR`` accordingly.


Kernel Functional Debugging
---------------------------
@ -122,6 +172,7 @@ For detecting memory errors and race conditions:
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.


Conclusion
----------
||||
@ -124,7 +124,7 @@ JIT function arguments with |CUSTOM_TYPES|
- ``__extract_mlir_values__``: Generate a dynamic expression for the current object.
- ``__new_from_mlir_values__``: Create a new object from MLIR values.

Refer to `typing.py <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL/base_dsl/typing.py>`__ for more details on these protocol APIs.

Depending on the specifics of the |CUSTOM_TYPES|, |DSL| provides easy ways to adopt |CUSTOM_TYPES| for JIT function arguments.
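As a rough sketch of the shape of this protocol (a toy type for illustration only; the exact signatures live in typing.py, and the real DSL traffics in MLIR values rather than plain Python numbers):

```python
# Toy illustration of the two protocol hooks named above. In the real DSL
# the extracted values are MLIR values; plain Python floats stand in here.
class Complex:
    """A hypothetical custom type used as a JIT function argument."""

    def __init__(self, real, imag):
        self.real = real
        self.imag = imag

    def __extract_mlir_values__(self):
        # Flatten this object into the list of dynamic values the JIT sees.
        return [self.real, self.imag]

    def __new_from_mlir_values__(self, values):
        # Rebuild an object of the same type from a list of values.
        return Complex(values[0], values[1])


c = Complex(1.0, 2.0)
print(c.__extract_mlir_values__())    # [1.0, 2.0]
restored = c.__new_from_mlir_values__([3.0, 4.0])
print(restored.real, restored.imag)   # 3.0 4.0
```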
@ -18,9 +18,11 @@ Compilation options allow you to customize how your JIT-compiled functions are b
These options can be passed as keyword arguments to ``cute.compile`` or set globally for all JIT compilations. The available options and their effects are described in the following sections, along with usage examples to help you get started.

The |DSL| provides multiple ways to specify compilation options: either by specifying additional arguments to ``cute.compile`` or by using a more Pythonic approach with separate Python types for ``cute.compile``.

``cute.compile`` Compilation Options as strings
-----------------------------------------------

You can provide additional compilation options as a string when calling ``cute.compile``. The |DSL| uses ``argparse`` to parse these options and will raise an error if any invalid options are specified.
@ -36,10 +38,30 @@ You can provide additional compilation options as a string when calling ``cute.c
     - Optimization level of compilation. The higher the level, the more optimizations are applied. The valid value range is [0, 3].
     - 3 (highest level of optimization)
     - int
   * - ``enable-assertions``
     - Enable host and device code assertions.
     - False
     - bool
   * - ``keep-cubin``
     - Keep the generated CUBIN file.
     - False
     - bool
   * - ``keep-ptx``
     - Keep the generated PTX file.
     - False
     - bool
   * - ``ptxas-options``
     - The options to pass to the PTX Compiler library.
     - ""
     - str
   * - ``generate-line-info``
     - Generate line information for debugging.
     - False
     - bool
   * - ``gpu-arch``
     - The GPU architecture to compile for.
     - ""
     - str

You can use the following code to specify compilation options:
@ -47,4 +69,34 @@ You can use the following code to specify compilation options:
   jit_executor_with_opt_level_2 = cute.compile(add, 1, 2, options="--opt-level 2")
   jit_executor_with_opt_level_1 = cute.compile(add, 1, 2, options="--opt-level 1")
   jit_executor_with_enable_assertions = cute.compile(add, 1, 2, options="--enable-assertions")
   jit_executor_with_keep_cubin = cute.compile(add, 1, 2, options="--keep-cubin")
   jit_executor_with_keep_ptx = cute.compile(add, 1, 2, options="--keep-ptx")
   jit_executor_with_ptxas_options = cute.compile(add, 1, 2, options="--ptxas-options '--opt-level=2'")

``cute.compile`` Compilation Options as separate Python types
-------------------------------------------------------------

Alternatively, you can use a more Pythonic way to specify compilation options with separate Python types.
Compilation options can be programmatically composed using a tuple and passed to ``cute.compile`` separately.

.. code-block:: python

   from cutlass.cute import OptLevel, EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX, PtxasOptions

   my_debugging_options = (OptLevel(1), EnableAssertions, GenerateLineInfo, KeepCUBIN, KeepPTX)
   compiled_kernel_1 = cute.compile[my_debugging_options](my_kernel_1, ...)
   compiled_kernel_2 = cute.compile[my_debugging_options](my_kernel_2, ...)

This approach causes invalid options to raise errors immediately, making it much easier to detect typos when specifying multiple options.
Notably, boolean option types are automatically converted to instances with value ``True`` for convenience.

.. code-block:: python

   jit_executor_with_opt_level_2 = cute.compile[OptLevel(2)](add, 1, 2)
   jit_executor_with_opt_level_1 = cute.compile[OptLevel(1)](add, 1, 2)
   jit_executor_with_enable_assertions = cute.compile[EnableAssertions](add, 1, 2)
   jit_executor_with_keep_cubin = cute.compile[KeepCUBIN](add, 1, 2)
   jit_executor_with_keep_ptx = cute.compile[KeepPTX](add, 1, 2)
   jit_executor_with_ptxas_options = cute.compile[PtxasOptions("--opt-level=2")](add, 1, 2)
@ -63,7 +63,7 @@ The full signature of from_dlpack is as follows:
.. code-block:: python

   def from_dlpack(tensor, assumed_align=None, use_32bit_stride=False):

The ``assumed_align`` integer parameter specifies the alignment of the tensor in units of bytes.
The tensor's base address must be divisible by ``assumed_align``. When not provided explicitly,
@ -72,6 +72,13 @@ information is part of the pointer type in the generated IR. Therefore, programs
alignments have a different IR, and identical IRs are required for hitting the kernel caching
mechanism of |DSL|.

The ``use_32bit_stride`` parameter determines whether to use 32-bit strides for the tensor's dynamic stride values.
By default, it is set to False (64-bit) to ensure that address calculations do not risk overflow. For smaller
problem sizes (where ``cosize(layout_of_tensor) <= Int32_MAX``), users may set it to True (32-bit) to improve performance
by reducing register usage and the number of address calculation instructions. When ``use_32bit_stride`` is set
to True, a runtime check is performed to ensure that the layout does not overflow. Please note that this parameter
only has an effect when the tensor's layout is marked as dynamic.
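The ``cosize``-based criterion above can be illustrated with a standalone sketch (hypothetical helper names, not the DSL's internal runtime check):

```python
# Sketch of the 32-bit-stride safety criterion: the layout's cosize (one
# past the largest linear offset it can address) must fit in Int32.
INT32_MAX = 2**31 - 1

def cosize(shape, strides):
    return 1 + sum((s - 1) * d for s, d in zip(shape, strides))

def safe_for_32bit_stride(shape, strides):
    return cosize(shape, strides) <= INT32_MAX

# A 1024 x 1024 row-major tensor is comfortably within range ...
print(safe_for_32bit_stride((1024, 1024), (1024, 1)))       # True
# ... while a 10**9 x 10**9 tensor overflows Int32.
print(safe_for_32bit_stride((10**9, 10**9), (10**9, 1)))    # False
```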
Code Example
~~~~~~~~~~~~
@ -242,6 +249,10 @@ The following example demonstrates how to use ``mark_layout_dynamic`` to specify
   t7 = from_dlpack(b).mark_layout_dynamic(leading_dim=3)
   # Expected strides[leading_dim] == 1, but got 4

   c = torch.empty(1000000000, 1000000000)
   t8 = from_dlpack(c, use_32bit_stride=True).mark_layout_dynamic()
   # Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.

Mark the Tensor's Layout as Dynamic with ``mark_compact_shape_dynamic``
-----------------------------------------------------------------------
@ -398,6 +409,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
   )
   # The stride_order is not consistent with the layout

   c = torch.empty(1000000000, 1000000000)
   t13 = from_dlpack(c, use_32bit_stride=True).mark_compact_shape_dynamic(
       mode=0, divisibility=1
   )
   # Layout in DLTensorWrapper has int32 overflow risk. Please set use_32bit_stride to False.


Bypass the DLPack Protocol
--------------------------
@ -8,7 +8,15 @@ The CUTLASS DSL 4.0 release currently supports **Linux** and **Python 3.12** onl
Installation
-----------------------

To ensure compatibility with the examples and code on `GitHub <https://github.com/NVIDIA/cutlass/tree/main>`_,
use the `requirements.txt <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/requirements.txt>`_ file from the corresponding commit in the repository.

.. code-block:: bash

   git clone https://github.com/NVIDIA/cutlass.git
   pip install -r cutlass/python/CuTeDSL/requirements.txt

If you just want to try out the last known stable release of the CUTLASS DSL (which may not be compatible with the latest examples and code), run:

.. code-block:: bash
@ -18,9 +26,6 @@ The ``nvidia-cutlass-dsl`` wheel includes everything needed to generate GPU kern
the same NVIDIA driver version as the
`CUDA Toolkit 12.9 <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_.

Recommended Dependencies
---------------------------------