v4.2 tag release. (#2638)

This commit is contained in:
Junkai-Wu
2025-09-16 00:21:53 +08:00
committed by GitHub
parent 56f0718a97
commit 6a35b4d22f
161 changed files with 14056 additions and 3793 deletions

View File

@ -79,7 +79,7 @@ Instruction shape levels control the selection of WGMMA shapes used in kernel ge
- **Level 2**: Includes shapes that are powers of 2.
- **Level 3**: Includes all other shapes.
The detailed defination of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).
The detailed definition of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).
Schedule pruning levels decide the epilogue schedule and mainloop schedule to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_utils.py),
@ -122,6 +122,55 @@ For each mixed dtype kernel, the kernel generator will generate combinations of
For {4-bit dtype, 8-bit dtype} x 16-bit dtype combinations, the kernel generator will further generate kernels using shuffled layouts for the narrow-data-type matrix, which may perform better than their non-shuffled counterparts.
## Instantiating more kernels with Blackwell
Blackwell (SM100) and Blackwell Ultra similarly support
`CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
Because generating and filtering these kernels alone can take hours, `CUTLASS_LIBRARY_KERNELS`
must be non-empty.
Also exercise caution: not all of these configurations are tested, and some may fail to
compile or fail to launch at runtime.
```bash
$ cmake .. \
-DCUTLASS_NVCC_ARCHS="100f" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
-DCUTLASS_UNITY_BUILD_ENABLED=ON
```
The CUTLASS profiler uses the same four-digit integer level (global instantiation level) mechanism to manage the generation of kernel configurations for Blackwell as well:
0. **Instruction Shape**
1. **MMA Shape Multiplier**
2. **Cluster Shape**
3. **Data Type and Schedule Pruning**
Note that the MMA shape multiplier is not needed for Blackwell kernels, since Blackwell kernels do not have distinct
ping-pong or cooperative schedules. The profiler ignores this digit when instantiating.
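For illustration, a specific four-digit level can be passed in place of `max`. The value below is a hypothetical setting that requests level 1 for every category; consult the generator scripts referenced at the end of this section for the exact digit-to-category mapping.
```bash
# Hypothetical example: request level 1 for every category instead of "max".
# See sm100_shapes.py (linked below) for the authoritative digit-to-category mapping.
$ cmake .. \
  -DCUTLASS_NVCC_ARCHS="100f" \
  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_*" \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="1111"
```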
Cluster shape levels define the cluster shapes, i.e. the numbers of CTAs (Cooperative Thread Arrays), included in kernel generation:
- **Level 0**: Only dynamic cluster shapes.
- **Level 1**: `(1, 1, 1)` for 1SM kernels and `(2, 1, 1)` for 2SM kernels.
- **Level 2**: Also includes `(1, 2, 1)` for 1SM kernels, and `(2, 2, 1)` and `(4, 1, 1)` for 2SM kernels.
- **Level 3**: Includes `(1, 4, 1)` for 1SM kernels, and `(2, 4, 1)` and `(4, 2, 1)` for 2SM kernels.
- **Level 4**: Includes `(4, 4, 1)` for 1SM kernels and `(4, 4, 1)` for 2SM kernels.
- **Level 5**: Includes `(2, 1, 1)` for 1SM kernels.
- **Level 6**: Includes `(2, 2, 1)` and `(4, 1, 1)` for 1SM kernels, and `(8, 1, 1)` for 2SM kernels.
- **Level 7**: Includes `(2, 4, 1)` and `(4, 2, 1)` for 1SM kernels.
- **Level 8**: Includes `(1, 8, 1)` and `(8, 1, 1)` for 1SM kernels.
Instruction shape levels control the selection of MMA shapes used in kernel generation:
- **Level 0**: Generates the "default" shape only.
- **Level 1**: Includes additional shapes for FP8, FP6, and FP4 as well as MX and NVFP4.
- **Level 2**: Includes small tile shapes.
- **Level 3**: Includes some non-power-of-2 shapes.
- **Level 4**: Includes additional small tile shapes and non-power-of-2 shapes.
- **Level 5**: Includes all shapes.
The detailed definitions of the instantiation levels controlling cluster shape and instruction shape can be found in [sm100_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm100_shapes.py).
## CUTLASS Profiler usage
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
@ -577,6 +626,10 @@ cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_
* `f16_f16_f16_void_f16`: In this case, the C type is set to `void`, indicating that residual matrix support
is disabled.
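As a hedged illustration (the problem sizes are arbitrary placeholders), kernels following this naming pattern can be selected for profiling with a wildcard filter:
```bash
# Sketch only: profile GEMM kernels whose names match the pattern discussed above.
$ cutlass_profiler --operation=gemm \
    --kernels="cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16*" \
    --m=4096 --n=4096 --k=4096
```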
## Further Documentation
For documentation on profiling blockwise and groupwise (software-scaled) GEMMs, see the [example 81 README](https://github.com/NVIDIA/cutlass/blob/main/examples/81_blackwell_gemm_blockwise/README.md).
# Convolution
The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards

View File

@ -6,6 +6,7 @@ CuTe DSL API
.. toctree::
:maxdepth: 1
changelog <cute_dsl_api/changelog.rst>
cute <cute_dsl_api/cute.rst>
cute_arch <cute_dsl_api/cute_arch.rst>
cute_nvgpu <cute_dsl_api/cute_nvgpu.rst>

View File

@ -0,0 +1,54 @@
======================================
Changelog for CuTe DSL API changes
======================================
`4.2.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0>`_ (2025-09-15)
==============================================================================
* Added back ``cute.make_tiled_copy`` per community request
* Added support for explicit and implicit broadcast in ``TensorSSA``
- ``cutlass.cute.TensorSSA``: support ``broadcast_to`` and implicit broadcasting for binary operations.
* Supported printing ``TensorSSA`` values in ``cutlass.cute.print_tensor``
* Updated ``cute.gemm`` to support all dispatch patterns and improved checks for illegal inputs
* Introduced automatic kernel smem usage calculation for the launch config.
* Introduced per-op fast-math control for math ops (e.g. ``exp``, ``exp2``, ``log2``, ``log``)
* Introduced ``CopyReduceBulkTensorTileS2GOp`` in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_ to support TMA Reduce.
`4.1.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.1.0>`_ (2025-07-16)
==============================================================================
* for loop
- Python built-in ``range`` now always generates code and executes at runtime
- ``cutlass.range`` is an advanced ``range`` with kernel-code-level unrolling and pipelining control
- Deprecated ``cutlass.range_dynamic``; please replace it with ``range`` or ``cutlass.range``
- **Experimental** Added ``pipelining`` control for compiler-generated software pipeline code
* while/if
- ``while``/``if`` now by default generate code and execute at runtime unless ``cutlass.const_expr`` is specified for the predicate
- Deprecated ``cutlass.dynamic_expr``; please remove it
* Renamed mbarrier functions to reduce ambiguity
* Modified the SyncObject API (``MbarrierArray``, ``NamedBarrier``, ``TmaStoreFence``) to match ``std::barrier``
* Changed the pipeline ``create`` function to take only keyword arguments and made ``barrier_storage`` optional.
* Introduced the ``cutlass.cute.arch.get_dyn_smem_size`` API to get the runtime dynamic shared memory size.
* Added various API support for SM100 BlockScaled Gemm
- Introduced BlockScaled MmaOps in `tcgen05/mma.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py>`_, and provided a ``make_blockscaled_trivial_tiled_mma`` function in `blackwell_helpers.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blackwell_helpers.py>`_ to help construct a BlockScaled TiledMma.
- Introduced S2T CopyOps in `tcgen05/copy.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py>`_.
- Introduced BlockScaled layout utilities in `blockscaled_layout.py <https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/cutlass/utils/blockscaled_layout.py>`_ for creating the required scale factor layouts in global memory, shared memory and tensor memory.
* ``cutlass.cute.compile`` now supports compilation options. Refer to `JIT compilation options <https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_jit_compilation_options.html>`_ for more details.
* ``cutlass.cute.testing.assert_`` now works for device JIT functions. Specify ``--enable-device-assertions`` as a compilation option to enable it.
* ``cutlass.cute.make_tiled_copy`` is now deprecated. Please use ``cutlass.cute.make_tiled_copy_tv`` instead.
* Shared memory capacity query
- Introduced ``cutlass.utils.get_smem_capacity_in_bytes`` for querying the shared memory capacity.
- ``<arch>_utils.SMEM_CAPACITY["<arch_str>"]`` is now deprecated.
`4.0.0 <https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0>`_ (2025-06-03)
==============================================================================
* Fixed an API mismatch in class ``cute.runtime.Pointer``: changed ``element_type`` to ``dtype`` to match ``typing.Pointer``

View File

@ -72,6 +72,55 @@ All loop indices must be |Constexpr|.
    for i in cutlass.range(bound, unroll=2):
        cute.printf("%d\\n", i)
Software Pipelining
~~~~~~~~~~~~~~~~~~~
Software pipelining is a technique used to optimize loops. Typically, this involves writing a prefetch loop and a main loop.
.. code-block:: python
    @cute.jit
    def example():
        ...
        # build a circular buffer
        buffer = ...

        # prefetch loop
        for i in range(prefetch_stages):
            cute.copy(atom, gmem[i], buffer[i], ...)

        # main loop
        for i in range(bound):
            if i + prefetch_stages < bound:
                cute.copy(atom, gmem[i + prefetch_stages], buffer[(i + prefetch_stages) % total_stages], ...)
            use(buffer[i % total_stages])
        ...
This can be tedious to write and tune. |DSL| provides a loop attribute that asks the compiler to do this automatically.
.. code-block:: python
    @cute.jit
    def example():
        ...
        # build a circular buffer
        buffer = ...

        for i in cutlass.range(bound, prefetch_stages=prefetch_stages):
            # Compiler automatically handles the pipelining:
            # - Generates prefetch loop for initial stages
            # - In main loop, prefetches future data while using current data
            cute.copy(atom, gmem[i], buffer[i % total_stages], ...)
            use(buffer[i % total_stages])  # Uses data from previous iterations
        ...
The compiler will automatically generate a prefetch loop with ``prefetch_stages`` iterations and a corresponding main loop.
This feature is experimental and only supported on sm90 and above.
If-Else Statements
------------------

View File

@ -7,7 +7,8 @@ Integration with Frameworks
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
user, and provide example code snippets for common usage patterns.
user, and provides example code snippets for common usage patterns. We also provide a section on how to
bypass the DLPack protocol and call the JIT function directly.
Implicit Conversion
-------------------
@ -396,3 +397,84 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
mode=0, divisibility=1, stride_order=(2, 1, 3, 0, 4)
)
# The stride_order is not consistent with the layout
Bypass the DLPack Protocol
--------------------------
In certain scenarios, users may wish to bypass the DLPack protocol and invoke the JIT function directly.
This can be accomplished by creating a lightweight JIT wrapper around the existing JIT function,
utilizing ``cute.Pointer`` and ``cute.make_tensor`` to pass pointers and construct tensors directly.
Typical use cases for bypassing DLPack include:
1. Users want to call the JIT function directly to avoid the overhead introduced by the DLPack protocol.
2. DLPack canonicalizes the stride of shape-1 dimensions to 1, which may result in incorrect alignment
propagation and affect memory access or performance.
3. DLPack may lack support for some narrow data types.
The following example illustrates how to bypass the DLPack protocol when invoking a JIT function.
Assume we have a pre-defined ``TensorOpGemm`` kernel whose JIT interface expects three
arguments of type ``cute.Tensor``. To enable direct invocation without DLPack, we first define a JIT wrapper
function that accepts ``cute.Pointer`` types as parameters. Within this wrapper, we use ``cute.make_tensor``
to construct tensors from the provided pointers, and then call the ``TensorOpGemm`` kernel as usual.
.. code-block:: python
    @cute.jit
    def tensor_op_gemm_wrapper(
        a_ptr: cute.Pointer,
        b_ptr: cute.Pointer,
        c_ptr: cute.Pointer,
        m: cutlass.Int32,
        n: cutlass.Int32,
        k: cutlass.Int32,
        l: cutlass.Int32,
    ):
        # Assume alignment of shape to call tensorop_gemm example
        m = cute.assume(m, divby=8)
        n = cute.assume(n, divby=8)

        # Torch is row major
        a_layout = cute.make_ordered_layout((m, k, l), order=(0, 1, 2))
        b_layout = cute.make_ordered_layout((n, k, l), order=(0, 1, 2))
        c_layout = cute.make_ordered_layout((m, n, l), order=(1, 0, 2))

        mA = cute.make_tensor(a_ptr, layout=a_layout)
        mB = cute.make_tensor(b_ptr, layout=b_layout)
        mC = cute.make_tensor(c_ptr, layout=c_layout)

        # TensorOpGemm is a pre-defined kernel from our example
        tensor_op_gemm = TensorOpGemm(
            a_ptr.value_type, c_ptr.value_type, cutlass.Float32, (2, 2, 1)
        )
        tensor_op_gemm(mA, mB, mC)
To pass a PyTorch tensor to this new JIT wrapper, we retrieve the raw pointer from the PyTorch tensor
and create a ``cute.Pointer`` instance using ``make_ptr`` from ``cutlass.cute.runtime``.
This approach allows us to bypass the DLPack protocol entirely, avoiding its overhead and potential
issues with shape-1 dimension handling.
.. code-block:: python
    a = torch.randn(
        m, k, l, dtype=torch.float16, device="cuda"
    ).permute(2, 1, 0)
    b = torch.randn(
        n, k, l, dtype=torch.float16, device="cuda"
    ).permute(2, 1, 0)
    c = torch.randn(
        n, m, l, dtype=torch.float16, device="cuda"
    ).permute(1, 2, 0)

    # from cutlass.cute.runtime import make_ptr
    a_ptr = make_ptr(
        cutlass.Float16, a.data_ptr(), cute.AddressSpace.gmem, assumed_align=32
    )
    b_ptr = make_ptr(
        cutlass.Float16, b.data_ptr(), cute.AddressSpace.gmem, assumed_align=32
    )
    c_ptr = make_ptr(
        cutlass.Float16, c.data_ptr(), cute.AddressSpace.gmem, assumed_align=32
    )

    tensor_op_gemm_wrapper(a_ptr, b_ptr, c_ptr, m, n, k, l)