v4.0 update. (#2371)

This commit is contained in:
Junkai-Wu
2025-06-06 14:39:20 +08:00
committed by GitHub
parent 2e2af190bd
commit 8bdbfca682
254 changed files with 29751 additions and 1980 deletions


@ -6,12 +6,12 @@ CuTe DSL
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Introduction <cute_dsl_general/dsl_introduction.rst>
Code Generation <cute_dsl_general/dsl_code_generation.rst>
Control Flow <cute_dsl_general/dsl_control_flow.rst>
JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>


@ -3,10 +3,6 @@
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are provided in our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate


@ -3,10 +3,6 @@
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.


@ -6,10 +6,6 @@
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------


@ -4,11 +4,8 @@
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
Control Flow
==================
.. contents::
:depth: 2
:local:
Overview


@ -3,10 +3,6 @@
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================


@ -4,12 +4,9 @@
.. |DSL| replace:: CuTe DSL
|DSL|
Introduction
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------


@ -2,12 +2,9 @@
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
@ -39,7 +36,7 @@ By default, |DSL| assumes dynamic arguments and tries to infer the argument type
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
def foo(x: cutlass.Int32, y: cutlass.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2
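The distinction above, between a ``Constexpr`` value that is fixed when the function is traced and a dynamic value that is only known at run time, can be illustrated with a plain-Python analogy. This is a hedged sketch, not the CuTe DSL implementation: the ``specialize`` helper and ``foo`` below are hypothetical names used only for illustration.

```python
def specialize(func, **constexpr):
    # Bake "compile-time" (Constexpr-like) values into a specialized
    # callable; the remaining arguments stay dynamic and are supplied
    # at call time, mirroring how a Constexpr is fixed at trace time.
    def specialized(*args, **kwargs):
        return func(*args, **constexpr, **kwargs)
    return specialized

def foo(x, y):
    # y plays the role of a Constexpr: fixed when we specialize,
    # while x remains a dynamic, call-time argument.
    return x * y

foo_times_2 = specialize(foo, y=2)
print(foo_times_2(21))
```

In the DSL proper, changing a ``Constexpr`` argument triggers re-tracing of the function, whereas dynamic arguments do not.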


@ -3,11 +3,9 @@
.. _JIT_Caching:
|DSL| JIT Caching
JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------
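The caching idea behind a JIT executor can be sketched in plain Python. This is a hedged, minimal sketch under the assumption that compilation is keyed by the type signature of the dynamic arguments; it is not the CuTe DSL implementation, and ``jit``/``add`` here are illustrative names only.

```python
def jit(func):
    # Cache "compiled" callables keyed by the types of the arguments:
    # a new compilation happens only when the type signature changes,
    # so repeated calls with the same signature hit the cache.
    cache = {}
    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            cache[key] = func  # stand-in for an actual compile step
        return cache[key](*args)
    wrapper._cache = cache
    return wrapper

@jit
def add(x, y):
    return x + y

add(1, 2)
add(3, 4)      # same (int, int) signature: no new cache entry
add(1.0, 2.0)  # new (float, float) signature: second cache entry
```

With this scheme, the second integer call reuses the cached entry, which is the "zero compile" behavior the section name refers to.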


@ -4,10 +4,6 @@
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
@ -257,8 +253,7 @@ layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their stride are canonicalized to 0.
updated accordingly. For modes that have a shape of size 1, their strides are canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
@ -322,10 +317,6 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => in a torch tensor, when a dimension has shape 1, its stride is degenerated to 1,
@ -337,14 +328,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
@ -353,21 +342,18 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)


@ -124,7 +124,8 @@ Technical
License
---------------------
**Q:What is the license for CuTe DSL and the associated GitHub samples?**
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,


@ -3,9 +3,6 @@
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------


@ -42,7 +42,7 @@ Core CuTe DSL Abstractions
- **Atoms** Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations** Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.
**Pythonic Kernel Expression**


@ -29,3 +29,12 @@ To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter
Recommended Python environment variable for Jupyter notebooks
--------------------------------------------------------------
We recommend setting the following environment variable when running Jupyter notebooks.
.. code-block:: bash
export PYTHONUNBUFFERED=1