v4.0 update. (#2371)

This commit is contained in:
Junkai-Wu
2025-06-06 14:39:20 +08:00
committed by GitHub
parent 2e2af190bd
commit 8bdbfca682
254 changed files with 29751 additions and 1980 deletions


@ -6,12 +6,12 @@ CuTe DSL
.. toctree::
:maxdepth: 1
DSL Introduction <cute_dsl_general/dsl_introduction.rst>
DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Introduction <cute_dsl_general/dsl_introduction.rst>
Code Generation <cute_dsl_general/dsl_code_generation.rst>
Control Flow <cute_dsl_general/dsl_control_flow.rst>
JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
Integration with Frameworks <cute_dsl_general/framework_integration.rst>
Debugging with the DSL <cute_dsl_general/debugging.rst>
Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>


@ -3,10 +3,6 @@
Guidance for Auto-Tuning
=============================
.. contents:: Table of Contents
:depth: 2
:local:
Numerous GEMM kernel code examples are provided in our codebase.
When integrating these kernels into frameworks, auto-tuning becomes essential
for achieving optimal performance. This involves selecting the appropriate


@ -3,10 +3,6 @@
Debugging
=========
.. contents:: Table of Contents
:depth: 2
:local:
This page provides an overview of debugging techniques and tools for CuTe DSL programs.


@ -6,10 +6,6 @@
End-to-End Code Generation
==========================
.. contents::
:depth: 2
:local:
1. Techniques for Turning Python into |IR|
------------------------------------------


@ -4,11 +4,8 @@
.. |DSL| replace:: CuTe DSL
.. |Constexpr| replace:: **Constexpr** (compile-time Python value)
|DSL| Control Flow
Control Flow
==================
.. contents::
:depth: 2
:local:
Overview


@ -3,10 +3,6 @@
.. |SLAY| replace:: static layout
.. |DLAY| replace:: dynamic layout
.. contents:: Table of Contents
:depth: 2
:local:
Static vs Dynamic layouts
=========================


@ -4,12 +4,9 @@
.. |DSL| replace:: CuTe DSL
|DSL|
Introduction
======================
.. contents:: Table of Contents
:depth: 2
:local:
Overview
--------


@ -2,12 +2,9 @@
.. |DSL| replace:: CuTe DSL
.. |CUSTOM_TYPES| replace:: customized types
|DSL| JIT Function Argument Generation
JIT Function Argument Generation
=======================================
.. contents:: Table of Contents
:depth: 2
:local:
In a nutshell
--------------
@ -39,7 +36,7 @@ By default, |DSL| assumes dynamic arguments and tries to infer the argument type
import cutlass.cute as cute
@cute.jit
def foo(x: cutlass.Int32, y: cute.Constexpr):
def foo(x: cutlass.Int32, y: cutlass.Constexpr):
print("x = ", x) # Prints x = ?
print("y = ", y) # Prints y = 2
cute.printf("x: {}", x) # Prints x: 2
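The distinction above, between a ``Constexpr`` value that is fixed when the function is traced and a dynamic value that is only known at run time, can be illustrated with a plain-Python analogy. This is a hedged sketch, not the CuTe DSL implementation: the ``specialize`` helper and ``foo`` below are hypothetical names used only for illustration.

```python
def specialize(func, **constexpr):
    # Bake "compile-time" (Constexpr-like) values into a specialized
    # callable; the remaining arguments stay dynamic and are supplied
    # at call time, mirroring how a Constexpr is fixed at trace time.
    def specialized(*args, **kwargs):
        return func(*args, **constexpr, **kwargs)
    return specialized

def foo(x, y):
    # y plays the role of a Constexpr: fixed when we specialize,
    # while x remains a dynamic, call-time argument.
    return x * y

foo_times_2 = specialize(foo, y=2)
print(foo_times_2(21))
```

In the DSL proper, changing a ``Constexpr`` argument triggers re-tracing of the function, whereas dynamic arguments do not.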


@ -3,11 +3,9 @@
.. _JIT_Caching:
|DSL| JIT Caching
JIT Caching
====================
.. contents:: Table of Contents
:depth: 2
:local:
Zero Compile and JIT Executor
-----------------------------
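The caching idea behind a JIT executor can be sketched in plain Python. This is a hedged, minimal sketch under the assumption that compilation is keyed by the type signature of the dynamic arguments; it is not the CuTe DSL implementation, and ``jit``/``add`` here are illustrative names only.

```python
def jit(func):
    # Cache "compiled" callables keyed by the types of the arguments:
    # a new compilation happens only when the type signature changes,
    # so repeated calls with the same signature hit the cache.
    cache = {}
    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            cache[key] = func  # stand-in for an actual compile step
        return cache[key](*args)
    wrapper._cache = cache
    return wrapper

@jit
def add(x, y):
    return x + y

add(1, 2)
add(3, 4)      # same (int, int) signature: no new cache entry
add(1.0, 2.0)  # new (float, float) signature: second cache entry
```

With this scheme, the second integer call reuses the cached entry, which is the "zero compile" behavior the section name refers to.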


@ -4,10 +4,6 @@
Integration with Frameworks
=============================
.. contents:: Table of Contents
:depth: 2
:local:
In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
`DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
frameworks to CuTe tensors. The present page documents the conventions, the API available to the
@ -257,8 +253,7 @@ layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:
The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
For modes that have a shape of size 1, their stride are canonicalized to 0.
updated accordingly. For modes that have a shape of size 1, their strides are canonicalized to 0.
The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
@ -322,10 +317,6 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
import torch
from cutlass.cute.runtime import from_dlpack
@cute.jit
def kernel(t: cute.Tensor):
pass
# (8,4,16,2):(2,16,64,1)
a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
# (1,4,1,32,1):(4,1,4,4,4) => in a torch tensor, when a dimension has shape 1, its stride is degenerated to 1,
@ -337,14 +328,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
t0 = from_dlpack(a).mark_compact_shape_dynamic(
mode=0, divisibility=2
)
kernel(t0)
# (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
print(t0)
t1 = from_dlpack(a).mark_compact_shape_dynamic(
mode=1, divisibility=2
)
kernel(t1)
# (8,?{div=2},16,2):(2,16,?{div=32},1)
print(t1)
@ -353,21 +342,18 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
).mark_compact_shape_dynamic(
mode=3, divisibility=2
)
kernel(t2)
# (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
print(t2)
t3 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
)
kernel(t3)
# (1,4,?,32,1):(0,1,4,?{div=4},0)
print(t3)
t4 = from_dlpack(b).mark_compact_shape_dynamic(
mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
)
kernel(t4)
# (1,4,?,32,1):(0,1,128,4,0)
print(t4)


@ -124,7 +124,8 @@ Technical
License
---------------------
**Q:What is the license for CuTe DSL and the associated GitHub samples?**
**What is the license for CuTe DSL and the associated GitHub samples?**
CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
Because the pip package includes a compiler that shares several components with the CUDA Toolkit,


@ -3,9 +3,6 @@
Limitations
====================
.. contents::
:depth: 2
:local:
Overview
---------------------


@ -42,7 +42,7 @@ Core CuTe DSL Abstractions
- **Atoms** Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
- **Tiled Operations** Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.
**Pythonic Kernel Expression**


@ -29,3 +29,12 @@ To run examples and begin development, we recommend installing:
.. code-block:: bash
pip install torch jupyter
Recommended Python environment variable for Jupyter notebooks
--------------------------------------------------------------
We recommend setting the following environment variable when running Jupyter notebooks.
.. code-block:: bash
export PYTHONUNBUFFERED=1