v4.1 release

This commit is contained in:
Junkai-Wu
2025-07-03 20:07:53 +08:00
committed by GitHub
parent b995f93317
commit a1aaf2300a
155 changed files with 18407 additions and 6068 deletions

@ -10,109 +10,130 @@ Control Flow
Overview
--------
|DSL| walks Pythons AST and converts each control-flow construct it finds into
|DSL| walks Python's AST and converts each control-flow construct it finds into
structured |IR|. You can therefore write ordinary Python loops and branches
while the compiler decides—statement by statement—whether to
* **evaluate at compile time** if the controlling value is a |Constexpr|, or
* **emit intermediate representation (IR)** when the value is dynamic.
* **evaluate at compile time** if it's a native Python control flow, or
* **emit intermediate representation (IR)** when the control flow is marked as dynamic.
Passing |IR| values to a native Python control flow will result in an error.
For a high-level discussion of the overall pipeline, see
:doc:`the code-generation overview <dsl_code_generation>`.
For Loops
---------
|DSL| recognises three kinds of ranges for ``for`` loops:
* ``range`` the Python built-in
* ``cutlass.range_dynamic`` always lowers to |IR|
* ``cutlass.range_constexpr`` always unrolls at compile time
* ``range`` the Python built-in, always lowered to |IR|
* ``cutlass.range`` - Same as Python built-in ``range``, but supports advanced unrolling and pipelining control
* ``cutlass.range_constexpr`` unrolled at compile time
range(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The AST rewriter inserts a small helper stub. At runtime the loop bounds are
inspected:
* **Constant bounds** → the loop is unrolled at compile time.
* **Dynamic bounds** → the loop is emitted as structured |IR|.
cutlass.range_dynamic(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use when you *always* want a loop in the generated |IR|, even if the bounds
look constant.
range(...)/cutlass.range(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use when you *always* want a loop in the generated |IR|, even if the inputs
are Python values.
cutlass.range_constexpr(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Runs in the Python interpreter and is fully unrolled before code generation.
All loop indices must be |Constexpr|.
Limitations of Dynamic For Loops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Early-exit ``break``, ``continue``, or raising exception are not yet supported.
* Operations in the loop body are traced only when tracing is active in that
region.
**Example:**
.. code-block:: python
@cute.jit
def loop_example():
n = 10
@cute.jit
def control_flow_examples(bound: cutlass.Int32):
n = 10
# This loop is dynamic, early-exit isn't allowed.
for i in cutlass.range_dynamic(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
# This loop is Python loop, evaluated at compile time.
for i in cutlass.range_constexpr(n):
cute.printf("%d\\n", i)
# ✅ This loop is dynamic, even when bound is Python value.
for i in range(n):
cute.printf("%d\\n", i)
# ❌ This loop bound is a dynamic value, not allowed in Python loop.
# Should use `range` instead.
for i in cutlass.range_constexpr(bound):
cute.printf("%d\\n", i)
# ✅ This loop is dynamic, emitted IR loop.
for i in range(bound):
cute.printf("%d\\n", i)
# ✅ This loop is dynamic, emitted IR loop with unrolling
for i in cutlass.range(bound, unroll=2):
cute.printf("%d\\n", i)
# ✅ This loop is constexpr, early-exit is allowed.
for i in cutlass.range_constexpr(n):
if i == 5:
break # Early-exit
cute.printf("%d\\n", i)
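The difference between an unrolled loop and an emitted loop can be sketched with a toy tracer in plain Python. This is only an illustration under stated assumptions: ``ToyTrace``, ``unrolled_loop`` and ``ir_loop`` are hypothetical names, not part of the CuTe DSL, and the "IR" is just a list of strings. It shows why early exit is unproblematic for a loop that runs in the interpreter, while a dynamic loop traces its body once with a symbolic index.

```python
# Toy model only: every name here is hypothetical and NOT the CuTe DSL API.
from dataclasses import dataclass, field


@dataclass
class ToyTrace:
    """Records the operations a pretend compiler would emit."""
    ops: list = field(default_factory=list)

    def emit(self, text):
        self.ops.append(text)


def unrolled_loop(trace, n, body):
    """Constexpr-style loop: runs in the Python interpreter, so the body is
    traced once per iteration and early exit is ordinary Python."""
    for i in range(n):
        if body(trace, i) == "break":
            break


def ir_loop(trace, bound, body):
    """Dynamic-style loop: the body is traced once with a symbolic index and
    wrapped in a single loop operation."""
    trace.emit(f"for %i in 0..{bound} {{")
    body(trace, "%i")
    trace.emit("}")


def print_body(trace, i):
    trace.emit(f"  print {i}")


def print_until_two(trace, i):
    # Early exit is decided on a concrete Python integer.
    if i == 2:
        return "break"
    trace.emit(f"  print {i}")


unrolled = ToyTrace()
unrolled_loop(unrolled, 3, print_body)      # three copies of the body

early = ToyTrace()
unrolled_loop(early, 5, print_until_two)    # stops before i == 2

dynamic = ToyTrace()
ir_loop(dynamic, "%bound", print_body)      # one loop op, one body

print(unrolled.ops)
print(early.ops)
print(dynamic.ops)
```

In the toy model the unrolled trace grows with the trip count, while the dynamic trace has a fixed size regardless of the bound, which is the trade-off the loop kinds above let you choose between.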
If-Else Statements
------------------
Standard Python ``if``/``else`` is supported.
Standard Python ``if``/``elif``/``else`` is supported.
* **Predicate is Constexpr (compile-time Python value)** → evaluated at compile time.
* **Predicate is dynamic** → lowered to |IR|.
* **Predicate without annotation** → lowered to |IR|.
* **Predicate annotated with ``cutlass.const_expr``** → evaluated at compile time.
**Example:**
.. code-block:: python
@cute.jit
def main(const_var: cutlass.Constexpr, dynamic_var: cutlass.Int32):
if const_var: # compile-time branch
cute.printf("Const branch\\n")
else:
cute.printf("Const else\\n")
@cute.jit
def main(const_var: cutlass.Constexpr, dynamic_var: cutlass.Int32):
# ✅ This branch is Python branch, evaluated at compile time.
if cutlass.const_expr(const_var):
cute.printf("Const branch\\n")
else:
cute.printf("Const else\\n")
if dynamic_var == 10: # dynamic branch
cute.printf("Dynamic True\\n")
else:
cute.printf("Dynamic False\\n")
# ✅ This branch is dynamic branch, emitted IR branch.
if dynamic_var == 10:
cute.printf("Dynamic True\\n")
else:
cute.printf("Dynamic False\\n")
# ❌ Using a dynamic value with `cutlass.const_expr` is not allowed.
if cutlass.const_expr(dynamic_var == 10):
cute.printf("Bound is 10\\n")
Similarly to for-loops, the ``if cutlass.const_expr`` and ``if cutlass.dynamic_expr`` constructs can
be used to force the evaluation at compile-time or the generation of IR, respectively. Unstructured
control flow is only supported when using ``if cutlass.const_expr``.
While Loops
-----------
Python ``while`` loops are always treated as **dynamic** because the loop condition may become
dynamic after the first iteration. Similarly to for-loops and ``if``/``else``, the
``while cutlass.const_expr`` and ``while cutlass.dynamic_expr`` constructs are available.
Standard Python ``while`` is supported.
* **Condition without annotation** → lowered to |IR|.
* **Condition annotated with ``cutlass.const_expr``** → evaluated at compile time.
**Example:**
.. code-block:: python
@cute.jit
def main(dynamic_var: cutlass.Int32):
n = 0
# ✅ This is Python while loop, evaluated at compile time.
while cutlass.const_expr(n < 10):
cute.printf("Const branch\\n")
n += 1
# ✅ This is dynamic while loop, emitted IR while loop.
while dynamic_var == 10:
cute.printf("Dynamic True\\n")
n += 1
# ❌ Using a dynamic value with `cutlass.const_expr` is not allowed.
while cutlass.const_expr(n < dynamic_var):
n += 1
Compile-Time Metaprogramming
----------------------------
@ -127,7 +148,7 @@ an optional **ReLU** epilogue:
def gemm(..., do_relu: cutlass.Constexpr):
# main GEMM work
...
if const_expr(do_relu): # compile-time guard
if cutlass.const_expr(do_relu): # compile-time guard
# ReLU code is emitted only when do_relu is True
...
@ -135,3 +156,45 @@ an optional **ReLU** epilogue:
gemm(..., False) # ReLU is omitted from the generated |IR|
gemm(..., True) # ReLU is included
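The effect of a compile-time guard can be mimicked in plain Python. The sketch below is a toy model, not the CuTe DSL: ``trace_gemm`` is a hypothetical function that returns the list of operations a pretend compiler would emit, and the operation names are invented for illustration. Because the flag is an ordinary Python value while tracing, the untaken branch never reaches the output.

```python
# Toy model only: hypothetical names, NOT the CuTe DSL API.
def trace_gemm(do_relu: bool) -> list:
    """Return the operations emitted for one specialization of a toy GEMM."""
    ops = ["load A, B", "mma"]
    if do_relu:  # plain Python branch, decided while tracing
        ops.append("relu")
    ops.append("store C")
    return ops


without_relu = trace_gemm(False)
with_relu = trace_gemm(True)

print(without_relu)
print(with_relu)
```

Each call produces a separate specialization, so the variant traced with ``False`` carries no trace of the epilogue at all rather than a branch that is skipped at runtime.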
Limitations of Dynamic Control Flow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Early-exit ``break``, ``continue``, ``pass`` or raising an exception from a
control flow body are not yet supported.
* Operations in the control flow body are traced only when tracing is active in
that region.
* Values defined in a control flow body are not available outside that control
flow.
* Changing the type of a variable in a control flow body is not allowed.
**Example:**
.. code-block:: python
@cute.jit
def control_flow_negative_examples(predicate: cutlass.Boolean):
n = 10
# ❌ This loop is dynamic, early-exit isn't allowed.
for i in cutlass.range_dynamic(n):
if i == 5:
break # Early-exit
if predicate:
val = 10
# ❌ return from control flow body is not allowed.
return
# ❌ Raising exception from control flow body is not allowed.
raise ValueError("This is not allowed")
# ❌ Using pass in control flow body is not allowed.
pass
# ❌ val is not available outside the dynamic if
cute.printf("%d\\n", val)
if predicate:
# ❌ Changing type of a variable in control flow body is not allowed.
n = 10.0

@ -39,7 +39,7 @@ General
the GitHub code only exists as a way for users to file issues and pull requests against.
While it can be used with the pip wheel, we do not recommend most users do so unless they are
hacking on the DSL itself. For all other users, we recommend they
simply ``pip install nvidia-cutlas-dsl`` and use the pip wheel as the single source
simply ``pip install nvidia-cutlass-dsl`` and use the pip wheel as the single source
of truth for the dialect compiler and DSL implementation. CUTLASS GitHub repository will
contain a ``requirements.txt`` file pinning the version of the wheel consistent with the state
of the OSS repository (please see :doc:`quick_start`). This means getting started with

@ -18,7 +18,6 @@ Notable unsupported features
----------------------------
- GeForce RTX 50 Series support
- RS WGMMA (The input matrix A comes from register and the input matrix B comes from shared memory)
- Programmatic Dependent Launch (PDL)
- narrow-precision data type support, including related tensor core instructions
- convolutions
@ -31,6 +30,10 @@ Notable unsupported features
Programming Model
---------------------
**CuTe Layout Algebra Only Supports 32-bit**
Today, we only support 32-bit shapes/strides in CuTe layouts. 64-bit or arbitrary
width support is planned for future releases.
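A quick pre-flight check for this limit can be written in plain Python. This is a minimal sketch under stated assumptions: ``fits_in_int32`` is a hypothetical helper, not a CuTe DSL function; it handles flat (non-nested) shapes and strides only; and treating the largest reachable offset as also needing to fit in a signed 32-bit integer is an assumption made here, not something the text above states.

```python
# Hypothetical helper, NOT part of the CuTe DSL. Flat shapes/strides only.
INT32_MAX = 2**31 - 1


def fits_in_int32(shape, stride):
    """Return True if every extent, every stride, and the largest reachable
    offset of a flat layout fit in a signed 32-bit integer."""
    if len(shape) != len(stride):
        raise ValueError("shape and stride must have the same rank")
    # Largest offset is reached at the last coordinate of every mode.
    max_offset = sum((s - 1) * d for s, d in zip(shape, stride))
    values = list(shape) + list(stride) + [max_offset]
    return all(abs(v) <= INT32_MAX for v in values)


small = fits_in_int32((1024, 1024), (1024, 1))
large = fits_in_int32((65536, 65536), (65536, 1))

print(small)
print(large)
```

The second layout has 32-bit-sized extents and strides individually, yet its last element sits at offset ``2**32 - 1``, which is why the sketch checks the offset as well.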
**Python Native Data Types**
CuTe DSL supports Python data structures when used for "meta-programming,"
but these structures cannot be treated as dynamic values modifiable at runtime.