v4.0 update. (#2371)
@@ -6,12 +6,12 @@ CuTe DSL
 .. toctree::
 :maxdepth: 1

-DSL Introduction <cute_dsl_general/dsl_introduction.rst>
-DSL Code Generation <cute_dsl_general/dsl_code_generation.rst>
-DSL Control Flow <cute_dsl_general/dsl_control_flow.rst>
-DSL JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
-DSL JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
-DSL JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
+Introduction <cute_dsl_general/dsl_introduction.rst>
+Code Generation <cute_dsl_general/dsl_code_generation.rst>
+Control Flow <cute_dsl_general/dsl_control_flow.rst>
+JIT Argument Generation <cute_dsl_general/dsl_jit_arg_generation.rst>
+JIT Argument: Layouts <cute_dsl_general/dsl_dynamic_layout.rst>
+JIT Caching <cute_dsl_general/dsl_jit_caching.rst>
 Integration with Frameworks <cute_dsl_general/framework_integration.rst>
 Debugging with the DSL <cute_dsl_general/debugging.rst>
 Autotuning with the DSL <cute_dsl_general/autotuning_gemm.rst>
@@ -3,10 +3,6 @@
 Guidance for Auto-Tuning
 =============================

-.. contents:: Table of Contents
-:depth: 2
-:local:
-
 Numerous GEMM kernel code examples are offered within our codebase.
 When integrating these kernels into frameworks, auto-tuning becomes essential
 for achieving optimal performance. This involves selecting the appropriate
@@ -3,10 +3,6 @@
 Debugging
 =========

-.. contents:: Table of Contents
-:depth: 2
-:local:
-
 This page provides an overview of debugging techniques and tools for CuTe DSL programs.


@@ -6,10 +6,6 @@
 End-to-End Code Generation
 ==========================

-.. contents::
-:depth: 2
-:local:
-

 1. Techniques for Turning Python into |IR|
 ------------------------------------------
@@ -4,11 +4,8 @@
 .. |DSL| replace:: CuTe DSL
 .. |Constexpr| replace:: **Constexpr** (compile-time Python value)

-|DSL| Control Flow
+Control Flow
 ==================
-.. contents::
-:depth: 2
-:local:


 Overview
@@ -3,10 +3,6 @@
 .. |SLAY| replace:: static layout
 .. |DLAY| replace:: dynamic layout

-.. contents:: Table of Contents
-:depth: 2
-:local:
-
 Static vs Dynamic layouts
 =========================

@@ -4,12 +4,9 @@
 .. |DSL| replace:: CuTe DSL


-|DSL|
+Introduction
 ======================

-.. contents:: Table of Contents
-:depth: 2
-:local:

 Overview
 --------
@@ -2,12 +2,9 @@
 .. |DSL| replace:: CuTe DSL
 .. |CUSTOM_TYPES| replace:: customized types

-|DSL| JIT Function Argument Generation
+JIT Function Argument Generation
 =======================================

-.. contents:: Table of Contents
-:depth: 2
-:local:

 In a nutshell
 --------------
@@ -39,7 +36,7 @@ By default, |DSL| assumes dynamic arguments and tries to infer the argument type
 import cutlass.cute as cute

 @cute.jit
-def foo(x: cutlass.Int32, y: cute.Constexpr):
+def foo(x: cutlass.Int32, y: cutlass.Constexpr):
 print("x = ", x) # Prints x = ?
 print("y = ", y) # Prints y = 2
 cute.printf("x: {}", x) # Prints x: 2
@@ -3,11 +3,9 @@

 .. _JIT_Caching:

-|DSL| JIT Caching
+JIT Caching
 ====================
-.. contents:: Table of Contents
-:depth: 2
-:local:
+

 Zero Compile and JIT Executor
 -----------------------------
@@ -4,10 +4,6 @@
 Integration with Frameworks
 =============================

-.. contents:: Table of Contents
-:depth: 2
-:local:
-
 In order to facilitate the integration of CUTLASS Python with popular frameworks, we leverage the
 `DLPack protocol <https://github.com/dmlc/dlpack>`_ and transform tensors originating from these
 frameworks to CuTe tensors. The present page documents the conventions, the API available to the
@@ -257,8 +253,7 @@ layouts. The full signature of ``mark_compact_shape_dynamic`` is as follows:

 The ``mode`` parameter determines which shape dimension becomes dynamic. After calling this function,
 the specific shape dimension given by ``mode`` is marked as dynamic immediately. The stride will be
-updated accordingly but this process is delayed until the C ABI of the tensor is constructed.
-For modes that have a shape of size 1, their stride are canonicalized to 0.
+updated accordingly. For modes that have a shape of size 1, their stride are canonicalized to 0.

 The ``stride_order`` parameter specifies the ordering of strides in the tensor. It is consistent
 with ``torch.Tensor.dim_order()`` and defaults to ``None``. The parameter indicates the order of
@@ -322,10 +317,6 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
 import torch
 from cutlass.cute.runtime import from_dlpack

-@cute.jit
-def kernel(t: cute.Tensor):
-pass
-
 # (8,4,16,2):(2,16,64,1)
 a = torch.empty(16, 4, 8, 2).permute(2, 1, 0, 3)
 # (1,4,1,32,1):(4,1,4,4,4) => torch tensor when dimension has shape 1, its stride is degenerated to 1,
@@ -337,14 +328,12 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
 t0 = from_dlpack(a).mark_compact_shape_dynamic(
 mode=0, divisibility=2
 )
-kernel(t0)
 # (?{div=2},4,16,2):(2,?{div=4},?{div=16},1)
 print(t0)

 t1 = from_dlpack(a).mark_compact_shape_dynamic(
 mode=1, divisibility=2
 )
-kernel(t1)
 # (8,?{div=2},16,2):(2,16,?{div=32},1)
 print(t1)

@@ -353,21 +342,18 @@ The following example demonstrates how to use ``mark_compact_shape_dynamic`` to
 ).mark_compact_shape_dynamic(
 mode=3, divisibility=2
 )
-kernel(t2)
 # (8,?{div=2},16,?{div=2}):(?{div=2},?{div=16},?{div=32},1)
 print(t2)

 t3 = from_dlpack(b).mark_compact_shape_dynamic(
 mode=2, divisibility=1, stride_order=(3, 0, 2, 4, 1)
 )
-kernel(t3)
 # (1,4,?,32,1):(0,1,4,?{div=4},0)
 print(t3)

 t4 = from_dlpack(b).mark_compact_shape_dynamic(
 mode=2, divisibility=1, stride_order=(2, 3, 4, 0, 1)
 )
-kernel(t4)
 # (1,4,?,32,1):(0,1,128,4,0)
 print(t4)

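The layouts printed in the comments of the example above can be reproduced with a small pure-Python sketch. It assumes that ``stride_order`` lists modes from outermost to innermost, the convention of ``torch.Tensor.dim_order()``, and it stands in a concrete placeholder value for a dynamic extent. ``compact_strides`` is a hypothetical helper written for illustration only; it is not part of the CuTe DSL API and does not model divisibility.

```python
from typing import Optional, Sequence, Tuple


def compact_strides(
    shape: Sequence[int],
    stride_order: Optional[Sequence[int]] = None,
) -> Tuple[int, ...]:
    """Hypothetical helper: compact strides for ``shape``.

    ``stride_order`` lists modes from outermost to innermost (assumed to
    follow torch.Tensor.dim_order()); None means row-major order.
    Modes of size 1 get stride 0, mirroring the canonicalization
    described in the documentation text.
    """
    rank = len(shape)
    order = tuple(range(rank)) if stride_order is None else tuple(stride_order)
    if sorted(order) != list(range(rank)):
        raise ValueError("stride_order must be a permutation of the modes")

    strides = [0] * rank
    running = 1  # number of elements spanned by the modes visited so far
    for mode in reversed(order):  # walk from innermost to outermost
        strides[mode] = 0 if shape[mode] == 1 else running
        running *= shape[mode]
    return tuple(strides)


# Tensor `a` from the example: shape (8,4,16,2), dim order (2,1,0,3).
print(compact_strides((8, 4, 16, 2), (2, 1, 0, 3)))        # (2, 16, 64, 1)

# Tensor `b` with the dynamic mode 2 given a placeholder extent of 2.
print(compact_strides((1, 4, 2, 32, 1), (2, 3, 4, 0, 1)))  # (0, 1, 128, 4, 0)
print(compact_strides((1, 4, 2, 32, 1), (3, 0, 2, 4, 1)))  # (0, 1, 4, 8, 0)
```

With the placeholder extent 2, the last call yields a stride of 8 for mode 3, which is one instance of the ``?{div=4}`` entry shown for ``t3``: that stride is 4 times the dynamic extent, so it is only known to be divisible by 4.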
@@ -124,7 +124,8 @@ Technical

 License
 ---------------------
-**Q:What is the license for CuTe DSL and the associated GitHub samples?**
+**What is the license for CuTe DSL and the associated GitHub samples?**
+
 CuTe DSL components available `on Github <https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL>`__ and via the nvidia-cutlass-dsl Python pip wheel
 are released under the `"NVIDIA Software End User License Agreement (EULA)" <https://github.com/NVIDIA/cutlass/tree/main/EULA.txt>`__.
 Because the pip package includes a compiler that shares several components with the CUDA Toolkit,
@@ -3,9 +3,6 @@
 Limitations
 ====================

-.. contents::
-:depth: 2
-:local:

 Overview
 ---------------------
@@ -42,7 +42,7 @@ Core CuTe DSL Abstractions
 - **Atoms** – Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
 - **Tiled Operations** – Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).

-For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
+For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/00_quickstart.md>`__.

 **Pythonic Kernel Expression**

@@ -29,3 +29,12 @@ To run examples and begin development, we recommend installing:
 .. code-block:: bash

 pip install torch jupyter
+
+Recommended Python environment variables for jupyter notebooks
+--------------------------------------------------------------
+
+We recommend setting the following environment variable when running jupyter notebooks.
+
+.. code-block:: bash
+
+export PYTHONUNBUFFERED=1