.. cutlass/media/docs/pythonDSL/overview.rst

.. _overview:

Overview
===========================

CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development.
By providing Python-based DSLs to the powerful CUTLASS C++ template library, it enables
faster iteration, easier prototyping, and a gentler learning curve for high-performance linear
algebra on NVIDIA GPUs.

Overall, we envision CUTLASS DSLs as a family of domain-specific languages. With the
release of 4.0, we are shipping the first of these: the CuTe DSL. This is a low-level
programming model that is fully consistent with the CuTe C++ abstractions, exposing
core concepts such as layouts, tensors, hardware atoms, and full control over the
hardware thread and data hierarchy.

Why CUTLASS DSLs?
============================

While CUTLASS offers exceptional performance through its C++ template abstractions, the complexity
can present challenges for many developers. CUTLASS 4.x addresses this by:

- **Simplifying metaprogramming**: Metaprogramming in Python is far more intuitive than in C++
- **Accelerating iteration**: Rapid prototyping with familiar Python syntax and fast compile times
- **Lowering barriers**: A gentler learning curve for GPU programming concepts, with consistency between CuTe C++ and the DSL
- **Maintaining performance**: Generated code leverages optimized CUTLASS primitives

Students can learn GPU programming concepts without the complexity of C++ templates.
Researchers and performance engineers can rapidly explore algorithms, prototype, and tune
kernels before moving to production implementations.

Key Concepts and Approach
================================

CUTLASS DSLs translate Python code into a custom intermediate representation (IR),
which is then Just-In-Time (JIT) compiled into optimized CUDA kernels using MLIR and ``ptxas``.

Core CuTe DSL Abstractions
-----------------------------------

- **Layouts**: Describe how data is organized in memory and across threads.
- **Tensors**: Combine data pointers or iterators with layout metadata.
- **Atoms**: Represent fundamental hardware operations such as matrix multiply-accumulate (MMA) or memory copy.
- **Tiled operations**: Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).

For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
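
To make the layout abstraction concrete, here is a plain-Python sketch, not the DSL API, of the core idea behind a CuTe-style layout: a shape/stride pair that maps a logical coordinate to a linear memory offset. The function name is illustrative only:

```python
def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: the inner product of coord and stride."""
    return sum(c * s for c, s in zip(coord, stride))

# A 4x8 extent stored row-major uses stride (8, 1);
# the same extent stored column-major uses stride (1, 4).
row_major_stride = (8, 1)
col_major_stride = (1, 4)

# The same logical element (2, 3) lands at different offsets under each stride.
print(layout_offset((2, 3), row_major_stride))  # 2*8 + 3*1 = 19
print(layout_offset((2, 3), col_major_stride))  # 2*1 + 3*4 = 14
```

The same shape/stride machinery generalizes in CuTe to hierarchical (nested) shapes, which is what lets one abstraction describe both memory and thread arrangements.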

**Pythonic Kernel Expression**

Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.
The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
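
As a small, language-agnostic illustration of the loop tiling mentioned above (plain Python, illustrative only, not CuTe DSL syntax), partitioning a 2-D iteration space into tiles can be sketched as:

```python
def tile_ranges(extent, tile):
    """Split [0, extent) into contiguous tiles of size `tile` (the last tile may be partial)."""
    return [(start, min(start + tile, extent)) for start in range(0, extent, tile)]

# Tile an 8x8 problem into 4x4 tiles. In a real kernel, each tile would map to a
# thread block, and elements within a tile would be distributed across threads.
M, N, TILE = 8, 8, 4
tiles = [(rm, rn) for rm in tile_ranges(M, TILE) for rn in tile_ranges(N, TILE)]
print(tiles)  # [((0, 4), (0, 4)), ((0, 4), (4, 8)), ((4, 8), (0, 4)), ((4, 8), (4, 8))]
```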

**JIT Compilation**

Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA's ``ptxas`` toolchain,
enabling rapid iteration and interactive debugging.
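
The compile-on-first-call pattern behind JIT execution can be sketched generically in plain Python. This is a conceptual toy, not the actual CuTe DSL machinery: the first invocation "compiles" (here, merely specializes and caches per argument types) and subsequent calls reuse the cached artifact:

```python
import functools

def toy_jit(fn):
    """Toy JIT wrapper: 'compile' on first call per argument-type signature, then reuse."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            # A real DSL would lower `fn` to IR here and hand it to MLIR/ptxas;
            # we simply record that a "compilation" happened for this signature.
            cache[key] = fn
            wrapper.compile_count += 1
        return cache[key](*args)

    wrapper.compile_count = 0
    return wrapper

@toy_jit
def axpy(a, x, y):
    return a * x + y

axpy(2.0, 3.0, 1.0)        # first float call triggers "compilation"
axpy(4.0, 5.0, 6.0)        # same signature: cached, no recompilation
print(axpy.compile_count)  # 1
```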

Relationship to CUTLASS C++
=================================

CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, they aim to be a high-productivity kernel
authoring framework that shares core concepts with the CUTLASS 3.x C++ API, such as CuTe, pipelines, and schedulers.

- **Performance**: Generated kernels aim to match CUTLASS C++ kernels in performance; however, some performance gaps
  may exist due to optimizations that have been added to CUTLASS C++ over the years and are not yet present in the DSL examples.
- **Library**: The CUTLASS DSLs do not currently ship with a full GEMM/Conv autotuning profiler or library interface
  akin to CUTLASS C++. Instead, they focus on generating and autotuning individual kernel instances (for example, via tile size exploration) and on native integration with DL frameworks that support auto-tuning.
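
Tile size exploration of the kind mentioned above amounts to a small search loop. The sketch below is plain Python with a made-up cost model; in practice the benchmark callback would JIT-compile the kernel for each candidate tile shape and time it on the GPU:

```python
import itertools

def pick_best_tile(benchmark, candidates):
    """Benchmark each candidate tile shape and return the fastest one."""
    timings = {tile: benchmark(tile) for tile in candidates}
    return min(timings, key=timings.get)

def fake_benchmark(tile):
    # Stand-in cost model for illustration: pretend 128x128 is the sweet spot
    # and penalize candidates by their distance from it.
    m, n = tile
    return abs(m - 128) + abs(n - 128)

candidates = list(itertools.product([64, 128, 256], repeat=2))
print(pick_best_tile(fake_benchmark, candidates))  # (128, 128)
```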

Getting Started
================================

- :doc:`quick_start`: Initial setup and installation.
- :doc:`cute_dsl`: Overview of the typical development workflow using CuTe DSL.
- :doc:`cute_dsl_api`: Full API documentation.
- :doc:`limitations`: Current CuTe DSL constraints and differences from C++.
- :doc:`faqs`: Common questions and known issues.

Current Status & Roadmap
=================================

CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to
change as we improve the system.

Upcoming Milestones
----------------------------------

- Public release targeted for **Summer 2025**
- Expanded support for additional data types and kernel types
- Usability improvements: better error messages, debugging tools, and streamlined APIs
- Broader integration of CUTLASS primitives and features

For known issues and workarounds, please consult the :doc:`limitations` and :doc:`faqs` pages.

Community & Feedback
==================================

We welcome contributions and feedback from the developer community!
You can:

- Submit bug reports or feature requests via our `GitHub Issues page <https://github.com/NVIDIA/cutlass/issues>`__
- Join the CUTLASS community on `Discord <https://discord.com/channels/1019361803752456192/1150868614921064590>`__ to ask questions and share ideas
- Contribute examples, tutorials, or enhancements to the DSLs
- Report unclear or missing documentation
- Propose support for additional data types or kernel variants
- Help prioritize roadmap features by upvoting GitHub issues

Thank you for helping shape the future of CUTLASS DSLs!