Release v4.0.0 (#2294)

2025-05-13 15:55:29 -04:00
parent ad7b2f5e84
commit f115c3f854
299 changed files with 51495 additions and 4413 deletions
--- a/media/docs/pythonDSL/overview.rst
+++ b/media/docs/pythonDSL/overview.rst
@@ -0,0 +1,108 @@
+.. _overview:
+
+Overview
+===========================
+
+CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development. 
+By providing Python-based DSLs to the powerful CUTLASS C++ template library, it enables 
+faster iteration, easier prototyping, and a gentler learning curve for high-performance linear 
+algebra on NVIDIA GPUs.
+
+Overall, we envision CUTLASS DSLs as a family of domain-specific languages (DSLs).
+The 4.0 release introduces the first of these, CuTe DSL.
+It is a low-level programming model that is fully consistent with the CuTe C++ abstractions, exposing
+core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
+
+Why CUTLASS DSLs?
+============================
+
+While CUTLASS offers exceptional performance through its C++ template abstractions, the complexity 
+can present challenges for many developers. CUTLASS 4.x addresses this by:
+
+- **Simplifying metaprogramming**: Metaprogramming in Python is more intuitive than in C++
+- **Accelerating iteration**: Rapid prototyping with familiar Python syntax and fast compile times
+- **Lowering barriers**: Reduced learning curve for GPU programming concepts, and consistency between CuTe C++ and the DSL
+- **Maintaining performance**: Generated code leverages optimized CUTLASS primitives
+
+Students can learn GPU programming concepts without the complexity of C++ templates. 
+Researchers and performance engineers can rapidly explore algorithms, prototype, and tune 
+kernels before moving to production implementations.
+
+Key Concepts and Approach
+================================
+
+CUTLASS DSLs translate Python code into a custom intermediate representation (IR),
+which is then just-in-time (JIT) compiled into optimized CUDA kernels using MLIR and ``ptxas``.
+
+Core CuTe DSL Abstractions
+-----------------------------------
+
+- **Layouts** – Describe how data is organized in memory and across threads.
+- **Tensors** – Combine data pointers or iterators with layout metadata.
+- **Atoms** – Represent fundamental hardware operations like matrix multiply-accumulate (MMA) or memory copy.
+- **Tiled Operations** – Define how atoms are applied across thread blocks and warps (e.g., ``TiledMma``, ``TiledCopy``).
+
+For more on CuTe abstractions, refer to the `CuTe C++ library documentation <https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md>`__.
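+
+The layout and tensor ideas above can be sketched in plain Python. This is a minimal illustration only, not the CuTe DSL API: the ``Layout`` and ``Tensor`` classes below are hypothetical stand-ins, and they assume a layout is a shape plus per-mode strides that maps a logical coordinate to a linear offset.
+

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Layout:
    """Hypothetical stand-in for a layout: a shape plus one stride per mode."""

    shape: Tuple[int, ...]
    stride: Tuple[int, ...]

    def __call__(self, *coord: int) -> int:
        # Map a logical coordinate to a linear offset: sum(coord[i] * stride[i]).
        if len(coord) != len(self.shape):
            raise ValueError("coordinate rank must match layout rank")
        for c, extent in zip(coord, self.shape):
            if not 0 <= c < extent:
                raise IndexError("coordinate out of bounds")
        return sum(c * s for c, s in zip(coord, self.stride))


class Tensor:
    """Hypothetical stand-in for a tensor: flat storage viewed through a layout."""

    def __init__(self, data, layout):
        self.data = data
        self.layout = layout

    def __getitem__(self, coord):
        # Look up the element at the offset the layout assigns to this coordinate.
        return self.data[self.layout(*coord)]


# The same six-element buffer viewed as a 2x3 matrix in two different ways.
storage = list(range(6))
row_major = Tensor(storage, Layout(shape=(2, 3), stride=(3, 1)))
col_major = Tensor(storage, Layout(shape=(2, 3), stride=(1, 2)))
```

+
+Here ``row_major[1, 0]`` and ``col_major[1, 0]`` read different elements of the same buffer, because only the layout differs, not the data.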
+
+**Pythonic Kernel Expression**
+
+Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.
+
+The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
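+
+As a rough illustration of loop tiling expressed with ordinary Python control flow, the sketch below multiplies two small matrices one tile at a time. It is plain host-side Python, not CuTe DSL code; the function name ``tiled_matmul`` and the ``tile`` parameter are assumptions made for this example.
+

```python
def tiled_matmul(a, b, tile=2):
    """Multiply nested-list matrices a (MxK) and b (KxN), one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Outer loops walk over tiles of the output and of the reduction dimension.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Inner loops do the work within one tile; min() clips partial tiles.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c
```

+
+Changing ``tile`` changes only the order in which partial products are accumulated, so the result is the same for any positive tile size.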
+
+**JIT Compilation**
+
+Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA’s ``ptxas`` toolchain, 
+enabling rapid iteration and interactive debugging.
+
+Relationship to CUTLASS C++
+=================================
+
+CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, they aim to be a high-productivity kernel
+authoring framework that shares all concepts with the CUTLASS 3.x C++ API, such as CuTe, pipelines, and schedulers.
+
+- **Performance**: Generated kernels aim to match CUTLASS C++ kernels in performance; however, some gaps
+  may remain because optimizations added to CUTLASS C++ over the years may be missing from the DSL examples.
+- **Library**: The CUTLASS DSLs do not currently ship with a full GEMM/Conv autotuning profiler or library interface
+  akin to CUTLASS C++. Instead, they focus on generating and autotuning individual kernel instances (for example, via tile size exploration) and via native integration with DL frameworks that support autotuning.
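+
+Tile size exploration for a single kernel instance can be pictured as a small search loop. The sketch below is generic Python under stated assumptions, not a CUTLASS DSL interface: ``explore_tile_sizes``, ``measure_seconds``, and ``build_dummy_kernel`` are hypothetical names, and the "kernel" is a placeholder CPU workload.
+

```python
import time


def measure_seconds(run, repeats=3):
    """Return the best wall-clock time of run() over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def explore_tile_sizes(candidates, build_kernel, cost=measure_seconds):
    """Build one kernel instance per tile size and return the cheapest one."""
    results = {}
    for tile in candidates:
        kernel = build_kernel(tile)  # hypothetical: returns a zero-argument callable
        results[tile] = cost(kernel)
    best_tile = min(results, key=results.get)
    return best_tile, results


def build_dummy_kernel(tile):
    """Placeholder workload: sums a fixed range in chunks of `tile` elements."""

    def run():
        total = 0
        for start in range(0, 4096, tile):
            total += sum(range(start, min(start + tile, 4096)))
        return total

    return run
```

+
+Calling ``explore_tile_sizes([16, 64, 256], build_dummy_kernel)`` returns the tile size with the lowest measured time together with all timings; the ``cost`` argument can be swapped for any other metric.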
+
+Getting Started
+================================
+
+- :doc:`quick_start` – Initial setup and installation.
+- :doc:`cute_dsl` – Overview of the typical development workflow using CuTe DSL.
+- :doc:`cute_dsl_api` – Full API documentation.
+- :doc:`limitations` – Understand current CuTe DSL constraints and differences from C++.
+- :doc:`faqs` – Common questions and known issues.
+
+Current Status & Roadmap
+=================================
+
+CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to 
+change as we improve the system.
+
+Upcoming Milestones
+----------------------------------
+
+- Public release targeted for **Summer 2025**
+- Expanded support for additional data types and kernel types
+- Usability improvements: better error messages, debugging tools, and streamlined APIs
+- Broader integration of CUTLASS primitives and features
+
+For known issues and workarounds, please consult :doc:`limitations` and :doc:`faqs`.
+
+Community & Feedback
+==================================
+
+We welcome contributions and feedback from the developer community!
+
+You can:
+
+- Submit bug reports or feature requests via our `GitHub Issues page <https://github.com/NVIDIA/cutlass/issues>`__
+- Join the CUTLASS community on `Discord <https://discord.com/channels/1019361803752456192/1150868614921064590>`__ to ask questions and share ideas
+- Contribute examples, tutorials, or enhancements to the DSLs
+- Report unclear or missing documentation
+- Propose support for additional data types or kernel variants
+- Help prioritize roadmap features by upvoting GitHub issues
+
+Thank you for helping shape the future of CUTLASS DSLs!