v4.3 update. (#2709)

* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
Junkai-Wu
2025-10-22 02:26:30 +08:00
committed by GitHub
parent e6e2cc29f5
commit b1d6e2c9b3
244 changed files with 59272 additions and 10455 deletions


@@ -0,0 +1,25 @@
# CUTLASS Tutorial Examples for Blackwell GEMM
This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.
## Overview
The examples showcase different scenarios and optimization techniques for implementing GEMM operations:
- Basic FP16 GEMM implementation
- Software pipeline optimizations
- Tensor Core utilization
- Thread/warp/block level parallelism
## Examples
### tutorial_fp16_gemm_0.py
A basic example showing:
- FP16 GEMM implementation using Tensor Cores
- TMA (Tensor Memory Accelerator) for efficient data loading
- SMEM (Shared Memory) layouts and access patterns
- Usage of ``cutlass.range(..., prefetch_stages=...)`` to replace boilerplate code for a multi-stage software pipeline
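
To make the last point concrete, here is a minimal pure-Python sketch of the prologue-plus-steady-state loop that a multi-stage software pipeline usually requires when written by hand. It is an illustration only: it does not use CUTLASS or a GPU, it is not how `cutlass.range` is implemented, and the names `pipelined_loop`, `load`, and `compute` are hypothetical.

```python
# Illustration only: a hand-written multi-stage software pipeline in plain Python.
# `pipelined_loop`, `load`, and `compute` are hypothetical names, not CUTLASS APIs.

def pipelined_loop(num_tiles, prefetch_stages, load, compute):
    """Issue each load `prefetch_stages` iterations ahead of the compute that uses it."""
    if prefetch_stages < 1:
        raise ValueError("prefetch_stages must be at least 1")

    trace = []      # records the order in which loads and computes are issued
    in_flight = {}  # tile index -> loaded data; stands in for SMEM stage buffers

    # Prologue: fill the pipeline before any compute starts.
    for k in range(min(prefetch_stages, num_tiles)):
        in_flight[k] = load(k)
        trace.append(("load", k))

    # Main loop: consume tile k, then refill the freed stage with a later tile.
    for k in range(num_tiles):
        compute(in_flight.pop(k))
        trace.append(("compute", k))
        nxt = k + prefetch_stages
        if nxt < num_tiles:
            in_flight[nxt] = load(nxt)
            trace.append(("load", nxt))

    return trace


results = []
trace = pipelined_loop(
    num_tiles=4,
    prefetch_stages=2,
    load=lambda k: k * 10,   # pretend "tile k" is just the number k * 10
    compute=results.append,  # pretend "compute" just records the tile
)
print(trace)
```

The prologue and the refill logic in the main loop are the kind of boilerplate that, per the list above, the `prefetch_stages` argument is meant to replace.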
The example also includes some minor optimizations:
- Epilogue tiling to avoid bursty write-out and reduce register pressure
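
The idea behind epilogue tiling can be sketched in plain Python as follows. This is a conceptual illustration under stated assumptions, not the tutorial's kernel code: the function `store_epilogue_tiled` and its parameters are hypothetical, nested lists stand in for the accumulator and the output matrix, and the slice staged per step stands in for the registers a thread would hold at once.

```python
# Illustration only: write an accumulator tile out in column sub-tiles
# instead of all at once. All names here are hypothetical.

def store_epilogue_tiled(acc, out, row0, col0, epi_n):
    """Copy tile `acc` into `out` at (row0, col0), `epi_n` columns per step."""
    tile_n = len(acc[0])
    steps = 0
    for n0 in range(0, tile_n, epi_n):
        # Stage only an epi_n-wide slice of the accumulator for this step.
        staged = [row[n0:n0 + epi_n] for row in acc]
        for i, row in enumerate(staged):
            start = col0 + n0
            out[row0 + i][start:start + len(row)] = row
        steps += 1
    return steps


acc = [[1, 2, 3, 4],
       [5, 6, 7, 8]]                 # a 2x4 accumulator tile
out = [[0] * 8 for _ in range(4)]   # a 4x8 output matrix
steps = store_epilogue_tiled(acc, out, row0=1, col0=2, epi_n=2)
print(steps, out[1], out[2])
```

Splitting the write-out into several smaller steps spreads the stores over time and limits how much of the accumulator has to be staged at any one moment, which is the motivation the bullet above gives.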