* v4.3 update. * Update the cute_dsl_api changelog's doc link * Update version to 4.3.0 * Update the example link * Update doc to encourage user to install DSL from requirements.txt --------- Co-authored-by: Larry Wu <larwu@nvidia.com>
CUTLASS Tutorial Examples for Blackwell GEMM
This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.
Overview
The examples showcase different scenarios and optimization techniques for implementing GEMM operations:
- Basic FP16 GEMM implementation
- Software Pipeline optimizations
- Tensor Core utilization
- Thread/warp/block level parallelism
Examples
tutorial_fp16_gemm_0.py
A basic example showing:
- FP16 GEMM implementation using Tensor Cores
- TMA (Tensor Memory Access) for efficient data loading
- SMEM (Shared Memory) layouts and access patterns
- Usage of
cutlass.range(..., prefetch_stages=...)to replace boilerplate code for multi-stage software pipeline
With some minor optimization tricks
- Tiling Epilogue to avoid bursty write out and reduce register pressure