CUTLASS Tutorial Examples for Blackwell GEMM

This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.

Overview

The examples showcase different scenarios and optimization techniques for implementing GEMM operations:

  • Basic FP16 GEMM implementation
  • Software pipeline optimizations
  • Tensor Core utilization
  • Thread-, warp-, and block-level parallelism

Examples

tutorial_fp16_gemm_0.py

A basic example showing:

  • FP16 GEMM implementation using Tensor Cores
  • TMA (Tensor Memory Accelerator) for efficient data loading
  • SMEM (Shared Memory) layouts and access patterns
  • Use of cutlass.range(..., prefetch_stages=...) to replace the boilerplate code of a multi-stage software pipeline
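
To make the software-pipeline point concrete, here is a minimal CPU-side sketch in plain Python of the kind of hand-written prologue and steady-state loop that a prefetching loop construct is meant to replace. It does not use the CuTe DSL; the function name pipelined_k_loop and the event labels "load" and "mma" are hypothetical and only illustrate the order in which loads and compute steps would be issued.

```python
from collections import deque


def pipelined_k_loop(num_k_tiles, prefetch_stages):
    """Return the load/compute issue order of a multi-stage pipeline.

    Hypothetical illustration only: "load" stands for fetching one K-tile
    of A and B, "mma" for consuming that K-tile.
    """
    if prefetch_stages < 1:
        raise ValueError("prefetch_stages must be at least 1")

    events = []
    in_flight = deque()  # K-tiles whose loads were issued but not yet consumed

    # Prologue: issue the first loads before any compute starts.
    for k in range(min(prefetch_stages, num_k_tiles)):
        events.append(("load", k))
        in_flight.append(k)

    # Steady state: consume the oldest tile, then refill the freed stage.
    for k in range(num_k_tiles):
        ready = in_flight.popleft()
        assert ready == k  # tiles are consumed in the order they were loaded
        events.append(("mma", k))
        next_k = k + prefetch_stages
        if next_k < num_k_tiles:
            events.append(("load", next_k))
            in_flight.append(next_k)

    return events


if __name__ == "__main__":
    for event in pipelined_k_loop(num_k_tiles=4, prefetch_stages=2):
        print(event)
```

With 4 K-tiles and 2 prefetch stages, two loads are issued up front and each later load is issued as soon as a stage is freed, so data movement for upcoming tiles overlaps with compute on the current one. Writing the loop without the explicit prologue and refill logic is the boilerplate reduction the bullet above refers to.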

It also includes some minor optimizations:

  • Epilogue tiling to avoid bursty write-out and reduce register pressure
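
The idea behind epilogue tiling can be sketched on the CPU as follows. This is an illustration under assumptions, not the tutorial's kernel code: store_epilogue_tiled, epi_tile_n, and the nested-list "accumulator" are hypothetical stand-ins. The accumulator tile is written out one column sub-tile at a time, so only a sub-tile's worth of values is held in the staging buffer (the stand-in for registers) at once.

```python
def store_epilogue_tiled(acc, out, row0, col0, epi_tile_n):
    """Copy an accumulator tile into `out` one column sub-tile at a time.

    Hypothetical illustration: `acc` is a list of rows (the accumulator
    tile), `out` is the full output matrix, and (row0, col0) is the
    tile's top-left corner in `out`. Returns the largest number of
    values staged at once.
    """
    tile_n = len(acc[0])
    max_live = 0

    for n0 in range(0, tile_n, epi_tile_n):
        n1 = min(n0 + epi_tile_n, tile_n)

        # Stage only the current sub-tile (stand-in for registers).
        regs = [row[n0:n1] for row in acc]
        max_live = max(max_live, sum(len(r) for r in regs))

        # Write this sub-tile out before staging the next one.
        for i, r in enumerate(regs):
            out[row0 + i][col0 + n0:col0 + n1] = r

    return max_live


if __name__ == "__main__":
    acc = [[float(i * 8 + j) for j in range(8)] for i in range(4)]
    out = [[0.0] * 8 for _ in range(4)]
    live = store_epilogue_tiled(acc, out, row0=0, col0=0, epi_tile_n=2)
    print("max values staged at once:", live)
    print("output matches accumulator:", out == acc)
```

Here a 4x8 tile written in sub-tiles of 2 columns stages 8 values at a time instead of all 32, and the stores are spread across four smaller steps rather than one large burst.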