Junkai-Wu b1d6e2c9b3 v4.3 update. (#2709)

* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>

2025-10-21 14:26:30 -04:00

fp16_gemm_0.py

README.md

CUTLASS Tutorial Examples for Blackwell GEMM

This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.

Overview

The examples showcase different scenarios and optimization techniques for implementing GEMM operations:

Basic FP16 GEMM implementation
Software pipeline optimizations
Tensor Core utilization
Thread-, warp-, and block-level parallelism
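
To make the block-level decomposition concrete, here is a minimal pure-Python sketch of a tiled GEMM. It is not CuTe DSL code and does not use Tensor Cores; the function names (`tiled_gemm`, `reference_gemm`) and the tile sizes are hypothetical, chosen only to illustrate how the output is split into tiles and how the K dimension is walked in blocks.

```python
def reference_gemm(a, b):
    """Plain triple-loop C = A @ B on nested lists."""
    m, k, n = len(a), len(a[0]), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_gemm(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Compute C = A @ B by iterating over output tiles and K blocks.

    Hypothetical illustration: on a GPU, each (m0, n0) output tile would
    typically be assigned to one thread block, and the k0 loop would be
    that block's mainloop.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):            # output tile rows
        for n0 in range(0, n, tile_n):        # output tile columns
            for k0 in range(0, k, tile_k):    # mainloop over K blocks
                for i in range(m0, min(m0 + tile_m, m)):
                    for j in range(n0, min(n0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The tiled version visits the same multiply-accumulate operations as the reference loop, only grouped by tile, so both produce identical results for integer inputs.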

Examples

fp16_gemm_0.py

A basic example showing:

FP16 GEMM implementation using Tensor Cores
TMA (Tensor Memory Accelerator) for efficient data loading
SMEM (shared memory) layouts and access patterns
Use of cutlass.range(..., prefetch_stages=...) in place of hand-written boilerplate for a multi-stage software pipeline
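
The multi-stage software pipeline mentioned above can be pictured with a small pure-Python model. This is a conceptual sketch, not CuTe DSL code: it does not call cutlass.range, and it assumes that a prefetch depth of N means the load for K tile k + N is issued before the math for tile k runs. The function name `pipelined_mainloop` and the event labels are hypothetical.

```python
def pipelined_mainloop(num_k_tiles, prefetch_stages):
    """Return the order of load/mma events for a simple prefetching loop.

    Assumption (not taken from the CuTe DSL docs): loads are issued
    `prefetch_stages` iterations ahead of the math that consumes them.
    """
    events = []

    # Prologue: issue the first `prefetch_stages` loads up front.
    for k in range(min(prefetch_stages, num_k_tiles)):
        events.append(("load", k))

    # Steady state: issue the load that is `prefetch_stages` ahead,
    # then consume the tile for the current iteration.
    for k in range(num_k_tiles):
        ahead = k + prefetch_stages
        if ahead < num_k_tiles:
            events.append(("load", ahead))
        events.append(("mma", k))

    return events


if __name__ == "__main__":
    for event in pipelined_mainloop(num_k_tiles=4, prefetch_stages=2):
        print(event)
```

The prologue and the bounds check on the look-ahead load are the kind of boilerplate that a hand-written pipeline has to carry. With a prefetch depth of 0 the model degenerates to a strict load-then-compute sequence with no overlap.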

It also includes some minor optimizations:

Tiling the epilogue to avoid bursty write-out and reduce register pressure
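
The idea behind epilogue tiling can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's implementation: the accumulator tile is modeled as a nested list, and `tiled_epilogue`, `epi_tile_n`, and the `store` callback are hypothetical names. Instead of staging the whole accumulator tile and writing it in one burst, the epilogue handles one column subtile at a time.

```python
def tiled_epilogue(acc_tile, epi_tile_n, store):
    """Write an accumulator tile out in column subtiles of width epi_tile_n.

    Only one subtile is materialized per step, which models holding fewer
    values live at once and spreading the stores over several smaller writes.
    """
    n = len(acc_tile[0])
    for n0 in range(0, n, epi_tile_n):
        # Slice out the current subtile; the last one may be narrower.
        subtile = [row[n0:n0 + epi_tile_n] for row in acc_tile]
        store(n0, subtile)


def run_example():
    """Store a 2x6 tile in subtiles of width 4 and rebuild the output."""
    acc = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]]
    out = [[None] * 6 for _ in range(2)]
    calls = []

    def store(n0, subtile):
        calls.append((n0, len(subtile[0])))
        for i, row in enumerate(subtile):
            out[i][n0:n0 + len(row)] = row

    tiled_epilogue(acc, epi_tile_n=4, store=store)
    return acc, out, calls
```

In `run_example`, the 2x6 tile is written in two steps (widths 4 and 2) and the reassembled output matches the accumulator.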