# CUTLASS Tutorial Examples for Blackwell GEMM

This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.

## Overview

The examples showcase different scenarios and optimization techniques for implementing GEMM operations:

- Basic FP16 GEMM implementation
- Software Pipeline optimizations
- Tensor Core utilization
- Thread/warp/block level parallelism

## Examples

### tutorial_fp16_gemm_0.py

A basic example showing:
- FP16 GEMM implementation using Tensor Cores
- TMA (Tensor Memory Access) for efficient data loading
- SMEM (Shared Memory) layouts and access patterns
- Usage of ``cutlass.range(..., prefetch_stages=...)`` to replace boilerplate code for multi-stage software pipeline

With some minor optimization tricks
- Tiling Epilogue to avoid bursty write out and reduce register pressure