v4.3 update. (#2709)
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
examples/python/CuTeDSL/blackwell/tutorial_gemm/README.md
@ -0,0 +1,25 @@
# CUTLASS Tutorial Examples for Blackwell GEMM

This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.

## Overview

The examples showcase different scenarios and optimization techniques for implementing GEMM operations:

- Basic FP16 GEMM implementation
- Software pipeline optimizations
- Tensor Core utilization
- Thread-, warp-, and block-level parallelism

## Examples

### tutorial_fp16_gemm_0.py

A basic example showing:

- FP16 GEMM implementation using Tensor Cores
- TMA (Tensor Memory Accelerator) for efficient data loading
- SMEM (Shared Memory) layouts and access patterns
- Usage of ``cutlass.range(..., prefetch_stages=...)`` to replace the boilerplate code of a multi-stage software pipeline
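
To make the last point concrete, the following plain-Python sketch simulates the hand-written structure of a multi-stage software pipeline: a prologue that issues the first loads, then a main loop that consumes the oldest tile and issues the load for a later one. It does not use the CuTe DSL. The names `run_pipeline`, `load_tile`, and `compute_tile` are hypothetical, and the event log models issue order only, not real asynchronous TMA copies. It is meant only as a picture of the kind of boilerplate that ``cutlass.range(..., prefetch_stages=...)`` is described as replacing.

```python
from collections import deque


def run_pipeline(num_tiles, prefetch_stages):
    """Simulate a software pipeline with `prefetch_stages` tiles loaded ahead.

    Returns the ordered list of (event, tile_index) pairs. All names are
    hypothetical; this models issue order only, not real asynchronous copies.
    """
    if prefetch_stages < 1:
        raise ValueError("prefetch_stages must be at least 1 in this sketch")

    events = []
    in_flight = deque()  # tiles whose loads were issued but not yet consumed

    def load_tile(k):
        events.append(("load", k))
        in_flight.append(k)

    def compute_tile():
        k = in_flight.popleft()  # the oldest outstanding load is consumed first
        events.append(("compute", k))

    # Prologue: issue the first loads before any compute starts.
    for k in range(min(prefetch_stages, num_tiles)):
        load_tile(k)

    # Main loop: consume the oldest tile, then refill the freed stage.
    for k in range(num_tiles):
        compute_tile()
        next_k = k + prefetch_stages
        if next_k < num_tiles:
            load_tile(next_k)

    return events


if __name__ == "__main__":
    for event in run_pipeline(num_tiles=4, prefetch_stages=2):
        print(event)
```

With `num_tiles=4` and `prefetch_stages=2`, the sketch issues loads for tiles 0 and 1 up front, then alternates compute and load until no tiles remain to be loaded, so at most two loads are outstanding at any time.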

It also applies a minor optimization:

- Tiling the epilogue to avoid bursty write-out and reduce register pressure
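
One way to picture epilogue tiling is the plain-Python model below. It assumes, purely for illustration, that the epilogue converts float accumulators to FP16 and stores them, and it processes the accumulator tile a few columns at a time instead of all at once. The names `tiled_epilogue`, `to_fp16`, and `epi_tile_n` are hypothetical and not part of the CuTe DSL; the `peak_staged` counter is a rough stand-in for how many converted values must be held at once.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float).

    Raises OverflowError for values outside the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def tiled_epilogue(acc, epi_tile_n):
    """Convert an accumulator tile to FP16, `epi_tile_n` columns at a time.

    `acc` is a list of rows of float accumulators. Returns the output tile and
    the largest number of converted values staged at once. Hypothetical names;
    a model of the idea, not CuTe DSL code.
    """
    rows, cols = len(acc), len(acc[0])
    out = [[0.0] * cols for _ in range(rows)]
    peak_staged = 0

    for n0 in range(0, cols, epi_tile_n):
        n1 = min(n0 + epi_tile_n, cols)
        # Stage only this column slice (stands in for registers holding a subtile).
        staged = [[to_fp16(acc[m][n]) for n in range(n0, n1)] for m in range(rows)]
        peak_staged = max(peak_staged, rows * (n1 - n0))
        # Write the slice out before converting the next one.
        for m in range(rows):
            out[m][n0:n1] = staged[m]

    return out, peak_staged


if __name__ == "__main__":
    acc = [[float(m * 8 + n) + 0.5 for n in range(8)] for m in range(4)]
    _, peak_full = tiled_epilogue(acc, epi_tile_n=8)   # whole tile at once
    _, peak_tiled = tiled_epilogue(acc, epi_tile_n=2)  # two columns at a time
    print(peak_full, peak_tiled)
```

Both calls produce the same output tile; the tiled version stages fewer values at a time and spreads the stores over several smaller steps, which is the intuition behind the lower register pressure and less bursty write-out mentioned above.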