v4.3 update. (#2709)

* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-22 02:26:30 +08:00
parent e6e2cc29f5
commit b1d6e2c9b3
244 changed files with 59272 additions and 10455 deletions
--- a/examples/python/CuTeDSL/blackwell/tutorial_gemm/README.md
+++ b/examples/python/CuTeDSL/blackwell/tutorial_gemm/README.md
@@ -0,0 +1,25 @@
+# CUTLASS Tutorial Examples for Blackwell GEMM
+
+This folder contains tutorial examples demonstrating how to write performant GEMM (General Matrix Multiplication) kernels using Tensor Cores on NVIDIA Blackwell GPUs.
+
+## Overview
+
+The examples showcase different scenarios and optimization techniques for implementing GEMM operations:
+
+- Basic FP16 GEMM implementation
+- Software pipeline optimizations
+- Tensor Core utilization
+- Thread/warp/block level parallelism
+
+## Examples
+
+### tutorial_fp16_gemm_0.py
+
+A basic example showing:
+- FP16 GEMM implementation using Tensor Cores
+- TMA (Tensor Memory Accelerator) for efficient data loading
+- SMEM (Shared Memory) layouts and access patterns
+- Usage of ``cutlass.range(..., prefetch_stages=...)`` to replace the boilerplate code for a multi-stage software pipeline
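
To make the multi-stage software pipeline item concrete, here is a minimal pure-Python sketch of the kind of hand-written boilerplate such a loop usually involves: a prologue that issues a few loads ahead of time, then a main loop that consumes the oldest tile and refills the freed slot. This is only a CPU-side model under stated assumptions. The helper `pipelined_loop` and the `"load"`/`"mma"` event names are hypothetical and are not part of CUTLASS, and the README does not describe how `cutlass.range` lowers the loop.

```python
from collections import deque


def pipelined_loop(num_tiles, prefetch_stages):
    """Model a multi-stage software pipeline as a trace of events.

    Hypothetical helper for illustration only; not a CUTLASS API.
    """
    if prefetch_stages < 1:
        raise ValueError("prefetch_stages must be at least 1")

    trace = []
    in_flight = deque()  # tiles whose loads were issued but not yet consumed

    # Prologue: issue up to `prefetch_stages` loads before any compute.
    for k in range(min(prefetch_stages, num_tiles)):
        trace.append(("load", k))
        in_flight.append(k)

    # Main loop: consume the oldest tile, then refill the slot it freed.
    # Once no tiles are left to load, the loop simply drains the pipeline.
    for k in range(num_tiles):
        tile = in_flight.popleft()
        trace.append(("mma", tile))
        nxt = k + prefetch_stages
        if nxt < num_tiles:
            trace.append(("load", nxt))
            in_flight.append(nxt)
    return trace


trace = pipelined_loop(num_tiles=4, prefetch_stages=2)
for event in trace:
    print(event)
```

With `prefetch_stages=2`, the loads for tiles 0 and 1 are issued before the first compute step, and each later load is issued right after the compute step that frees a stage. On a GPU this ordering is what allows memory latency to overlap with Tensor Core work.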
+
+It also applies some minor optimization tricks:
+- Tiling the epilogue to avoid bursty write-out and reduce register pressure
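
The epilogue tiling idea can be sketched in plain Python as follows. The accumulator tile is written out one subtile at a time, so only a subtile's worth of values is staged at any moment. This is a CPU model for intuition only: the function `store_tiled` and its parameters `epi_rows` and `epi_cols` are hypothetical names, the staged list stands in for registers or a staging buffer, and the tutorial's actual epilogue code is not reproduced here.

```python
def store_tiled(acc, out, epi_rows, epi_cols):
    """Copy `acc` into `out` one epilogue subtile at a time.

    Returns the largest number of elements staged at once.
    Hypothetical helper for illustration only; not a CUTLASS API.
    """
    rows, cols = len(acc), len(acc[0])
    peak_staged = 0
    for r0 in range(0, rows, epi_rows):
        for c0 in range(0, cols, epi_cols):
            # Stage a single subtile instead of the whole accumulator tile.
            staged = [row[c0:c0 + epi_cols] for row in acc[r0:r0 + epi_rows]]
            peak_staged = max(peak_staged, sum(len(r) for r in staged))
            # Write the staged subtile to its position in the output.
            for i, staged_row in enumerate(staged):
                out[r0 + i][c0:c0 + len(staged_row)] = staged_row
    return peak_staged


# A 4x8 accumulator tile written out in 2x4 subtiles.
acc = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
out = [[0.0] * 8 for _ in range(4)]
peak = store_tiled(acc, out, epi_rows=2, epi_cols=4)
print(out == acc, peak)  # prints: True 8
```

The result matches an untiled copy, but at most 8 elements are staged at a time instead of all 32, and the stores are spread over four smaller steps instead of one large burst.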