# CUTLASS - Programming Examples
* [00_basic_gemm](00_basic_gemm/)
launches a basic GEMM with single precision inputs and outputs
* [01_cutlass_utilities](01_cutlass_utilities/)
demonstrates CUTLASS Utilities for allocating and initializing tensors
* [02_dump_reg_smem](02_dump_reg_smem/)
debugging utilities for printing register and shared memory contents
* [03_visualize_layout](03_visualize_layout/)
utility for visualizing all layout functions in CUTLASS
* [04_tile_iterator](04_tile_iterator/)
example demonstrating an iterator over tiles in memory
* [05_batched_gemm](05_batched_gemm/)
example demonstrating CUTLASS's batched strided GEMM operation
* [06_splitK_gemm](06_splitK_gemm/)
example demonstrating CUTLASS's Split-K parallel reduction kernel
* [07_volta_tensorop_gemm](07_volta_tensorop_gemm/)
example demonstrating mixed precision GEMM using Volta Tensor Cores
* [08_turing_tensorop_gemm](08_turing_tensorop_gemm/)
example demonstrating integer GEMM using Turing Tensor Cores
* [09_turing_tensorop_conv2dfprop](09_turing_tensorop_conv2dfprop/)
example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
* [10_planar_complex](10_planar_complex/)
example demonstrating planar complex GEMM kernels
* [11_planar_complex_array](11_planar_complex_array/)
example demonstrating planar complex kernels with batch-specific problem sizes
* [12_gemm_bias_relu](12_gemm_bias_relu/)
example demonstrating GEMM fused with bias and relu
* [13_two_tensor_op_fusion](13_two_tensor_op_fusion/)
example demonstrating two GEMMs or convolutions fused in one kernel
* [14_ampere_tf32_tensorop_gemm](14_ampere_tf32_tensorop_gemm/)
example demonstrating FP32 GEMM with implicit TF32 conversion
* [15_ampere_sparse_tensorop_gemm](15_ampere_sparse_tensorop_gemm/)
example demonstrating usage of Sparse Tensor cores
* [16_ampere_tensorop_conv2dfprop](16_ampere_tensorop_conv2dfprop/)
example demonstrating forward convolution on tensors of layout NHWC
* [17_fprop_per_channel_bias](17_fprop_per_channel_bias/)
example demonstrating convolution fused with per channel bias and relu
* [18_ampere_fp64_tensorop_affine2_gemm](18_ampere_fp64_tensorop_affine2_gemm/)
example demonstrating Affine-2 GEMM
* [19_tensorop_canonical](19_tensorop_canonical/)
Canonical GEMM using tensor cores
* [20_simt_canonical](20_simt_canonical/)
Canonical GEMM using SIMT
* [21_quaternion_gemm](21_quaternion_gemm/)
example demonstrating Quaternion GEMM computations
* [22_quaternion_conv](22_quaternion_conv/)
example demonstrating Quaternion convolution
* [23_ampere_gemm_operand_reduction_fusion](23_ampere_gemm_operand_reduction_fusion/)
example demonstrating how to reduce one of the GEMM operands along the k-dimension while computing the GEMM
* [24_gemm_grouped](24_gemm_grouped/)
example demonstrating batch of GEMM operations with distinct problem sizes
* [25_ampere_fprop_mainloop_fusion](25_ampere_fprop_mainloop_fusion/)
example demonstrating fusing the activation's per channel scale+bias+relu into the fprop mainloop
* [26_ampere_wgrad_mainloop_fusion](26_ampere_wgrad_mainloop_fusion/)
example demonstrating fusing activation's per channel scale+bias+relu into the wgrad mainloop
* [27_ampere_3xtf32_fast_accurate_tensorop_gemm](27_ampere_3xtf32_fast_accurate_tensorop_gemm/)
example demonstrating emulation of a fast accurate SGEMM with TF32 operations
* [28_ampere_3xtf32_fast_accurate_tensorop_fprop](28_ampere_3xtf32_fast_accurate_tensorop_fprop/)
example demonstrating emulation of a fast accurate FP32 convolution with TF32 operation
* [29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm](29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/)
example demonstrating emulation of a fast accurate CGEMM with TF32 operation
* [30_wgrad_split_k](30_wgrad_split_k/)
example demonstrating how to compute conv2d gradient with respect to weight (wgrad) together with split-K
* [31_basic_syrk](31_basic_syrk/)
example demonstrating Symmetric Rank-K update
* [32_basic_trmm](32_basic_trmm/)
example demonstrating Triangular Matrix-Matrix multiplication
* [33_ampere_3xtf32_tensorop_symm](33_ampere_3xtf32_tensorop_symm/)
example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
* [34_transposed_conv2d](34_transposed_conv2d/)
example demonstrating how to compute 2d transposed convolution, also known as deconvolution, using CUTLASS conv2d Dgrad kernels
* [35_gemm_softmax](35_gemm_softmax/)
example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
* [36_gather_scatter_fusion](36_gather_scatter_fusion/)
example fusing a gather before the GEMM and a scatter after the GEMM into the same kernel
* [37_gemm_layernorm_gemm_fusion](37_gemm_layernorm_gemm_fusion/)
example fusing GEMM->LayerNorm->GEMM into one kernel
* [38_syr2k_grouped](38_syr2k_grouped/)
example demonstrating a batch of SYR2K operations with distinct problem sizes
* [39_gemm_permute](39_gemm_permute/)
example demonstrating batched GEMM operations with output results permuted as reshaped tensors
* [40_cutlass_py](40_cutlass_py/)
example demonstrating CUTLASS with Python interface
* [41_multi_head_attention](41_multi_head_attention/)
example demonstrating multi-head attention with variable sequence length inputs
* [42_ampere_tensorop_group_conv](42_ampere_tensorop_group_conv/)
example demonstrating how to run group convolution kernels using CUTLASS functions and data structures, targeting tensor cores
* [43_ell_block_sparse_gemm](43_ell_block_sparse_gemm/)
example demonstrating a Blocked-ELL sparse GEMM
* [44_fused_multi_head_attention](44_fused_multi_head_attention/)
example demonstrating fused multihead attention (fixed & variable) using shared memory
* [45_dual_gemm](45_dual_gemm/)
example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
* [46_depthwise_simt_conv2dfprop](46_depthwise_simt_conv2dfprop/)
example demonstrating depthwise 2d convolution kernels using CUTLASS functions and data structures, targeting SIMT instructions
* [47_ampere_gemm_universal_streamk](47_ampere_gemm_universal_streamk/)
example contrasting the Stream-K parallel decomposition for GEMM threadblocks versus the
"classic data-parallel" and "Split-K" decompositions.
* [48_hopper_warp_specialized_gemm](48_hopper_warp_specialized_gemm/)
Simple tensorop GEMM example using CUTLASS 3.0 APIs targeting NVIDIA Hopper architecture
* [49_hopper_gemm_schedules_with_collective_builder](49_hopper_gemm_schedules_with_collective_builder/)
Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel schedules supported in CUTLASS 3.0, such as warp specialized persistent mainloops.
* [50_hopper_gemm_with_epilogue_swizzle](50_hopper_gemm_with_epilogue_swizzle/)
Hopper GEMM example that creates a GEMM kernel with a custom collective mainloop and a custom vectorized epilogue.
* [51_hopper_gett](51_hopper_gett/)
Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts.
* [52_hopper_gather_scatter_fusion](52_hopper_gather_scatter_fusion/)
Hopper example that fuses gather before GEMM and scatter after GEMM into the same kernel
* [53_hopper_gemm_permute](53_hopper_gemm_permute/)
Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
* [54_hopper_fp8_warp_specialized_gemm](54_hopper_fp8_warp_specialized_gemm/)
Hopper example of instantiating and running an FP8 GEMM kernel
* [55_hopper_mixed_dtype_gemm](55_hopper_mixed_dtype_gemm/)
Hopper GEMM example with different A and B data types using CUTLASS 3.x APIs for DL kernels with fused dequantization.
* [56_hopper_ptr_array_batched_gemm](56_hopper_ptr_array_batched_gemm/)
Hopper Ptr-Array Batched GEMM example using CUTLASS 3.x API.
* [57_hopper_grouped_gemm](57_hopper_grouped_gemm/)
Hopper Grouped GEMM using CUTLASS 3.x API.
* [58_ada_fp8_gemm](58_ada_fp8_gemm/)
Ada GEMM kernel targeting Ada FP8 tensor cores via the CUTLASS 2.x API.
* [59_ampere_gather_scatter_conv](59_ampere_gather_scatter_conv/)
CuTe and CUTLASS 3.x based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors,
showing how kernel authors can re-use CUTLASS 3.x collectives in their custom kernels.
* [61_hopper_gemm_with_topk_and_softmax](61_hopper_gemm_with_topk_and_softmax/)
Hopper GEMM kernel with Top-K and softmax epilogue fusion.
[//]: #
* [70_blackwell_gemm](70_blackwell_gemm)
Simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [71_blackwell_gemm_with_collective_builder](71_blackwell_gemm_with_collective_builder)
Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
* [72a_blackwell_narrow_precision_gemm](72a_blackwell_narrow_precision_gemm)
Block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [73_blackwell_gemm_preferred_cluster](73_blackwell_gemm_preferred_cluster/)
Blackwell SM100 GEMM kernel with preferred cluster feature.
* [74_blackwell_gemm_streamk](74_blackwell_gemm_streamk/)
Blackwell SM100 GEMM kernel using the Stream-K scheduler
* [75_blackwell_grouped_gemm](75_blackwell_grouped_gemm)
Blackwell SM100 grouped GEMM kernel
* [76_blackwell_conv](76_blackwell_conv/)
Simple convolution (fprop/dgrad/wgrad) example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [77_blackwell_fmha](77_blackwell_fmha)
Blackwell SM100 FMHA kernel
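Several of the examples above (06, 30, 47, 74) revolve around partitioning a GEMM's K dimension across parallel workers and reducing their partial products. The following is a minimal NumPy sketch of that Split-K idea, not CUTLASS code; the function name is illustrative:

```python
import numpy as np

def splitk_gemm(A, B, splits=4):
    """Reference Split-K GEMM: partition the K dimension into slices,
    compute a partial GEMM per slice, then reduce the partials.
    On the GPU, each slice is assigned to its own threadblock and the
    final sum is performed by a separate parallel reduction."""
    K = A.shape[1]
    bounds = np.linspace(0, K, splits + 1, dtype=int)  # slice boundaries along K
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)  # reduction over the partial products
```

Stream-K (examples 47 and 74) generalizes this by assigning fractional, possibly uneven shares of K-iterations to workers so that all SMs finish at roughly the same time.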
[//]: #
# CuTe - Programming Examples
Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in [cutlass/examples/cute](./cute/).
Additionally, CuTe's core layout and layout algebra have their own test cases within [cutlass/test/unit/cute/core/](../test/unit/cute/core/) that users might find useful as examples of CuTe.
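CuTe's central abstraction is the layout: a (shape, stride) pair that maps a logical coordinate to a linear index as the inner product of the coordinate with the strides, applied recursively for hierarchical (nested) shapes. A minimal Python sketch of that mapping, purely illustrative (the function name is made up and CuTe itself is C++):

```python
def layout_index(coord, stride):
    """Map a (possibly nested) logical coordinate to a linear index,
    CuTe-layout style: sum of coordinate * stride over all modes,
    recursing into nested tuples for hierarchical layouts."""
    if isinstance(coord, tuple):
        return sum(layout_index(c, s) for c, s in zip(coord, stride))
    return coord * stride
```

For example, a column-major 4x8 matrix has strides (1, 4), so coordinate (2, 3) maps to 2*1 + 3*4 = 14; nesting the first mode, coordinate ((2, 1), 3) with strides ((1, 4), 8) maps to 2 + 4 + 24 = 30.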
# Python Interface Examples
Examples leveraging CUTLASS's [Python interface](../python/README.md) are located in [cutlass/examples/python](python/).
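Whichever interface you use, a host-side reference is handy for validating kernel outputs. Below is a plain NumPy sketch of the common fused epilogue computed by examples such as 12 (GEMM + bias + ReLU); `gemm_bias_relu` is an illustrative name, not a CUTLASS API:

```python
import numpy as np

def gemm_bias_relu(A, B, bias):
    """Host reference for a fused GEMM + bias + ReLU epilogue.
    In the CUTLASS examples, the bias add and activation run inside the
    GEMM kernel's epilogue rather than as separate elementwise kernels."""
    C = A @ B + bias          # bias broadcast across rows (one value per column)
    return np.maximum(C, 0.0)  # ReLU activation
```

Comparing a device result against such a reference (with a numeric tolerance appropriate to the accumulation precision) is the standard correctness check used throughout the examples.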