# CUTLASS - Programming Examples
* [00_basic_gemm](00_basic_gemm/)
launches a basic GEMM with single precision inputs and outputs
* [01_cutlass_utilities](01_cutlass_utilities/)
demonstrates CUTLASS Utilities for allocating and initializing tensors
* [02_dump_reg_smem](02_dump_reg_smem/)
debugging utilities for printing register and shared memory contents
* [03_visualize_layout](03_visualize_layout/)
utility for visualizing all layout functions in CUTLASS
* [04_tile_iterator](04_tile_iterator/)
example demonstrating an iterator over tiles in memory
* [05_batched_gemm](05_batched_gemm/)
example demonstrating CUTLASS's batched strided GEMM operation
* [06_splitK_gemm](06_splitK_gemm/)
example demonstrating CUTLASS's Split-K parallel reduction kernel
* [07_volta_tensorop_gemm](07_volta_tensorop_gemm/)
example demonstrating mixed precision GEMM using Volta Tensor Cores
* [08_turing_tensorop_gemm](08_turing_tensorop_gemm/)
example demonstrating integer GEMM using Turing Tensor Cores
* [09_turing_tensorop_conv2dfprop](09_turing_tensorop_conv2dfprop/)
example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
* [10_planar_complex](10_planar_complex/)
example demonstrating planar complex GEMM kernels
* [11_planar_complex_array](11_planar_complex_array/)
example demonstrating planar complex kernels with batch-specific problem sizes
* [12_gemm_bias_relu](12_gemm_bias_relu/)
example demonstrating GEMM fused with bias and relu
* [13_two_tensor_op_fusion](13_two_tensor_op_fusion/)
example demonstrating two GEMMs or convolutions fused in one kernel
* [14_ampere_tf32_tensorop_gemm](14_ampere_tf32_tensorop_gemm/)
example demonstrating FP32 GEMM with implicit TF32 conversion
* [15_ampere_sparse_tensorop_gemm](15_ampere_sparse_tensorop_gemm/)
example demonstrating usage of Sparse Tensor cores
* [16_ampere_tensorop_conv2dfprop](16_ampere_tensorop_conv2dfprop/)
example demonstrating forward convolution on tensors of layout NHWC
* [17_fprop_per_channel_bias](17_fprop_per_channel_bias/)
example demonstrating convolution fused with per channel bias and relu
* [18_ampere_fp64_tensorop_affine2_gemm](18_ampere_fp64_tensorop_affine2_gemm/)
example demonstrating Affine-2 GEMM
* [19_tensorop_canonical](19_tensorop_canonical/)
Canonical GEMM using tensor cores
* [20_simt_canonical](20_simt_canonical/)
Canonical GEMM using SIMT
* [21_quaternion_gemm](21_quaternion_gemm/)
example demonstrating Quaternion GEMM computations
* [22_quaternion_conv](22_quaternion_conv/)
example demonstrating Quaternion convolution
* [23_ampere_gemm_operand_reduction_fusion](23_ampere_gemm_operand_reduction_fusion/)
example demonstrating how to reduce one of the GEMM operands along the k-dimension while computing the GEMM
* [24_gemm_grouped](24_gemm_grouped/)
example demonstrating batch of GEMM operations with distinct problem sizes
* [25_ampere_fprop_mainloop_fusion](25_ampere_fprop_mainloop_fusion/)
example demonstrating fusing the activation's per channel scale+bias+relu into the fprop mainloop
* [26_ampere_wgrad_mainloop_fusion](26_ampere_wgrad_mainloop_fusion/)
example demonstrating fusing activation's per channel scale+bias+relu into the wgrad mainloop
* [27_ampere_3xtf32_fast_accurate_tensorop_gemm](27_ampere_3xtf32_fast_accurate_tensorop_gemm/)
example demonstrating emulation of a fast accurate SGEMM with TF32 operations
* [28_ampere_3xtf32_fast_accurate_tensorop_fprop](28_ampere_3xtf32_fast_accurate_tensorop_fprop/)
example demonstrating emulation of a fast accurate FP32 convolution with TF32 operation
* [29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm](29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/)
example demonstrating emulation of a fast accurate CGEMM with TF32 operation
* [30_wgrad_split_k](30_wgrad_split_k/)
example demonstrating how to compute conv2d gradient with respect to weight (wgrad) together with split-K
* [31_basic_syrk](31_basic_syrk/)
example demonstrating Symmetric Rank-K update
* [32_basic_trmm](32_basic_trmm/)
example demonstrating Triangular Matrix-Matrix multiplication
* [33_ampere_3xtf32_tensorop_symm](33_ampere_3xtf32_tensorop_symm/)
example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
* [34_transposed_conv2d](34_transposed_conv2d/)
example demonstrating how to compute 2d transposed convolution, also known as deconvolution, using CUTLASS conv2d Dgrad kernels
* [35_gemm_softmax](35_gemm_softmax/)
example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
* [36_gather_scatter_fusion](36_gather_scatter_fusion/)
example fusing a gather before the GEMM and a scatter after the GEMM into the same kernel
* [37_gemm_layernorm_gemm_fusion](37_gemm_layernorm_gemm_fusion/)
example fusing GEMM->LayerNorm->GEMM into one kernel
* [38_syr2k_grouped](38_syr2k_grouped/)
example demonstrating a batch of SYR2K operations with distinct problem sizes
* [39_gemm_permute](39_gemm_permute/)
example demonstrating batched GEMM operations with output results permuted as reshaped tensors
* [40_cutlass_py](40_cutlass_py/)
example demonstrating CUTLASS with Python interface
* [41_multi_head_attention](41_multi_head_attention/)
example demonstrating multi-head attention with variable sequence length inputs
* [42_ampere_tensorop_group_conv](42_ampere_tensorop_group_conv/)
example demonstrating how to run group convolution kernels using CUTLASS functions and data structures, targeting tensor cores
* [43_ell_block_sparse_gemm](43_ell_block_sparse_gemm/)
example demonstrating a Blocked-ELL sparse GEMM
* [44_fused_multi_head_attention](44_fused_multi_head_attention/)
example demonstrating fused multihead attention (fixed & variable) using shared memory
* [45_dual_gemm](45_dual_gemm/)
example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
* [46_depthwise_simt_conv2dfprop](46_depthwise_simt_conv2dfprop/)
example demonstrating depthwise 2d convolution kernels using CUTLASS functions and data structures, targeting SIMT instructions
* [47_ampere_gemm_universal_streamk](47_ampere_gemm_universal_streamk/)
example contrasting the Stream-K parallel decomposition for GEMM threadblocks versus the
"classic data-parallel" and "Split-K" decompositions.
* [48_hopper_warp_specialized_gemm](48_hopper_warp_specialized_gemm/)
Simple tensorop GEMM example using CUTLASS 3.0 APIs targeting NVIDIA Hopper architecture
* [49_hopper_gemm_schedules_with_collective_builder](49_hopper_gemm_schedules_with_collective_builder/)
Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel schedules supported in CUTLASS 3.0, such as warp specialized persistent mainloops.
* [50_hopper_gemm_with_epilogue_swizzle](50_hopper_gemm_with_epilogue_swizzle/)
Hopper GEMM example that creates a GEMM kernel with a custom collective mainloop and a custom vectorized epilogue.
* [51_hopper_gett](51_hopper_gett/)
Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts.
* [52_hopper_gather_scatter_fusion](52_hopper_gather_scatter_fusion/)
Hopper example that fuses gather before GEMM and scatter after GEMM into the same kernel
* [53_hopper_gemm_permute](53_hopper_gemm_permute/)
Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
* [54_hopper_fp8_warp_specialized_gemm](54_hopper_fp8_warp_specialized_gemm/)
Hopper example of instantiating and running an FP8 GEMM kernel
* [55_hopper_mixed_dtype_gemm](55_hopper_mixed_dtype_gemm/)
Hopper GEMM example with different A and B data types using CUTLASS 3.x APIs for DL kernels with fused dequantization.
* [56_hopper_ptr_array_batched_gemm](56_hopper_ptr_array_batched_gemm/)
Hopper Ptr-Array Batched GEMM example using CUTLASS 3.x API.
* [57_hopper_grouped_gemm](57_hopper_grouped_gemm/)
Hopper Grouped GEMM using CUTLASS 3.x API.
* [58_ada_fp8_gemm](58_ada_fp8_gemm/)
Ada GEMM kernel targeting Ada FP8 tensor cores via the CUTLASS 2.x API.
* [59_ampere_gather_scatter_conv](59_ampere_gather_scatter_conv/)
CuTe and CUTLASS 3.x based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors,
showing how kernel authors can re-use CUTLASS 3.x collectives in their custom kernels.
* [61_hopper_gemm_with_topk_and_softmax](61_hopper_gemm_with_topk_and_softmax/)
Hopper GEMM kernel with Top-K and softmax epilogue fusion.
[//]: #
* [70_blackwell_gemm](70_blackwell_gemm)
Simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [71_blackwell_gemm_with_collective_builder](71_blackwell_gemm_with_collective_builder)
Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
* [72a_blackwell_narrow_precision_gemm](72a_blackwell_narrow_precision_gemm)
Block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [73_blackwell_gemm_preferred_cluster](73_blackwell_gemm_preferred_cluster/)
Blackwell SM100 GEMM kernel with preferred cluster feature.
* [74_blackwell_gemm_streamk](74_blackwell_gemm_streamk/)
Blackwell SM100 GEMM kernel using the Stream-K scheduler
* [75_blackwell_grouped_gemm](75_blackwell_grouped_gemm)
Blackwell SM100 grouped GEMM kernel
* [76_blackwell_conv](76_blackwell_conv/)
Simple convolution (fprop/dgrad/wgrad) example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
* [77_blackwell_fmha](77_blackwell_fmha)
Blackwell SM100 FMHA kernel
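Several of the examples above (06, 30, 47, 74) revolve around partitioning a GEMM's K dimension across parallel workers and reducing their partial products. The following is a minimal NumPy sketch of that Split-K idea, not CUTLASS code; the function name is illustrative:

```python
import numpy as np

def splitk_gemm(A, B, splits=4):
    """Reference Split-K GEMM: partition the K dimension into slices,
    compute a partial GEMM per slice, then reduce the partials.
    On the GPU, each slice is assigned to its own threadblock and the
    final sum is performed by a separate parallel reduction."""
    K = A.shape[1]
    bounds = np.linspace(0, K, splits + 1, dtype=int)  # slice boundaries along K
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)  # reduction over the partial products
```

Stream-K (examples 47 and 74) generalizes this by assigning fractional, possibly uneven shares of K-iterations to workers so that all SMs finish at roughly the same time.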
[//]: #
# CuTe - Programming Examples
Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in [cutlass/examples/cute](./cute/).
Additionally, CuTe's core layout and layout algebra have their own test cases within [cutlass/test/unit/cute/core/](../test/unit/cute/core/) that users might find useful as examples of CuTe.
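CuTe's central abstraction is the layout: a (shape, stride) pair that maps a logical coordinate to a linear index as the inner product of the coordinate with the strides, applied recursively for hierarchical (nested) shapes. A minimal Python sketch of that mapping, purely illustrative (the function name is made up and CuTe itself is C++):

```python
def layout_index(coord, stride):
    """Map a (possibly nested) logical coordinate to a linear index,
    CuTe-layout style: sum of coordinate * stride over all modes,
    recursing into nested tuples for hierarchical layouts."""
    if isinstance(coord, tuple):
        return sum(layout_index(c, s) for c, s in zip(coord, stride))
    return coord * stride
```

For example, a column-major 4x8 matrix has strides (1, 4), so coordinate (2, 3) maps to 2*1 + 3*4 = 14; nesting the first mode, coordinate ((2, 1), 3) with strides ((1, 4), 8) maps to 2 + 4 + 24 = 30.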
# Python Interface Examples
Examples leveraging CUTLASS's [Python interface](../python/README.md) are located in [cutlass/examples/python](python/).
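Whichever interface you use, a host-side reference is handy for validating kernel outputs. Below is a plain NumPy sketch of the common fused epilogue computed by examples such as 12 (GEMM + bias + ReLU); `gemm_bias_relu` is an illustrative name, not a CUTLASS API:

```python
import numpy as np

def gemm_bias_relu(A, B, bias):
    """Host reference for a fused GEMM + bias + ReLU epilogue.
    In the CUTLASS examples, the bias add and activation run inside the
    GEMM kernel's epilogue rather than as separate elementwise kernels."""
    C = A @ B + bias          # bias broadcast across rows (one value per column)
    return np.maximum(C, 0.0)  # ReLU activation
```

Comparing a device result against such a reference (with a numeric tolerance appropriate to the accumulation precision) is the standard correctness check used throughout the examples.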