# CUTLASS - Programming Examples

* [00_basic_gemm](00_basic_gemm/) launches a basic GEMM with single precision inputs and outputs
* [01_cutlass_utilities](01_cutlass_utilities/) demonstrates CUTLASS Utilities for allocating and initializing tensors
* [02_dump_reg_smem](02_dump_reg_smem/) debugging utilities for printing register and shared memory contents
* [03_visualize_layout](03_visualize_layout/) utility for visualizing all layout functions in CUTLASS
* [04_tile_iterator](04_tile_iterator/) example demonstrating an iterator over tiles in memory
* [05_batched_gemm](05_batched_gemm/) example demonstrating CUTLASS's batched strided GEMM operation
* [06_splitK_gemm](06_splitK_gemm/) example demonstrating CUTLASS's Split-K parallel reduction kernel
* [07_volta_tensorop_gemm](07_volta_tensorop_gemm/) example demonstrating mixed precision GEMM using Volta Tensor Cores
* [08_turing_tensorop_gemm](08_turing_tensorop_gemm/) example demonstrating integer GEMM using Turing Tensor Cores
* [09_turing_tensorop_conv2dfprop](09_turing_tensorop_conv2dfprop/) example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
* [10_planar_complex](10_planar_complex/) example demonstrating planar complex GEMM kernels
* [11_planar_complex_array](11_planar_complex_array/) example demonstrating planar complex kernels with batch-specific problem sizes
* [12_gemm_bias_relu](12_gemm_bias_relu/) example demonstrating GEMM fused with bias and relu
* [13_two_tensor_op_fusion](13_two_tensor_op_fusion/) example demonstrating two GEMMs or convolutions fused in one kernel
* [14_ampere_tf32_tensorop_gemm](14_ampere_tf32_tensorop_gemm/) example demonstrating FP32 GEMM with implicit TF32 conversion
* [15_ampere_sparse_tensorop_gemm](15_ampere_sparse_tensorop_gemm/) example demonstrating usage of Sparse Tensor Cores
* [16_ampere_tensorop_conv2dfprop](16_ampere_tensorop_conv2dfprop/) example demonstrating forward convolution on tensors of NHWC layout
* [17_fprop_per_channel_bias](17_fprop_per_channel_bias/) example demonstrating convolution fused with per-channel bias and relu
* [18_ampere_fp64_tensorop_affine2_gemm](18_ampere_fp64_tensorop_affine2_gemm/) example demonstrating Affine-2 GEMM
* [19_tensorop_canonical](19_tensorop_canonical/) canonical GEMM using Tensor Cores
* [20_simt_canonical](20_simt_canonical/) canonical GEMM using SIMT
* [21_quaternion_gemm](21_quaternion_gemm/) example demonstrating quaternion GEMM computations
* [22_quaternion_conv](22_quaternion_conv/) example demonstrating quaternion convolution
* [23_ampere_gemm_operand_reduction_fusion](23_ampere_gemm_operand_reduction_fusion/) example demonstrating how to reduce one of the GEMM operands along the k-dimension while computing the GEMM
* [24_gemm_grouped](24_gemm_grouped/) example demonstrating a batch of GEMM operations with distinct problem sizes
* [25_ampere_fprop_mainloop_fusion](25_ampere_fprop_mainloop_fusion/) example demonstrating fusing the activation's per-channel scale+bias+relu into the fprop mainloop
* [26_ampere_wgrad_mainloop_fusion](26_ampere_wgrad_mainloop_fusion/) example demonstrating fusing the activation's per-channel scale+bias+relu into the wgrad mainloop
* [27_ampere_3xtf32_fast_accurate_tensorop_gemm](27_ampere_3xtf32_fast_accurate_tensorop_gemm/) example demonstrating emulation of a fast, accurate SGEMM with TF32 operations
* [28_ampere_3xtf32_fast_accurate_tensorop_fprop](28_ampere_3xtf32_fast_accurate_tensorop_fprop/) example demonstrating emulation of a fast, accurate FP32 convolution with TF32 operations
* [29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm](29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/) example demonstrating emulation of a fast, accurate CGEMM with TF32 operations
* [30_wgrad_split_k](30_wgrad_split_k/) example demonstrating how to compute the conv2d gradient with respect to the weights (wgrad) together with Split-K
* [31_basic_syrk](31_basic_syrk/) example demonstrating symmetric rank-k update
* [32_basic_trmm](32_basic_trmm/) example demonstrating triangular matrix-matrix multiplication
* [33_ampere_3xtf32_tensorop_symm](33_ampere_3xtf32_tensorop_symm/) example demonstrating symmetric matrix-matrix multiplication with FP32 emulation
* [34_transposed_conv2d](34_transposed_conv2d/) example demonstrating how to compute 2D transposed convolution, also known as deconvolution, using CUTLASS conv2d dgrad kernels
* [35_gemm_softmax](35_gemm_softmax/) example demonstrating GEMM fused with softmax in mixed precision using Ampere Tensor Cores
* [36_gather_scatter_fusion](36_gather_scatter_fusion/) example fusing a gather before the GEMM and a scatter after the GEMM into the same kernel
* [37_gemm_layernorm_gemm_fusion](37_gemm_layernorm_gemm_fusion/) example fusing GEMM->layernorm->GEMM into one kernel
* [38_syr2k_grouped](38_syr2k_grouped/) example demonstrating a batch of SYR2K operations with distinct problem sizes
* [39_gemm_permute](39_gemm_permute/) example demonstrating batched GEMM operations with output results permuted as reshaped tensors
* [40_cutlass_py](40_cutlass_py/) example demonstrating CUTLASS's Python interface
* [41_multi_head_attention](41_multi_head_attention/) example demonstrating attention with non-fixed sequence length input
* [42_ampere_tensorop_group_conv](42_ampere_tensorop_group_conv/) example demonstrating how to run group convolution kernels using functions and data structures provided by CUTLASS, using Tensor Cores
* [43_ell_block_sparse_gemm](43_ell_block_sparse_gemm/) example demonstrating a Blocked-ELL sparse GEMM
* [44_fused_multi_head_attention](44_fused_multi_head_attention/) example demonstrating fused multi-head attention (fixed & variable sequence length) using shared memory
* [45_dual_gemm](45_dual_gemm/) example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
* [46_depthwise_simt_conv2dfprop](46_depthwise_simt_conv2dfprop/) example demonstrating depthwise 2D convolution kernels using functions and data structures provided by CUTLASS, using SIMT instructions
* [47_ampere_gemm_universal_streamk](47_ampere_gemm_universal_streamk/) example contrasting the Stream-K parallel decomposition for GEMM threadblocks with the "classic data-parallel" and "Split-K" decompositions
* [48_hopper_warp_specialized_gemm](48_hopper_warp_specialized_gemm/) simple tensorop GEMM example using CUTLASS 3.0 APIs targeting the NVIDIA Hopper architecture
* [49_hopper_gemm_schedules_with_collective_builder](49_hopper_gemm_schedules_with_collective_builder/) Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel schedules supported in CUTLASS 3.0, such as warp specialized persistent mainloops
* [50_hopper_gemm_with_epilogue_swizzle](50_hopper_gemm_with_epilogue_swizzle/) Hopper GEMM example that creates a GEMM kernel with a custom collective mainloop and a custom vectorized epilogue
* [51_hopper_gett](51_hopper_gett/) Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts
* [52_hopper_gather_scatter_fusion](52_hopper_gather_scatter_fusion/) Hopper example that fuses a gather before the GEMM and a scatter after the GEMM into the same kernel
* [53_hopper_gemm_permute](53_hopper_gemm_permute/) Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
* [54_hopper_fp8_warp_specialized_gemm](54_hopper_fp8_warp_specialized_gemm/) Hopper example of instantiating and running an FP8 GEMM kernel
* [55_hopper_mixed_dtype_gemm](55_hopper_mixed_dtype_gemm/) Hopper GEMM example with different A and B data types using CUTLASS 3.x APIs for DL kernels with fused dequantization
* [56_hopper_ptr_array_batched_gemm](56_hopper_ptr_array_batched_gemm/) Hopper Ptr-Array batched GEMM example using the CUTLASS 3.x API
* [57_hopper_grouped_gemm](57_hopper_grouped_gemm/) Hopper grouped GEMM using the CUTLASS 3.x API
* [58_ada_fp8_gemm](58_ada_fp8_gemm/) Ada GEMM kernel targeting Ada FP8 Tensor Cores via the CUTLASS 2.x API
* [59_ampere_gather_scatter_conv](59_ampere_gather_scatter_conv/) CuTe- and CUTLASS 3.x-based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors, showing how kernel authors can re-use CUTLASS 3.x collectives in their custom kernels
* [61_hopper_gemm_with_topk_and_softmax](61_hopper_gemm_with_topk_and_softmax/) Hopper GEMM kernel with Top-K and softmax epilogue fusion
* [70_blackwell_gemm](70_blackwell_gemm) simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs
* [71_blackwell_gemm_with_collective_builder](71_blackwell_gemm_with_collective_builder) Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
* [72a_blackwell_narrow_precision_gemm](72a_blackwell_narrow_precision_gemm) block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs
* [73_blackwell_gemm_preferred_cluster](73_blackwell_gemm_preferred_cluster/) Blackwell SM100 GEMM kernel with the preferred cluster feature
* [74_blackwell_gemm_streamk](74_blackwell_gemm_streamk/) Blackwell SM100 GEMM kernel using the Stream-K scheduler
* [75_blackwell_grouped_gemm](75_blackwell_grouped_gemm) Blackwell SM100 grouped GEMM kernel
* [76_blackwell_conv](76_blackwell_conv/) simple convolution (fprop/dgrad/wgrad) example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs
* [77_blackwell_fmha](77_blackwell_fmha) Blackwell SM100 FMHA kernel

# CuTe - Programming Examples

Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in [cutlass/examples/cute](./cute/).
Additionally, CuTe's core layout and layout algebra have their own test cases within [cutlass/test/unit/cute/core/](../test/unit/cute/core/) that users might find useful as examples of CuTe.

# Python Interface Examples

Examples leveraging CUTLASS's [Python interface](../python/README.md) are located in [cutlass/examples/python](python/).
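Like the C++ examples, the Python examples generally verify device results against a host reference. Below is a standalone NumPy-only sketch of such a reference check; `reference_gemm` is an illustrative helper, not part of the CUTLASS Python API, and no CUTLASS installation is required to run it.

```python
import numpy as np

def reference_gemm(alpha, A, B, beta, C):
    """Host reference for the GEMM epilogue D = alpha * (A @ B) + beta * C."""
    return alpha * (A @ B) + beta * C

# Illustrative problem size and random operands (float32, as in most examples).
rng = np.random.default_rng(0)
M, N, K = 64, 32, 16
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)
C = rng.standard_normal((M, N)).astype(np.float32)

D_ref = reference_gemm(1.0, A, B, 0.5, C)
assert D_ref.shape == (M, N)
# A device result D would then be compared with np.allclose(D, D_ref, ...).
```

When adapting a check like this, keep the comparison tolerance consistent with the data types involved (e.g. TF32 or FP8 accumulation warrants a looser tolerance than FP32).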