v4.2 tag release. (#2638)

2025-09-16 00:21:53 +08:00
parent 56f0718a97
commit 6a35b4d22f
161 changed files with 14056 additions and 3793 deletions
--- a/media/docs/cpp/profiler.md
+++ b/media/docs/cpp/profiler.md
@@ -79,7 +79,7 @@ Instruction shape levels control the selection of WGMMA shapes used in kernel ge
 - **Level 2**: Includes shapes that are powers of 2.
 - **Level 3**: Includes all other shapes.

-The detailed defination of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).
+The detailed definition of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).

 Schedule pruning levels decide the epilogue schedule and mainloop schedule to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_utils.py),

@@ -122,6 +122,55 @@ For each mixed dtype kernel, the kernel generator will generate combinations of

 For {4-bits-dtype, 8-bits-dtype} x 16-bits-dtype, the kernel generator will further generate kernels using shuffled layouts for the narrow data type matrix, which may perform better than their non-shuffled counterparts.

+## Instantiating more kernels with Blackwell
+Blackwell (SM100) and Blackwell Ultra also support
+`CUTLASS_LIBRARY_INSTANTIATION_LEVEL`, which can be used to instantiate all possible kernel combinations.
+When it is set, `CUTLASS_LIBRARY_KERNELS` must be non-empty, because generating and filtering these
+kernels alone can take hours.
+Exercise caution: not all of these configurations are tested, and some may fail to
+compile or fail to launch at runtime.
+
+```bash
+$ cmake .. \
+  -DCUTLASS_NVCC_ARCHS="100f" \
+  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_*" \
+  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
+  -DCUTLASS_UNITY_BUILD_ENABLED=ON
+```
+
+For Blackwell, the CUTLASS profiler uses the same four-digit integer (the global instantiation level) to manage the generation of kernel configurations:
+
+0. **Instruction Shape**
+1. **MMA Shape Multiplier**
+2. **Cluster Shape**
+3. **Data Type and Schedule Pruning**
+
+Note that for Blackwell kernels the MMA shape multiplier is no longer necessary, since Blackwell kernels do not have
+separate ping-pong and cooperative schedules. The profiler ignores this digit when instantiating.
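
The digit layout can be sketched in Python as follows. This is only an illustration: `decode_instantiation_level` is a hypothetical helper, not part of `cutlass_library`, and it assumes that index 0 in the list above is the rightmost (least significant) digit, which the numbering suggests but this section does not state.

```python
# Hypothetical helper: split a four-digit global instantiation level into
# one digit per category. ASSUMPTION: index 0 is the rightmost digit.
CATEGORIES = (
    "instruction_shape",           # digit 0
    "mma_shape_multiplier",        # digit 1 (ignored for Blackwell kernels)
    "cluster_shape",               # digit 2
    "dtype_and_schedule_pruning",  # digit 3
)


def decode_instantiation_level(level: int) -> dict:
    """Return a mapping from category name to its digit in `level`."""
    if not 0 <= level <= 9999:
        raise ValueError("expected a level between 0 and 9999")
    return {name: (level // 10**i) % 10 for i, name in enumerate(CATEGORIES)}


if __name__ == "__main__":
    # Under the stated assumption, 1203 means: pruning level 1,
    # cluster level 2, multiplier 0, instruction shape level 3.
    print(decode_instantiation_level(1203))
```

Under that assumption, changing only the hundreds digit changes the cluster shape level and leaves the other categories untouched.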
+
+Cluster shape levels define the number of CTAs (Cooperative Thread Arrays) included in the kernel generation:
+
+- **Level 0**: Only dynamic cluster shapes.
+- **Level 1**: `(1, 1, 1)` for 1SM kernels and `(2, 1, 1)` for 2SM kernels.
+- **Level 2**: Also `(1, 2, 1)` for 1SM kernels, and `(2, 2, 1)` and `(4, 1, 1)` for 2SM kernels.
+- **Level 3**: `(1, 4, 1)` for 1SM kernels, and `(2, 4, 1)` and `(4, 2, 1)` for 2SM kernels.
+- **Level 4**: `(4, 4, 1)` for both 1SM and 2SM kernels.
+- **Level 5**: `(2, 1, 1)` for 1SM kernels.
+- **Level 6**: `(2, 2, 1)` and `(4, 1, 1)` for 1SM kernels, and `(8, 1, 1)` for 2SM kernels.
+- **Level 7**: `(2, 4, 1)` and `(4, 2, 1)` for 1SM kernels.
+- **Level 8**: `(1, 8, 1)` and `(8, 1, 1)` for 1SM kernels.
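
As a rough illustration, the list above can be transcribed into a lookup table. The sketch below is not the code in `sm100_shapes.py`: the names `CLUSTER_SHAPES` and `static_cluster_shapes` are hypothetical, and it assumes that levels are cumulative (a level includes the shapes of every lower level), which the wording "also" at Level 2 hints at but the list does not confirm.

```python
from math import prod

# Static cluster shapes per level, transcribed from the list above.
# Level 0 (dynamic cluster shapes only) has no static entry.
CLUSTER_SHAPES = {
    "1sm": {
        1: [(1, 1, 1)],
        2: [(1, 2, 1)],
        3: [(1, 4, 1)],
        4: [(4, 4, 1)],
        5: [(2, 1, 1)],
        6: [(2, 2, 1), (4, 1, 1)],
        7: [(2, 4, 1), (4, 2, 1)],
        8: [(1, 8, 1), (8, 1, 1)],
    },
    "2sm": {
        1: [(2, 1, 1)],
        2: [(2, 2, 1), (4, 1, 1)],
        3: [(2, 4, 1), (4, 2, 1)],
        4: [(4, 4, 1)],
        6: [(8, 1, 1)],
    },
}


def static_cluster_shapes(kind: str, level: int) -> list:
    """List static shapes up to `level`. ASSUMPTION: levels are cumulative."""
    table = CLUSTER_SHAPES[kind]
    return [shape for lvl in sorted(table) if lvl <= level for shape in table[lvl]]


if __name__ == "__main__":
    # The number of CTAs in a cluster is the product of its dimensions.
    for shape in static_cluster_shapes("2sm", 3):
        print(shape, "->", prod(shape), "CTAs")
```

If the levels turn out not to be cumulative, replacing `lvl <= level` with `lvl == level` gives the per-level view instead.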
+
+Instruction shape levels control the selection of MMA shapes used in kernel generation:
+
+- **Level 0**: Generates the "default" shape only.
+- **Level 1**: Includes additional shapes for FP8, FP6, and FP4 as well as MX and NVFP4.
+- **Level 2**: Includes small tile shapes.
+- **Level 3**: Includes some non-power of 2 shapes.
+- **Level 4**: Includes further small tile shapes and non-power of 2 shapes.
+- **Level 5**: Includes all shapes.
+
+The detailed definition of the instantiation levels controlling cluster shape and instruction shape can be found in [sm100_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm100_shapes.py).
+
 ## CUTLASS Profiler usage

 The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
@@ -577,6 +626,10 @@ cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_
 * `f16_f16_f16_void_f16`: In this case, C type is set to `void`, indicating that residual matrix support
 is disabled.

+## Further Documentation
+
+For documentation on profiling blockwise and groupwise (software-scaled) GEMMs, see the [example 81 README](https://github.com/NVIDIA/cutlass/blob/main/examples/81_blackwell_gemm_blockwise/README.md).
+
 # Convolution

 The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards