v4.2 tag release. (#2638)

Junkai-Wu
2025-09-16 00:21:53 +08:00
committed by GitHub
parent 56f0718a97
commit 6a35b4d22f
161 changed files with 14056 additions and 3793 deletions


@ -79,7 +79,7 @@ Instruction shape levels control the selection of WGMMA shapes used in kernel ge
- **Level 2**: Includes shapes that are powers of 2.
- **Level 3**: Includes all other shapes.
The detailed defination of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).
The detailed definition of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).
Schedule pruning levels decide the epilogue schedule and mainloop schedule to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_utils.py),
@ -122,6 +122,55 @@ For each mixed dtype kernel, the kernel generator will generate combinations of
For {4-bits-dtype, 8-bits-dtype} x 16-bits-dtype, the kernel generator will further generate kernels using shuffled layouts for the narrow data type matrix, which may perform better than their non-shuffled counterparts.
## Instantiating more kernels with Blackwell
Blackwell (SM100) and Blackwell Ultra similarly support
`CUTLASS_LIBRARY_INSTANTIATION_LEVEL` for instantiating all possible kernel combinations.
Because generating and filtering the full set of kernels alone can take hours,
`CUTLASS_LIBRARY_KERNELS` must be non-empty when this option is used.
Exercise caution: not all of these configurations are tested, and some may fail to
compile or fail to launch at runtime.
```bash
$ cmake .. \
-DCUTLASS_NVCC_ARCHS="100f" \
-DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_*" \
-DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
-DCUTLASS_UNITY_BUILD_ENABLED=ON
```
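As a rough illustration of how a wildcard filter such as the `CUTLASS_LIBRARY_KERNELS` value above narrows a large candidate set, the sketch below applies shell-style glob matching to a few kernel names. The helper `filter_kernels` and the candidate names are hypothetical, and Python's `fnmatch` is used as a stand-in: the filtering logic inside the CUTLASS generator may differ in its details.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def filter_kernels(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the names that match at least one glob pattern (case-sensitive)."""
    patterns = list(patterns)
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# Pattern taken from the cmake invocation above.
pattern = "cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_*"

# Hypothetical kernel names, invented only to exercise the filter.
candidates = [
    "cutlass3x_sm100_tensorop_gemm_f16_f16_f32_void_f32_example_a",
    "cutlass3x_sm100_tensorop_gemm_f16_f16_f32_f16_f16_example_b",
    "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f32_example_c",
]

selected = filter_kernels(candidates, [pattern])
print(selected)
```

Only the first candidate matches: the second differs in its C and D types, and the third targets a different architecture.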
For Blackwell, the CUTLASS profiler uses the same four-digit integer level (global instantiation level) mechanism to manage the generation of kernel configurations:
0. **Instruction Shape**
1. **MMA Shape Multiplier**
2. **Cluster Shape**
3. **Data Type and Schedule Pruning**
Note that for Blackwell kernels the MMA shape multiplier is no longer necessary, since Blackwell kernels do not have
separate ping-pong and cooperative schedules. The profiler ignores this digit when instantiating.
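The four-digit layout can be sketched as a small decoder. This is a minimal illustration, not the generator's code: it assumes that digit 0 is the least-significant (rightmost) decimal digit, which the text above does not state explicitly, and the names `InstantiationLevel` and `decode_level` are hypothetical.

```python
from typing import NamedTuple


class InstantiationLevel(NamedTuple):
    instruction_shape: int     # digit 0
    mma_shape_multiplier: int  # digit 1 (ignored for Blackwell kernels)
    cluster_shape: int         # digit 2
    schedule_pruning: int      # digit 3 (data type and schedule pruning)


def decode_level(level: int) -> InstantiationLevel:
    """Split a global instantiation level into its four per-digit levels."""
    if not 0 <= level <= 9999:
        raise ValueError("expected a level with at most four decimal digits")
    # ASSUMPTION: digit i is the coefficient of 10**i (digit 0 = ones place).
    digits = [(level // 10**i) % 10 for i in range(4)]
    return InstantiationLevel(*digits)


lvl = decode_level(500)
print(lvl.cluster_shape, lvl.instruction_shape)
```

Under that assumption, a level of `500` selects cluster shape level 5 and leaves the other three digits at 0.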
Cluster shape levels define the number of CTAs (Cooperative Thread Arrays) included in the kernel generation:
- **Level 0**: Only dynamic cluster shapes.
- **Level 1**: `(1, 1, 1)` for 1SM kernels and `(2, 1, 1)` for 2SM kernels.
- **Level 2**: For 1SM kernels we also have `(1, 2, 1)` and for 2SM we have `(2, 2, 1)` and `(4, 1, 1)`.
- **Level 3**: For 1SM kernels we have `(1, 4, 1)` and for 2SM we have `(2, 4, 1)` and `(4, 2, 1)`.
- **Level 4**: For 1SM kernels we have `(4, 4, 1)` and for 2SM we have `(4, 4, 1)`.
- **Level 5**: For 1SM kernels we have `(2, 1, 1)`.
- **Level 6**: For 1SM kernels we have `(2, 2, 1)` and `(4, 1, 1)` and for 2SM kernels we have `(8, 1, 1)`.
- **Level 7**: For 1SM kernels we have `(2, 4, 1)` and `(4, 2, 1)`.
- **Level 8**: For 1SM kernels we have `(1, 8, 1)` and `(8, 1, 1)`.
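The cluster shape levels listed above can be transcribed into a lookup table, as in the sketch below. The table copies the static shapes from the list; the helper `shapes_up_to` is hypothetical and assumes that levels are cumulative (each level adds to the shapes of the lower ones), which the wording "we also have" suggests but does not confirm. Level 0, which uses only dynamic cluster shapes, has no static entries. The authoritative definitions are in `sm100_shapes.py`.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]

# Static cluster shapes introduced at each level, transcribed from the list above.
CLUSTER_SHAPES: Dict[str, Dict[int, List[Shape]]] = {
    "1sm": {
        1: [(1, 1, 1)],
        2: [(1, 2, 1)],
        3: [(1, 4, 1)],
        4: [(4, 4, 1)],
        5: [(2, 1, 1)],
        6: [(2, 2, 1), (4, 1, 1)],
        7: [(2, 4, 1), (4, 2, 1)],
        8: [(1, 8, 1), (8, 1, 1)],
    },
    "2sm": {
        1: [(2, 1, 1)],
        2: [(2, 2, 1), (4, 1, 1)],
        3: [(2, 4, 1), (4, 2, 1)],
        4: [(4, 4, 1)],
        6: [(8, 1, 1)],
    },
}


def shapes_up_to(level: int, kind: str) -> List[Shape]:
    """Static cluster shapes for `kind` ("1sm" or "2sm") at or below `level`.

    ASSUMPTION: levels are cumulative; this is not confirmed by the text.
    """
    table = CLUSTER_SHAPES[kind]
    return [s for lvl in sorted(table) if lvl <= level for s in table[lvl]]


print(shapes_up_to(2, "2sm"))
```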
Instruction shape levels control the selection of MMA shapes used in kernel generation:
- **Level 0**: Generates the "default" shape only.
- **Level 1**: Includes additional shapes for FP8, FP6, and FP4 as well as MX and NVFP4.
- **Level 2**: Includes small tile shapes.
- **Level 3**: Includes some non-power of 2 shapes.
- **Level 4**: Includes further small tile shapes and non-power of 2 shapes.
- **Level 5**: Includes all shapes.
The detailed definition of the instantiation levels controlling cluster shape and instruction shape can be found in [sm100_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm100_shapes.py).
## CUTLASS Profiler usage
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
@ -577,6 +626,10 @@ cutlass3x_sm90_tensorop_gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_nnn_align8_
* `f16_f16_f16_void_f16`: In this case, C type is set to `void`, indicating that residual matrix support
is disabled.
## Further Documentation
For documentation on profiling blockwise and groupwise (software-scaled) GEMMs, see the [example 81 README](https://github.com/NVIDIA/cutlass/blob/main/examples/81_blackwell_gemm_blockwise/README.md).
# Convolution
The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards