v4.2 release. (#2587)

* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.

* v4.2 release.
Junkai-Wu
2025-08-23 06:11:24 +08:00
committed by GitHub
parent 11cad1f67b
commit a49a78ffef
351 changed files with 28182 additions and 2032 deletions

@@ -93,6 +93,12 @@ An instantiation level `500`, which is padded to `0500`, thus indicates:
- **Cluster Sizes**: At level 5, allowing for clusters with 1, 2, 4, 8, or 16 CTAs.
- **Schedule Pruning**: At level 0, where pruning is applied according to the existing `generator.py` behavior.
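
The padding and per-digit reading described above can be sketched in a few lines of Python. This is only an illustration, not code from `generator.py`: the helper name `decode_instantiation_level` is hypothetical, and the digit order (thousands digit for schedule pruning, hundreds for cluster sizes, tens for MMA shape multipliers, units for instruction shapes) is an assumption consistent with the `0500` example.

```python
def decode_instantiation_level(level):
    """Split an instantiation level into its four per-category digits.

    Hypothetical helper for illustration only. The digit-to-category
    mapping is an assumption, not taken from generator.py.
    """
    padded = str(level).zfill(4)  # e.g. 500 -> "0500"
    if len(padded) != 4 or not padded.isdigit():
        raise ValueError(f"expected an integer in [0, 9999], got {level!r}")
    pruning, cluster, mma_multiplier, instruction = (int(c) for c in padded)
    return {
        "padded": padded,
        "schedule_pruning": pruning,        # thousands digit (assumed)
        "cluster_sizes": cluster,           # hundreds digit (assumed)
        "mma_multipliers": mma_multiplier,  # tens digit (assumed)
        "instruction_shapes": instruction,  # units digit (assumed)
    }


if __name__ == "__main__":
    print(decode_instantiation_level(500))
```

Under these assumptions, level `500` decodes to cluster sizes at level 5 and every other category at level 0, which matches the list above.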
## Instantiating more MMA shapes with Hopper
When instantiating more tile shapes, especially non-power-of-2 Tile-N shapes, make sure to enable `CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES`.
Enabling it may increase per-kernel compilation times.
When `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` is set, `CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES` is enabled by default.
## Mixed input data type kernels for Hopper
With Hopper (SM90), the kernel generator will generate the following combinations of mixed input data types ("mixed dtype"):