v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set on the command line. * v4.2 release.
@@ -153,7 +153,6 @@ For example,

To compute the strides of the strided layout, the residues of the above operation are used to scale the strides of `A`. For instance, the last example, `(3,6,2,8):(w,x,y,z) / 72`, produces `(3*w,6*x,2*y,2*z)` as the strides of the strided layout.

As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **stride divisibility condition** and is statically checked in CuTe when possible.

2. Keep the first `s` elements of the newly strided `A` so that the result has a compatible shape with `B`. This can be computed by "modding out" the first `s` elements from the shape of `A`, starting from the left.

@@ -175,11 +174,8 @@ Again, this operation must satisfy a **shape divisibility condition** to yield a

From the above examples, we can construct the composition `(3,6,2,8):(w,x,y,z) o 16:9 = (1,2,2,4):(3*w,3*x,y,z)`.
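
To make the divide-then-mod procedure concrete, here is a small Python sketch that reproduces this composition and checks it pointwise. It is an illustrative model, not CuTe's implementation; `layout_value`, `divide`, and `mod` are hypothetical helper names, and `w,x,y,z` are given small distinct placeholder values.

```python
# A minimal Python model of the divide-then-mod procedure described above.
# Illustrative only: real CuTe does this statically on Layout types.

def layout_value(shape, stride, idx):
    """Evaluate a layout at a 1-D coordinate, peeling modes left to right."""
    val = 0
    for s, r in zip(shape, stride):
        val += (idx % s) * r
        idx //= s
    return val

def divide(shape, stride, d):
    """Step 1: divide the layout by d, scaling each consumed mode's stride."""
    out_shape, out_stride = [], []
    for s, r in zip(shape, stride):
        if d >= s:
            assert d % s == 0, "stride divisibility condition violated"
            out_shape.append(1)
            out_stride.append(r * s)
            d //= s
        else:
            assert s % d == 0, "stride divisibility condition violated"
            out_shape.append(s // d)
            out_stride.append(r * d)
            d = 1
    assert d == 1
    return out_shape, out_stride

def mod(shape, stride, m):
    """Step 2: keep the first m elements by modding the shape out from the left."""
    out_shape = []
    for s in shape:
        c = min(s, m)
        assert s % c == 0 and m % c == 0, "shape divisibility condition violated"
        out_shape.append(c)
        m //= c
    assert m == 1
    return out_shape, list(stride)

# (3,6,2,8):(w,x,y,z) o 16:9, with distinct placeholder strides for w,x,y,z.
w, x, y, z = 1, 100, 10000, 1000000
A = ([3, 6, 2, 8], [w, x, y, z])
shape, stride = divide(*A, 9)           # (1,2,2,8):(3w,3x,y,z)
shape, stride = mod(shape, stride, 16)  # (1,2,2,4):(3w,3x,y,z)

# The composed layout agrees pointwise with A(B(c)), where B = 16:9, B(c) = 9c.
assert all(layout_value(shape, stride, c) == layout_value(*A, 9 * c) for c in range(16))
```

The final assertion checks the defining property `R(c) = A(B(c))` for every coordinate of `16:9`, confirming the result layout `(1,2,2,4):(3*w,3*x,y,z)`.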

---

#### Example 1 -- Worked Example of Calculating a Composition

We provide a more complex example of composition, where both operand layouts are multi-modal, to illustrate the concepts introduced above.

```
Functional composition, R := A o B
R(c) := (A o B)(c) := A(B(c))
@@ -223,7 +219,6 @@ Putting this together and coalescing each mode, we obtain the result
R = A o B
  = ((2, 2), 3): ((24, 2), 8)
```

#### Example 2 -- Reshape a layout into a matrix

Consider the composition `20:2 o (5,4):(4,1)`.
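
Composing the strided vector layout `20:2` with `(5,4):(4,1)` reshapes it into a 5x4 matrix. Since `B(i,j) = 4i + j` and `A(x) = 2x`, the composition evaluates to `2*(4i + j) = 8i + 2j`, i.e. the layout `(5,4):(8,2)`. The following sketch (an illustrative model, not CuTe's implementation) verifies this pointwise:

```python
# Pointwise check that 20:2 o (5,4):(4,1) equals (5,4):(8,2),
# i.e. composition reshapes the strided vector 20:2 into a 5x4 matrix.

def layout_value(shape, stride, idx):
    """Evaluate a layout at a 1-D coordinate, peeling modes left to right."""
    val = 0
    for s, r in zip(shape, stride):
        val += (idx % s) * r
        idx //= s
    return val

A = ([20], [2])        # 20:2
B = ([5, 4], [4, 1])   # (5,4):(4,1)
R = ([5, 4], [8, 2])   # expected result: (5,4):(8,2)

# R(c) == A(B(c)) for every coordinate in the domain of B.
assert all(layout_value(*A, layout_value(*B, c)) == layout_value(*R, c) for c in range(20))
```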

@@ -138,13 +138,15 @@ In principle, layout strides may be any integer module.

CuTe's basis elements live in the header file `cute/numeric/arithmetic_tuple.hpp`.
To make it easy to create `ArithmeticTuple`s that can be used as strides, CuTe defines normalized basis elements using the `E` type alias. "Normalized" means that the scaling factor of the basis element is the compile-time integer 1.

| C++ object | Description | String representation |
| --- | --- | --- |
| `E<>{}` | `1` | `1` |
| `E<0>{}` | `(1,0,...)` | `1@0` |
| `E<1>{}` | `(0,1,0,...)` | `1@1` |
| `E<0,0>{}` | `((1,0,...),0,...)` | `1@0@0` |
| `E<0,1>{}` | `((0,1,0,...),0,...)` | `1@1@0` |
| `E<1,0>{}` | `(0,(1,0,...),0,...)` | `1@0@1` |
| `E<1,1>{}` | `(0,(0,1,0,...),0,...)` | `1@1@1` |

The "description" column in the above table
interprets each basis element as an infinite tuple of integers,
@@ -155,7 +157,9 @@ For example, `E<1>{}` has a 1 in position 1: `(0,1,0,...)`.

Basis elements can be *nested*.
For instance, in the above table, `E<0,1>{}` means that
in position 0 there is an `E<1>{}`: `((0,1,0,...),0,...)`. Similarly,
`1@1@0` means that `1` is lifted to position 1 to create `1@1`: `(0,1,0,...)`,
which is then lifted again to position 0.
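
This lifting process can be modeled with a few lines of Python, truncating the infinite tuples from the table to a fixed width. This is illustrative only; `basis` and `basis_str` are hypothetical helpers, not CuTe APIs.

```python
# Model E<modes...> as nested tuples: lift 1 into each mode, innermost first.
# Truncates the infinite tail of each tuple to `width` entries.

def basis(*modes, width=3):
    elt = 1
    for m in reversed(modes):
        outer = [0] * width
        outer[m] = elt       # place the inner element at position m
        elt = tuple(outer)
    return elt

def basis_str(*modes):
    """String form: each lift of position m appends '@m', innermost lift first."""
    return "1" + "".join(f"@{m}" for m in reversed(modes))

assert basis() == 1                      # E<>{}    is 1
assert basis(1) == (0, 1, 0)             # E<1>{}   is (0,1,0,...)
assert basis(0, 1) == ((0, 1, 0), 0, 0)  # E<0,1>{} is ((0,1,0,...),0,...)
assert basis_str(0, 1) == "1@1@0"
assert basis_str(1, 0) == "1@0@1"
```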

Basis elements can be *scaled*.
That is, they can be multiplied by an integer *scaling factor*.
@@ -13,4 +13,5 @@ Getting Started
Terminology<terminology.md>
Fundamental Types<fundamental_types.md>
Programming Guidelines<programming_guidelines.md>
GEMM Heuristics<heuristics.md>

media/docs/cpp/heuristics.md (new file, 102 lines)
@@ -0,0 +1,102 @@

# GEMM Heuristics

## Overview

GEMM heuristics in `cutlass_library` aim to reduce the search space for runtime autotuning, so that only a subset of valid kernels needs to be built and profiled for a given set of GEMM problems. This implementation uses NVIDIA's `nvidia-matmul-heuristics`, an analytical heuristic that ranks GEMM kernels by estimated performance given a problem size and hardware SKU. You can find more info in [the docs](https://docs.nvidia.com/cuda/nvidia-matmul-heuristics).

## Coverage

GEMM heuristics in `cutlass_library` are an experimental feature; exhaustive functional or performance coverage is not guaranteed. The feature currently supports the following.

Problem space:
- Plain dense GEMM for `f8`, `f16`, `f32`

Hardware:
- Hopper (sm9x)
- Blackwell (sm10x)

## Usage / Quick Start

### Install Dependencies

Using the wheel is recommended:
```
pip install nvidia-matmul-heuristics
```

### Prepare Input File

Prepare a list of GEMM problem definitions, as a JSON list, to be evaluated by the heuristic. Here is a sample file with two problems:
```
[
  {
    "m": 4096,
    "n": 4096,
    "k": 4096,
    "batch_count": 1,
    "layout": "tnn",
    "dtype_a": "f16",
    "dtype_b": "f16",
    "dtype_c": "f16",
    "dtype_acc": "f32",
    "dtype_d": "f16",
    "beta": 0.0,
    "use_fast_acc": false
  },
  {
    "m": 4096,
    "n": 4096,
    "k": 4096,
    "batch_count": 1,
    "layout": "tnn",
    "dtype_a": "e5m2",
    "dtype_b": "e5m2",
    "dtype_c": "f32",
    "dtype_acc": "f32",
    "dtype_d": "e5m2",
    "beta": 0.0,
    "use_fast_acc": true
  }
]
```

Note: `use_fast_acc` only needs to be specified for FP8 kernels on SM90. Otherwise, it is ignored.
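
Since the problem list is plain JSON, it can be sanity-checked before the build. Here is a small helper for that; it is hypothetical convenience code, not part of `cutlass_library`, and the required field names are taken from the sample file above.

```python
import json

# Required fields per problem entry, taken from the sample problem list above.
REQUIRED = {"m", "n", "k", "batch_count", "layout",
            "dtype_a", "dtype_b", "dtype_c", "dtype_acc", "dtype_d", "beta"}

def validate_problems(text):
    """Parse a problem-list JSON string and check required fields per entry."""
    problems = json.loads(text)
    assert isinstance(problems, list), "top-level value must be a JSON list"
    for i, p in enumerate(problems):
        missing = REQUIRED - p.keys()
        assert not missing, f"problem {i} missing fields: {sorted(missing)}"
    return problems

# Validate a single-problem list (use_fast_acc is optional, see the note above).
sample = '''[{"m": 4096, "n": 4096, "k": 4096, "batch_count": 1,
              "layout": "tnn", "dtype_a": "f16", "dtype_b": "f16",
              "dtype_c": "f16", "dtype_acc": "f32", "dtype_d": "f16",
              "beta": 0.0, "use_fast_acc": false}]'''
assert len(validate_problems(sample)) == 1
```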

### Build

Build CUTLASS with CMake as normal, adding the heuristics-specific options. Hardware details are detected automatically; for offline builds, use `-DCUTLASS_LIBRARY_HEURISTICS_GPU`.
For example, here is a minimal command for NVIDIA's Hopper architecture (sm90):

```
$ cmake .. \
  -DCUTLASS_NVCC_ARCHS=90a \
  -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path_to_your_problem_list.json> \
  -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=<number of configurations to build per problem>
  ...
  ...

$ make cutlass_profiler -j
```

This will produce a CSV testlist providing all testcases that need to be run to perform autotuning over the built configurations, including kernel runtime options. The location of this file can be changed with the CMake option `-DCUTLASS_LIBRARY_HEURISTICS_TESTLIST_FILE`.

CUTLASS CMake currently supports the following options for heuristics:
- `CUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE`: Path to the file containing a JSON list of GEMM problems.
- `CUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM`: Maximum number of configurations the heuristic will return for each GEMM problem. The same configuration or kernel can be suggested for multiple problems.
- `CUTLASS_LIBRARY_HEURISTICS_RESTRICT_KERNELS`: Limits the build to only the set of kernels instantiated by the default CUTLASS CMake build flow, composing with other options such as `CUTLASS_LIBRARY_INSTANTIATION_LEVEL`. Set this to `ON` as a workaround if the heuristic suggests kernel configurations that do not build on your platform (possible for some unsupported or experimental use cases). This option is `OFF` by default, which builds all of the suggested configurations.
- `CUTLASS_LIBRARY_HEURISTICS_TESTLIST_FILE`: Path to the output CSV which will contain the testcases to be used for autotuning, consumable by `cutlass_profiler`.
- `CUTLASS_LIBRARY_HEURISTICS_GPU`: The GPU to use for heuristics; for instance, `H100_SXM5`. Used for offline builds. If unset, hardware properties are auto-detected using the CUDA Driver API. See `generator.py` for valid GPU strings.

### Profile

Use the emitted testlist CSV with `cutlass_profiler` to collect performance data, which can be used to determine the fastest built kernel configuration for each of the input problems. The following example profiles each testcase for a fixed 50 ms:
```
cutlass_profiler --operation=Gemm --testlist-file=<path_to_your_testlist.csv> --profiling-iterations=0 --profiling-duration=50 --verification-enabled=false --output=<path_to_outfile>
```

## Direct Usage in Python

If you have pre-built CUTLASS kernels or custom CUTLASS emitters, you can use the Python APIs directly to select kernels to build or profile. See `filter_manifest_and_write_heuristics_file()` in `heuristics.py` for example usage.

@@ -93,6 +93,12 @@ An instantiation level `500`, which is padded to `0500`, thus indicates:
- **Cluster Sizes**: At level 5, allowing for clusters with 1, 2, 4, 8, or 16 CTAs.
- **Schedule Pruning**: At level 0, where pruning is applied according to the existing `generator.py` behavior.

## Instantiating more MMA shapes with Hopper

When instantiating more tile shapes, especially non-power-of-2 Tile-N shapes, make sure to enable `CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES`.
This may lead to some increase in per-kernel compilation times.
When `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` is set, `CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES` is enabled by default.

## Mixed input data type kernels for Hopper

With Hopper (SM90), the kernel generator will generate the following combinations of mixed input data types ("mixed dtype"):