CUTLASS 3.6.0 (#1850)

* v3.6 * update changelog * update readme * fix typo * fixing typos * hopper gemm with weight prefetch --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 12:33:27 -07:00
parent 0837a2a00a
commit cc3c29a81a
354 changed files with 105943 additions and 8203 deletions
--- a/media/docs/dependent_kernel_launch.md
+++ b/media/docs/dependent_kernel_launch.md
@ -0,0 +1,32 @@
+[README](../../README.md#documentation) > **Dependent kernel launch**
+
+# Dependent kernel launches
+
+The Hopper architecture supports a new feature through which two kernels in the same stream can
+overlap their execution, named 
+[Programmatic Dependent Launch (PDL)](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization).
+This allows kernels that conflict in global memory to programmatically and safely overlap portions
+of their execution. The primary kernel can signal that it is about to finish execution, and the next kernel can
+optionally wait for the previous kernel to finish flushing its memory.
+
+For more information, we refer you to the [PDL section in the CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization).
+
+## Using dependent launch in CUTLASS
+
+When building CUTLASS, you can use the `CUTLASS_ENABLE_GDC_FOR_SM90` macro to 
+enable PDL-related instructions in Hopper kernels:
+
+```
+cmake . -DCUTLASS_ENABLE_GDC_FOR_SM90=1
+```
+
+Note that this only adds PDL-related instructions to the _kernels_, but to actually allow a dependent
+launch, you must also run your GEMM kernel with PDL:
+
+```
+gemm.run(
+  /* stream = */ stream,
+  /* cuda_adapter = */ nullptr,
+  /* launch_with_pdl = */ true
+);
+```
--- a/media/docs/profiler.md
+++ b/media/docs/profiler.md
@ -5,7 +5,7 @@
 # CUTLASS Profiler

 The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
-defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse Gemm, 
+defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse Gemm,
 Conv2d, and Conv3d kernel.

 The CUTLASS Profiler may be compiled with:
@ -13,8 +13,8 @@ The CUTLASS Profiler may be compiled with:
 $ make cutlass_profiler -j
 ```

-To limit compilation time, only one tile size (typically 128x128) and threadblock cluster size (typically 2x1x1) is instantiated for each data type, 
-math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an 
+To limit compilation time, only one tile size (typically 128x128) and threadblock cluster size (typically 2x1x1) is instantiated for each data type,
+math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an
 empty `build/` directory.
 ```bash
 $ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all  -DCUTLASS_UNITY_BUILD_ENABLED=ON
@ -24,7 +24,68 @@ $ make cutlass_profiler -j
 Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing size of the compiled
 binary and avoiding linker limitations on some platforms.

-The CUTLASS Profiler sources are stored in 
+### Instantiating more kernels with Hopper
+With Hopper (SM90), you will need to use an additional flag,
+`CUTLASS_LIBRARY_INSTANTIATION_LEVEL`, to instantiate all possible combinations,
+which, unlike on previous architectures, number on the order of millions of kernels.
+Because generating and filtering these kernels alone can take hours,
+`CUTLASS_LIBRARY_KERNELS` must be non-empty.
+Exercise caution as well: not all of these configurations are tested, and some may fail to
+compile or fail to launch at runtime.
+
+```bash
+$ cmake .. \
+  -DCUTLASS_NVCC_ARCHS="90a" \
+  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop_s64x64x16gemm_f16_f16_f32_void_f32_*" \
+  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
+  -DCUTLASS_UNITY_BUILD_ENABLED=ON
+```
+
+The CUTLASS profiler uses a four-digit integer, the global instantiation level, to manage the generation of kernel configurations. This level controls the behavior of multiple "generators" by defining how many and which combinations of configurations are produced. A level with fewer than four digits is padded with leading zeros to four digits. Each digit corresponds to a specific category that influences kernel generation. From right to left, the categories are:
+
+0. **Instruction Shape**
+1. **MMA Shape Multiplier**
+2. **Cluster Shape**
+3. **Schedule Pruning**
+
+Cluster shape levels define the number of CTAs (Cooperative Thread Arrays) included in the kernel generation:
+
+- **Level 0**: Only `(1, 2, 1)` cluster shape.
+- **Level 1**: Clusters with 2 CTAs.
+- **Level 2**: Clusters with 1 or 2 CTAs.
+- **Level 3**: Clusters with 1, 2, or 4 CTAs.
+- **Level 4**: Clusters with 1, 2, 4, or 8 CTAs.
+- **Level 5**: Clusters with 1, 2, 4, 8, or 16 CTAs.
+
+The MMA multipliers are combined with MMA instruction shapes (WGMMA shapes) to form CTA shapes. The levels for MMA multipliers determine the configurations generated for different data types.
+- **Levels [0, 3]**: Control the specific configurations generated for various data types.
+- **Level 9**: Activates exhaustive mode, generating all possible configurations.
+
+Higher levels encompass a broader range of CTA configurations, resulting in more comprehensive kernel generation.
+
+Instruction shape levels control the selection of WGMMA shapes used in kernel generation:
+
+- **Level 0**: Generates the "default" shape only.
+- **Level 1**: Includes additional shapes for unpruned cases, specifically for TF32 data type.
+- **Level 2**: Includes shapes that are powers of 2.
+- **Level 3**: Includes all other shapes.
+
+The detailed definitions of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](../../python/cutlass_library/sm90_shapes.py).
+
+Schedule pruning levels decide which epilogue and mainloop schedules are used to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](../../python/cutlass_library/sm90_utils.py):
+
+- **Level >= 1**: Indicates that no pruning is being applied.
+- **Level 0**: Indicates pruning according to existing [generator.py](../../python/cutlass_library/generator.py) behavior.
+
+An instantiation level `500`, which is padded to `0500`, thus indicates:
+
+- **Instruction Shapes**: At level 0, generating only the "default" shape.
+- **MMA Multipliers**: At level 0, generating only one multiplier, `(2, 1, 4)`.
+- **Cluster Sizes**: At level 5, allowing for clusters with 1, 2, 4, 8, or 16 CTAs.
+- **Schedule Pruning**: At level 0, where pruning is applied according to the existing `generator.py` behavior.
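
The digit-by-digit reading above can be sketched in Python. This is a minimal illustration and not the code in `cutlass_library`: the function name `decode_instantiation_level`, the dictionary keys, and the `CLUSTER_CTA_COUNTS` table are hypothetical names chosen here. The table only restates cluster levels 1 through 5 from the list earlier in this section. Level 0 is left out because it selects a single cluster shape instead of a set of CTA counts, and the string value `"max"` used in the CMake example is not handled.

```python
# Hypothetical helper for illustration; not part of cutlass_library.
# CTA counts per cluster shape level, restated from the list in this document.
CLUSTER_CTA_COUNTS = {
    1: [2],
    2: [1, 2],
    3: [1, 2, 4],
    4: [1, 2, 4, 8],
    5: [1, 2, 4, 8, 16],
}


def decode_instantiation_level(level):
    """Split a global instantiation level into its four per-category digits."""
    if not 0 <= level <= 9999:
        raise ValueError("level must have at most four digits")
    padded = f"{level:04d}"  # pad with leading zeros, e.g. 500 -> "0500"
    # The categories are assigned to digits from right to left.
    return {
        "instruction_shape": int(padded[3]),
        "mma_multiplier": int(padded[2]),
        "cluster_shape": int(padded[1]),
        "schedule_pruning": int(padded[0]),
    }


levels = decode_instantiation_level(500)
print(levels)
print(CLUSTER_CTA_COUNTS[levels["cluster_shape"]])
```

For level `500`, the decoded cluster shape digit is 5 and the other three digits are 0, matching the breakdown listed above.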
+
+The CUTLASS Profiler sources are stored in:
+
 ```bash
 tools/
  profiler/
@ -65,6 +126,9 @@ Device:
                                                   profiling phases cycle through different input tensors to induce
                                                   capacity misses in the L2.

+  --allocations=<name>:<device>,<name>:<device>    Pairs of allocation names to devices. If <device> is negative,
+                                                   the execution device is used
+

 Initialization:
  --initialization=<bool>                          Enables initialization (default: true). If false, device memory is
@ -90,8 +154,8 @@ Library:


 Profiling:
-  --workspace-count=<workspace count>              Number of discrete workspaces maintained to avoid cache-resident 
-                                                 If zero (default), the amount is chosen for each workload based on 
+  --workspace-count=<workspace count>              Number of discrete workspaces maintained to avoid cache-resident
+                                                 If zero (default), the amount is chosen for each workload based on
                                                 capacity of the last-level cache.

  --profiling-iterations=<iterations>              Number of iterations to profile each kernel. If zero, kernels
@ -123,7 +187,7 @@ Verification:


 Report:
-  --append=<bool>                                  If true, result is appended to possibly existing file. Otherwise, 
+  --append=<bool>                                  If true, result is appended to possibly existing file. Otherwise,
                                                   any existing file is overwritten.

  --output=<path>                                  Path to output file for machine readable results. Operation kind and '.csv' is appended.
@ -244,6 +308,9 @@ Test your changes to gemm kernels with a quick functional test and save results
   --k=8,16,32,64,128,256,288,384,504,512,520 \
   --beta=0,1,2 --profiling-iterations=1 \
   --providers=cutlass --output=functional-test.csv
+
+Profile when execution is performed on device 0, the C tensor is located on device 1, and D is located on device 2:
+  $ cutlass_profiler --device=0 --allocations=C:1,D:2 --operation=Gemm --m=1024 --n=1024 --k=128
 ```
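
The `<name>:<device>` pairs accepted by `--allocations` can be read as a mapping from allocation name to device ordinal. The Python sketch below illustrates that reading, including the rule that a negative device falls back to the execution device. It is not the profiler's actual parsing code; `parse_allocations` and its parameters are hypothetical names.

```python
# Hypothetical sketch; the profiler's own argument parsing may differ.
def parse_allocations(spec, execution_device):
    """Map allocation names to device ordinals from a 'name:device,...' string."""
    placement = {}
    for pair in spec.split(","):
        name, _, device = pair.partition(":")
        device_id = int(device)
        # A negative device means "use the execution device".
        placement[name] = execution_device if device_id < 0 else device_id
    return placement


print(parse_allocations("C:1,D:2", execution_device=0))
```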

 The format of tensor argument is followed by `<type>:<layout>`. The type could be `f32` as 32-bit floating point, `s8` as 8-bit signed integer, etc. The available types can be referred to the `NumericTypeID_enumerants` in [util.cu](tools/library/src/util.cu). The layout could be `row` or `column`.
@ -322,7 +389,7 @@ $ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=81
 ## Covering the problem space

 All arguments may have single values or comma-delimited set of values. Integers may also be specified
-as an inclusive range with the following syntax `start:end:increment` or simply `start:end`. 
+as an inclusive range with the following syntax `start:end:increment` or simply `start:end`.

 For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
 of 8 elements.
@ -402,7 +469,7 @@ cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_1x1x1_0_n
 ```

 * `warpspecialized_cooperative`: Mainloop employs a persistent warp-specialized mainloop and kernel schedule.
-* `epi_tma`: Kernel epilogue employs TMA based vectorization. 
+* `epi_tma`: Kernel epilogue employs TMA based vectorization.
 * `f16_f16_f16_void_f16`: In this case, C type is set to `void`, indicating that residual matrix support
 is disabled.

@ -413,7 +480,7 @@ operator variants.

 The CUTLASS Profiler can be built with cuDNN enabled to use as a reference implementation. If CMake detects
 the cuDNN library available in the system, it is included as a dependency. This may be explicitly overridden
-with CMake flag `CUTLASS_ENABLE_CUDNN`. 
+with CMake flag `CUTLASS_ENABLE_CUDNN`.

 ```bash
 $ cmake .. -DCUTLASS_LIBRARY_OPERATIONS=conv2d -DCUTLASS_ENABLE_CUDNN=OFF
@ -521,7 +588,7 @@ reference_device: Passed

 Example command line for profiling forward propagation convolution kernels runing on Tensor Cores is as follows:
 ```bash
-$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop  --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 
+$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop  --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3



--- a/media/docs/programming_guidelines.md
+++ b/media/docs/programming_guidelines.md
@ -69,6 +69,9 @@ of child objects are known to be non-overlapping, `union`s may be used to alias
 shared memory region and reduce overall shared memory capacity.  Developers should carefully note that C++ `union` rules
 require that they only access the most recently written ("active") member of the `union`; this differs from C rules.

+For host to device ABI compatibility, inheritance from a class is only permitted if the superclass is unique to the
+child class. This is most easily achieved by templating the parent class by the child class (CRTP).
+
 ### Loop Unrolling

 CUTLASS requires tiles of data to be stored in registers for high-bandwidth access. Simultaneously, high-throughput math instructions
@ -1060,7 +1063,7 @@ constexpr auto second_form(T t) {

 In this form, the `else` branch had a `static_assert` that was meant always to fail if the `else` branch were taken, such as `static_assert(sizeof(T) < 0)`.  (Note that we cannot use `static_assert(false)` here, because it will ALWAYS fail at compile time, even if the `else` branch is not taken.  C++23 fixes this behavior, but CUTLASS currently requires that its code be compatible with C++17.  As a result, CUTLASS includes a `dependent_false<T>` library function that you can use in place of the always-`false` test `sizeof(T) < 0`.)

-One can suppress "missing return statement" warnings for both forms by invoking CUTLASS' function-like macro `CUTE_GCC_UNREACHABLE()`.  When building with GCC, this invokes the GCC-specific built-in function `__builtin_unreachable()`.  Actually calling this function is undefined behavior, so using this lets the programmer declare that the code path calling that function will never be taken.  (C++23 introduces the `std::unreachable()` function, which achieves the same goal.  Again, though, CUTLASS cannot currently use C++23 library functions.)  Here is an example of how to use `CUTE_GCC_UNREACHABLE()`.
+One can suppress "missing return statement" warnings for both forms by invoking CUTLASS' function-like macro `CUTE_GCC_UNREACHABLE`.  When building with GCC, this invokes the GCC-specific built-in function `__builtin_unreachable()`.  Actually calling this function is undefined behavior, so using this lets the programmer declare that the code path calling that function will never be taken.  (C++23 introduces the `std::unreachable()` function, which achieves the same goal.  Again, though, CUTLASS cannot currently use C++23 library functions.)  Here is an example of how to use `CUTE_GCC_UNREACHABLE`.

 ```c++
 template<class T>
@ -1074,7 +1077,7 @@ constexpr auto second_form(T t) {
  else {
    static_assert(sizeof(T) < 0, "This branch always fails");
  }
-  CUTE_GCC_UNREACHABLE();
+  CUTE_GCC_UNREACHABLE;
 }
 ```

--- a/media/docs/utilities.md
+++ b/media/docs/utilities.md
@ -384,6 +384,54 @@ int main() {
 }
 ```

+## Debugging Asynchronous Kernels with CUTLASS's Built-in `synclog` Tool
+
+CUTLASS provides a built-in tool called `synclog` that enables printing runtime information useful for debugging asynchronous CUTLASS kernels. With the introduction of Warp Specialization in CUTLASS 3.0 for Hopper GPUs, kernel designs now incorporate synchronization among warps. The `synclog` tool simplifies debugging efforts for these asynchronous programs by recording and displaying timing information for synchronization events.
+
+### Enabling `synclog`
+To enable `synclog`, add the `-DCUTLASS_ENABLE_SYNCLOG=1` flag during compilation. From the CUTLASS root directory:
+
+```
+$ mkdir build && cd build
+$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_ENABLE_SYNCLOG=1
+```
+
+### Building and Running with `synclog`
+After enabling `synclog`, build your CUTLASS example. For instance, to build example 54:
+
+```
+$ cd examples/54_hopper_fp8_warp_specialized_gemm
+$ make
+```
+
+Run the example, setting the profiling iteration count to 0 to ensure `synclog` information is printed only for the reference run:
+
+```
+$ ./54_hopper_fp8_warp_specialized_gemm --iterations=0 &> synclog.txt
+```
+
+### Interpreting `synclog` output
+The `synclog.txt` file will contain runtime information about synchronization events. Here's a sample output snippet:
+
+```
+synclog start
+synclog at 1: cluster_barrier_init line=281 time=1725400116233388736 thread=0,0,0 block=0,0,0 smem_addr=197632 arrive_count=1
+synclog at 13: fence_barrier_init line=583 time=1725400116233388768 thread=32,0,0 block=0,0,0 
+...
+```
+
+Each line in the main body follows this format:
+```
+synclog at [synclog_at]: [header] line=[line] time=[time] thread=[threadIdx.xyz] block=[blockIdx.xyz]
+```
+* `synclog at`: Address in the `synclog` output buffer (in bytes). Output exceeding 2^26 bytes is discarded.
+* `header`: Name of the synchronization event.
+* `line`: Code line number of the synchronization operation calling into `synclog`.
+* `time`: Timing information recorded for the synchronization event, as shown in the sample output above.
+
+Additional information may appear at the end of each line, such as shared memory address, phase bit, and arrive count. For more detailed information on `synclog` output, refer to [synclog.hpp](../../include/cutlass/arch/synclog.hpp) in the CUTLASS source code. 
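
When a `synclog.txt` file grows large, it can help to load the events into a script for filtering. The Python sketch below parses one line into a dictionary. It is a hypothetical helper, not a tool shipped with CUTLASS, and the field layout is inferred only from the sample output shown above, so it may need adjusting for other event types.

```python
import re

# Hypothetical parser; the layout is inferred from the sample output above.
LINE_RE = re.compile(r"^synclog at (\d+): (\w+)\s*(.*)$")


def parse_synclog_line(line):
    """Return a dict for one synclog event line, or None for any other line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = {"at": int(match.group(1)), "header": match.group(2)}
    # The rest of the line is a sequence of key=value pairs, kept as strings.
    for token in match.group(3).split():
        key, _, value = token.partition("=")
        event[key] = value
    return event


sample = (
    "synclog at 1: cluster_barrier_init line=281 time=1725400116233388736 "
    "thread=0,0,0 block=0,0,0 smem_addr=197632 arrive_count=1"
)
event = parse_synclog_line(sample)
print(event["header"], event["line"], event["thread"])
```

Lines such as `synclog start` do not match the pattern and return `None`, so they can be skipped when iterating over the file.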
+
+Please note that `synclog` is an experimental feature, and its functionality is not always guaranteed. We encourage its use in custom kernels and CUTLASS examples, though it is known to be incompatible with profiler kernels.
+
 # Copyright

 Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.