@@ -50,18 +50,6 @@ before attempting to clone or build CUTLASS.
[This Microsoft help article](https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry)
explains different ways to change the registry setting.

-# Limitations
-
-Currently, it's possible to build examples and tests.
-Building the CUTLASS library (e.g., for profiling) with default settings does not currently work,
-because Visual Studio's linker cannot handle more than 65535 symbols in a library.
-(The symptom of this issue is a LNK1189 linker error.)
-The known way to work around this Visual Studio limitation is to disable building CUTLASS's library,
-by setting the CMake option `CUTLASS_ENABLE_LIBRARY` to `OFF`.
-Another approach may be to limit the number of kernels in the library
-by setting the CMake option `CUTLASS_LIBRARY_KERNELS`
-so that CUTLASS tries to put fewer kernels in the library.
-
# Set up build environment

1. Run "git bash" to get a familiar command-line interface
@@ -72,7 +60,7 @@ so that CUTLASS tries to put fewer kernels in the library.

4. Create the `build` subdirectory in the CUTLASS clone directory, and run CMake in it,
   specifying whatever CMake options are desired, e.g.,
-   `cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_ENABLE_LIBRARY=OFF`
+   `cmake .. -DCUTLASS_NVCC_ARCHS=90a`

Alternate approaches may rely on the CMake GUI and/or Windows' native command line.
@@ -91,3 +79,12 @@ Unlike with CMake's Makefile or Ninja generators,
`CMAKE_BUILD_TYPE` has no effect on the Visual Studio generator,
because the Visual Studio generator creates all build configurations.

# Tips

With Windows builds, one may find that CMake reruns unnecessarily.
For example, cancelling a build and starting it again may rerun CMake.
This in turn touches build files, resulting in unnecessary rebuilds.
One work-around is to set the CMake option `CMAKE_SUPPRESS_REGENERATION=ON`.
However, this turns off CMake's ability to detect on its own when it needs to rerun.
As a result, one will need to know when to rerun CMake by hand.
@@ -9,7 +9,7 @@ Clang as both host and device compiler ("CUDA Clang").

# Software prerequisites

-1. Clang (regularly tested with Clang 14;
+1. Clang (regularly tested with Clang 17;
   occasionally tested with Clang 10 and greater)

2. CUDA Toolkit (tested with 12.2; other versions likely work)
@@ -166,10 +166,10 @@ The `make_tensor_like` function makes an owning Tensor of register memory with t
Calling `print` on each of the above tensors produces similar output

```
-rmem_4x8_col  : ptr[32b](0x7ff1c8fff820) o (_4,_8):(_1,_4)
-rmem_4x8_row  : ptr[32b](0x7ff1c8fff8a0) o (_4,_8):(_8,_1)
-rmem_4x8_pad  : ptr[32b](0x7ff1c8fff920) o (_4,_8):(_32,_2)
-rmem_4x8_like : ptr[32b](0x7f4158fffc60) o (_4,_8):(_8,_1)
+rmem_4x8_col  : ptr[32b](0x7fff48929460) o (_4,_8):(_1,_4)
+rmem_4x8_row  : ptr[32b](0x7fff489294e0) o (_4,_8):(_8,_1)
+rmem_4x8_pad  : ptr[32b](0x7fff489295e0) o (_4,_8):(_32,_2)
+rmem_4x8_like : ptr[32b](0x7fff48929560) o (_4,_8):(_8,_1)
```

and we can see that each pointer address is unique, indicating that each `Tensor` is a unique array-like allocation.
@@ -195,7 +195,7 @@ For example, we can read and write to `Tensor`s using natural coordinates, using

```c++
Tensor A = make_tensor<float>(Shape <Shape < _4,_5>,Int<13>>{},
-                              Stride<Stride<_12,_1>,_64>{});
+                              Stride<Stride<_12,_1>, _64>{});
float* b_ptr = ...;
Tensor B = make_tensor(b_ptr, make_shape(13, 20));
@@ -317,7 +317,7 @@ Another common partitioning strategy is called a thread-value partitioning. In t
// to 1D coordinates within a 4x8 tensor
// (T8,V4) -> (M4,N8)
auto tv_layout = Layout<Shape <Shape <_2,_4>,Shape <_2, _2>>,
                        Stride<Stride<_8,_1>,Stride<_4,_16>>>{};    // (8,4)

// Construct a 4x8 tensor with any layout
Tensor A = make_tensor<float>(Shape<_4,_8>{}, LayoutRight{});       // (4,8)
@@ -195,7 +195,7 @@ As is evident, these smem layouts can be almost anything. Inside the kernel, the
CUTE_STATIC_ASSERT_V(size<1>(BSmemLayout{}) == size<2>(cta_tiler));            // BLK_K
```

Use of static layouts has a few advantages.
* Static layouts let us statically allocate shared memory as shown below.
* Static layouts are often more efficient and allow CuTe to dispatch to optimized implementations.
* Static layouts make it easier to prove correctness of the algorithm and provide checks like the above -- the smem layout sizes are the same as the CTA tile sizes.
@@ -227,7 +227,7 @@ if (thread0()) {
```
This would work, but we have lots of threads to use inside this CTA, so let's use them!

If we partition the two tiles of data across the threads in the CTA, then each thread can copy its own subtensor of data. There are lots of ways this partitioning could occur, however.

The `gemm_nt` function defines two layouts of *threads* as
```c++
@@ -295,7 +295,7 @@ if (thread0()) {
```
This would work, but we have lots of threads to use inside this CTA, so let's use them!

If we partition the output tile `gC` across the threads in the CTA, then each thread can compute its own subtensor. There are lots of ways this partitioning could occur, however.

The `gemm_nt` and `gemm_tn` functions define one more layout of *threads*:
```cpp
@@ -332,7 +332,7 @@ These thread layouts are then used to partition the tiles of data in global memo
CUTE_STATIC_ASSERT_V(size<1>(tCrC) == size<0>(tCsB));                // THR_N
CUTE_STATIC_ASSERT_V(size<1>(tCsA) == size<1>(tCsB));                // BLK_K
```
where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`.

<p align="center">
  <img src="../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
media/docs/ide_setup.md (new file, 122 lines)
@@ -0,0 +1,122 @@
[README](../../README.md#documentation) > **IDE Setup for CUTLASS Development**

# IDE Setup for CUTLASS Development

This document outlines instructions and tips for setting up a local editor for CUTLASS development, including support
for intellisense, go-to-definition, code formatting, and so on.

## Overview
In order for any intellisense tool to work with CUTLASS, the following need to be configured:
* Include paths, i.e., where the compiler (or in this case, the intellisense tool) should look for header files
* Compiler flags, especially the C++ standard (`--std`)
* Preprocessor variables, especially CUDA-related ones

One usually needs to configure the above variables in a settings file. Below, two config approaches are described:
one for VSCode, and one for any editor that uses the clangd language server, which includes
Vim, Emacs, NeoVim, Sublime Text, and so on. Note that VSCode can also be configured to use clangd,
and it may be worth setting up clangd for VSCode rather than the default intellisense:
clangd often gives faster responses and more stable performance.
## VSCode Setup

1. Install the [official C/C++ extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools)
1. Open the settings:
   1. `Ctrl+Shift+P` to open the command palette
   1. Enter "C/C++" to filter results
   1. Select "C/C++ Edit Configurations (UI)" (or "... (JSON)" if you feel like editing the raw JSON)
   1. View the documentation for these settings
      [here](https://code.visualstudio.com/docs/cpp/c-cpp-properties-schema-reference)
1. Edit "Include Path" to set up **include paths**. For CUTLASS, this includes the following:
   * `${workspaceFolder}/include`
   * `${workspaceFolder}/tools/util/include`
   * `${workspaceFolder}/examples/common`
   * ...others, depending on which files you edit
1. Edit the C++ standard to be `c++17`, `gnu++17`, or equivalent.
1. Edit `defines` to define preprocessor variables. See
   [Global Config below](#Global-Config) for examples. The important
   ones include `__CUDACC_VER_MAJOR__`, `__CUDA_ARCH__`, and `__CUDA_ARCH_FEAT_SM90_ALL__`;
   configure them according to your target architecture.
1. ...and possibly edit any other fields for your specific setup.
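Putting the steps above together, a `c_cpp_properties.json` might look like the following sketch. The define values are illustrative, not required; match them to your CUDA version and target architecture.

```json
{
  "version": 4,
  "configurations": [
    {
      "name": "CUTLASS",
      "includePath": [
        "${workspaceFolder}/include",
        "${workspaceFolder}/tools/util/include",
        "${workspaceFolder}/examples/common"
      ],
      "defines": [
        "__CUDACC_VER_MAJOR__=12",
        "__CUDA_ARCH__=900",
        "__CUDA_ARCH_FEAT_SM90_ALL__"
      ],
      "cppStandard": "c++17"
    }
  ]
}
```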
## clangd Setup

`clangd` is a C++ language server that is part of the LLVM project. You must first set it up for your specific IDE:
* See clangd's official [documentation](https://clangd.llvm.org/installation#editor-plugins) for editor setup.
* NeoVim setup is possible through [lsp](https://neovim.io/doc/user/lsp.html) and either manually installing clangd or
  using an installation manager like Mason.

Then, one needs to edit the config ([documentation](https://clangd.llvm.org/config)). One typically has a
**global** and a **per-project** config.

### Global Config

On Linux, the global config is usually located at `~/.config/clangd/config.yaml`. Here is one example config
for CUDA projects on SM90.
The key settings here are the preprocessor variables (`-D__CUDACC_VER_MAJOR__`, `-D__CUDA_ARCH__`).
```yaml
CompileFlags:
  Compiler: /usr/local/cuda/bin/nvcc
  Add:
    - --cuda-path=/usr/local/cuda
    - --cuda-gpu-arch=sm_90a
    - -I/usr/local/cuda/include
    - "-xcuda"
    # report all errors
    - "-ferror-limit=0"
    - --std=c++17
    - "-D__INTELLISENSE__"
    - "-D__CLANGD__"
    - "-DCUDA_12_0_SM90_FEATURES_SUPPORTED"
    - "-DCUTLASS_ARCH_MMA_SM90_SUPPORTED=1"
    - "-D_LIBCUDACXX_STD_VER=12"
    - "-D__CUDACC_VER_MAJOR__=12"
    - "-D__CUDACC_VER_MINOR__=3"
    - "-D__CUDA_ARCH__=900"
    - "-D__CUDA_ARCH_FEAT_SM90_ALL"
    - "-Wno-invalid-constexpr"
  Remove:
    # strip CUDA fatbin args
    - "-Xfatbin*"
    # strip CUDA arch flags
    - "-gencode*"
    - "--generate-code*"
    # strip CUDA flags unknown to clang
    - "-ccbin*"
    - "--compiler-options*"
    - "--expt-extended-lambda"
    - "--expt-relaxed-constexpr"
    - "-forward-unknown-to-host-compiler"
    - "-Werror=cross-execution-space-call"
Hover:
  ShowAKA: No
InlayHints:
  Enabled: No
Diagnostics:
  Suppress:
    - "variadic_device_fn"
    - "attributes_not_allowed"
```
### Local Config
A local config is needed to specify per-project settings, especially include paths. An example is:
```yaml
CompileFlags:
  Add:
    - -I</absolute/path/to/cutlass>/include/
    - -I</absolute/path/to/cutlass>/tools/util/include/
    - -I</absolute/path/to/cutlass>/examples/common/
```

Note that absolute paths are needed, since clangd doesn't support relative paths.
### Note on compile_commands.json
For typical C++ projects, clangd can *automatically* configure itself by parsing the `compile_commands.json`
generated by your CMake build. The path to such a file is `build/compile_commands.json` by default, and is
configured by the `CompilationDatabase` config.

This is usually a convenient way to configure projects, but it's not as simple for CUDA/nvcc projects, since
clang doesn't understand many of the compiler flags used by nvcc. Hence, for now, we don't recommend using
`compile_commands.json` to configure your CUDA project.
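For reference, on a plain C++ (non-CUDA) project, that compilation database is typically generated by enabling a standard CMake option; this is a sketch of the usual approach, not a CUTLASS-specific recommendation:

```cmake
# In CMakeLists.txt (or pass -DCMAKE_EXPORT_COMPILE_COMMANDS=ON on the
# command line): write build/compile_commands.json for tools like clangd.
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
```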
@@ -210,6 +210,8 @@ GEMM
  [int]    --inst_k,--instruction-shape::k            Math instruction shape in the K dimension
  [int]    --min_cc,--minimum-compute-capability      Minimum device compute capability
  [int]    --max_cc,--maximum-compute-capability      Maximum device compute capability
  [enum]   --raster_order={H|M|N}                     If supported by kernel, sets the tile raster direction
  [int]    --swizzle_size                             If supported by kernel, sets the 2D tile swizzle extent

Examples:

Profile a particular problem size:
@@ -229,6 +231,9 @@ Using various input value distribution:
$ cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
$ cutlass_profiler --operation=Gemm --dist=sequential,start:0,delta:1

Using a CUTLASS 3.x GEMM kernel with a tile scheduler that supports runtime tile remapping and raster order:
$ cutlass_profiler --operation=Gemm --m=2048 --n=2048 --k=2048 --raster_order=M --swizzle_size=2

Run a kernel with cta tile size of 256x128x32 and save workspace if results are incorrect (note that --cta-tile::k=32 is default cta-tile size):
$ cutlass_profiler --operation=Gemm --cta_m=256 --cta_n=128 --cta_k=32 --save-workspace=incorrect
@@ -92,9 +92,13 @@ for (int idx = 0; idx < kN; ++idx) {  // Loop has constant number of iterati
  // direct register access.
}
```

## Style

### If you see an issue in code formatting, fix it

You are empowered to reformat code.
Please, however, consider making reformatting changes separately from content-related changes.

### No automatic code formatting

Do not use any kind of automatic code formatting,
@@ -128,48 +132,111 @@ and we should always strive to eliminate them.

* [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html)

#### C is not a subset of C++

C is not a subset of C++.
Some valid C is not valid C++, and some valid "C-looking" C++ is not valid C.
See, e.g., the informative C++ Standard Committee (WG21) document
[P2735R0](https://isocpp.org/files/papers/P2735R0.pdf),
which explains ways in which the same code has different behavior in C vs. C++.
In some cases, code that compiles in both C and C++,
and is correct in C, has undefined behavior (can crash or worse) in C++.
The "type.punning" section of P2735R0 specifically relates to unions.
#### Spacing and line length

* Use spaces, not tabs.

* Use 2 spaces to indent.

-* Max 100 characters per line.
+* Use at most 100 characters per line.

-  (Right-align tensor shape layout comments at column 120.
-  Please see below.)
+  Lines longer than 100 characters typically wrap unfavorably
+  when viewed in Github's pretty printer.
-#### Function indentation
+#### Formatting function declarations and definitions

Short function headers can go on one line.

Do not insert a newline between the parenthesis
that closes the function's parameters and
the curly bracket that opens the function's body.

```c++
int short_name(int x, int y) {
  return x + y;
}
```

If the function name and its parameters are too long to fit on one line,
break the line immediately after the opening parenthesis
that starts the parameter list. Then, double-indent the parameters
to distinguish them from the body of the function.

```c++
void indeed_my_fellowbeings_this_function_name_is_unusually_long(
    std::uint32_t foo,        // parameters are double-indented
    std::uint32_t const* bar,
    TypeA a,
    TypeB b,
    TypeC c) {                // the ) and { go on the same line still
  auto d = body_of_the_function(a, b, c);  // body is single-indented
  // ... more code ...
}
```
For a constructor with a long parameter list,
break the line after the opening parenthesis, just as with other functions.
Align the colon that starts the constructor's initializer list
flush with the commas on the following lines.

As with functions, double-indent the parameters
to distinguish them from the constructor body.
Here is an example.

```c++
class YesTheCommunityAgreesThatTheNameOfThisClassIsIndeedExtremelyLong {
public:
  CUTLASS_HOST_DEVICE
  YesTheCommunityAgreesThatTheNameOfThisClassIsIndeedExtremelyLong(
      int this_is_the_first_parameter_and_its_name_is_long,
      int this_is_the_second_parameter_and_its_name_is_also_long,
      int this_is_the_third_parameter_and_its_name_is_long_too)
    : x_(this_is_the_first_parameter_and_its_name_is_long)
    , y_(this_is_the_second_parameter_and_its_name_is_also_long)
    , z_(this_is_the_third_parameter_and_its_name_is_long_too) {
    // constructor body
    // more of the constructor body
  }

private:
  int x_ = 0;
  int y_ = 0;
  int z_ = 0;
};
```
#### Formatting function calls

When calling a function or function object with a long name,
break the line right after the invoking open parenthesis.
-Here is an example.
+Here are some examples.

```c++
detail::very_long_function_object_name<TemplateArgument>{}(
    params.long_parameter_name, some_operator.another_long_function_name());

detail::an_even_longer_function_object_name<TemplateArgument1, TemplateArgument2>{}(
    params.long_parameter_name, some_operator.long_member_function_name(),
    another_operator.another_long_member_function_name(x, y, z));
```
-When declaring functions, indent function parameters like this.
-
-```c++
-void possibly_an_unusually_long_function_name(
-    std::uint32_t foo
-    std::uint32_t const* bar,
-    TypeA a,
-    TypeB b,
-    TypeC c) {
-  // ... the function's body ...
-}
-```
-
-A newline should not be inserted between the parenthesis
-that closes the function's parameters and the curly bracket
-that opens the function's body. Note the double indent for function parameters.
#### If-else brackets and spacing

-* Always use braces with conditionals such as `if`.
+* Always use braces with conditionals such as `if`,
+  even if the body is a single line.

* Use a space after control flow keywords
  such as `if`, `for`, and `while`.
@@ -181,13 +248,14 @@ that opens the function's body. Note the double indent for function parameters.
of an `if` branch, and the `else` keyword.

```c++
-if (condition) {
+if (condition) {  // space after if, and between ) and {
  // ... code ...
-}
+}  // newline after }
else {
  // ... other code ...
}

// space after keyword for
for (int k = 0; k < num_iters; ++k) {
  // ... still more code ...
}
@@ -244,7 +312,6 @@ and not this.
int const &var;
int const *var;
```
#### Avoid calling functions "fast" or "optimized"

Putting words like "fast" or "optimized"
@@ -395,6 +462,9 @@ Sometimes a function needs to return multiple values. In that case, consider th
for all the types that work in `std::tuple`.
CuTe's documentation explains.)

3. Resort to "returning" multiple values by output references
   only if performance requires it.

Here is an example of the struct approach for named values.
For a comparable example in the C++ Standard,
please see [`std::allocate_at_least`](https://en.cppreference.com/w/cpp/memory/allocate_at_least),
@@ -655,6 +725,158 @@ private:
};
```

#### For code reuse, prefer composition over inheritance

* [C++ Core Guidelines C.129](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c129-when-designing-a-class-hierarchy-distinguish-between-implementation-inheritance-and-interface-inheritance): "When designing a class hierarchy, distinguish between implementation inheritance and interface inheritance"
* [C++ Core Guidelines ES.63](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-slice): "Don't slice"

Suppose that a class hierarchy exists entirely for implementation convenience, so that implementers can reuse code and "program by difference" (changing or adding only what's different from the base class). In the example below, both `PipelineA` and `PipelineB` are used by themselves. `PipelineB` inherits from `PipelineA` just to avoid duplicating code. There are no virtual member functions, and users don't expect to rely on run-time polymorphism.

```c++
class PipelineA {
public:
  PipelineA(Arg0 arg0, Arg1 arg1)
    : arg0_(arg0), arg1_(arg1)
  {}

  void producer_acquire(uint32_t stage, uint32_t phase, uint32_t skip_wait) {
    // ... implementation ...
  }

  void consumer_release(uint32_t stage, uint32_t skip) {
    // ... implementation ...
  }

private:
  Arg0 arg0_;
  Arg1 arg1_;
};

class PipelineB : public PipelineA {
public:
  PipelineB(Arg0 arg0, Arg1 arg1, Arg2 arg2)
    : PipelineA(arg0, arg1), arg2_(arg2)
  {}

  // Reuse PipelineA::producer_acquire via inheritance

  // Override PipelineA::consumer_release
  void consumer_release(uint32_t stage, uint32_t skip) {
    // ... some other implementation, not invoking parent ...
  }

private:
  Arg2 arg2_;
};
```

The problem with public inheritance here is that `PipelineB` is NOT a (versus "is-a," i.e., substitutable-as) `PipelineA`. In particular, the following code would be incorrect.

```c++
void consume_and_release_pipeline(PipelineA* parent) {
  // ... code ...
  parent->consumer_release(stage, skip);
  // ... code ...
}

void use_pipeline( /* other args */ ) {
  // ... code ...
  PipelineB child{arg0, arg1, arg2};
  // ... code ...

  // WRONG!!! SLICES CHILD TO PARENT!!!
  consume_and_release_pipeline(&child);  // BAD

  // ... code ...
}
```
`PipelineA::consumer_release` is not a virtual member function, so `consume_and_release_pipeline` would not actually be polymorphic, as callers might expect from an interface that takes a base class pointer. What's worse is that the resulting slicing could violate `PipelineB`'s invariants, thus putting it in an incorrect state.

The most straightforward way to reuse code would be to change from inheritance (is-a) to composition (has-a).
```c++
namespace detail {

// Implementation class; not for users
class PipelineImpl {
public:
  PipelineImpl(Arg0 arg0, Arg1 arg1)
    : arg0_(arg0), arg1_(arg1)
  {}

  void producer_acquire(uint32_t stage, uint32_t phase, uint32_t skip_wait) {
    // ... implementation ...
  }

  void consumer_release(uint32_t stage, uint32_t skip) {
    // ... implementation ...
  }

private:
  Arg0 arg0_;
  Arg1 arg1_;
};

} // namespace detail

class PipelineA {
public:
  PipelineA(Arg0 arg0, Arg1 arg1)
    : impl_(arg0, arg1)
  {}

  void producer_acquire(uint32_t stage, uint32_t phase, uint32_t skip_wait) {
    impl_.producer_acquire(stage, phase, skip_wait);
  }

  void consumer_release(uint32_t stage, uint32_t skip) {
    impl_.consumer_release(stage, skip);
  }

private:
  detail::PipelineImpl impl_;
};

// A second kind of pipeline.
// Note that this does NOT inherit from PipelineA!
// The two pipeline classes have the same compile-time interface
// (for compile-time polymorphism), but do not belong in an
// inheritance hierarchy (as would imply run-time polymorphism).
class PipelineB {
public:
  PipelineB(Arg0 arg0, Arg1 arg1, Arg2 arg2)
    : impl_(arg0, arg1), otherTwo_(arg2)
  {}

  void producer_acquire(uint32_t stage, uint32_t phase, uint32_t skip_wait) {
    impl_.producer_acquire(stage, phase, skip_wait);
  }

  void consumer_release(uint32_t stage, uint32_t skip) {
    // this class doesn't actually use impl_ here
    otherTwo_.other_action(stage, skip);
    // ... some other code not using impl_ ...
  }

private:
  detail::PipelineImpl impl_;
  OtherTwo otherTwo_;
  // ... other member data ...
};
```
This design prevents users at compile time from incorrectly assuming that `PipelineB` is a `PipelineA`. Implementers continue to get compile-time polymorphism, as long as `PipelineA` and `PipelineB` implement the same compile-time interface.

##### Behavioral subtyping

Another reason to avoid public inheritance would be if the public member functions of `PipelineA` and `PipelineB` have different behavior, such that the invariants satisfied by the member functions of the base class `PipelineA` are not satisfied by the correspondingly named member functions of the subclass `PipelineB`. For example, suppose that both classes have a public `producer_arrive` member function, but for `PipelineA` this issues a producer arrival only for its own block, whereas for `PipelineB` it issues a producer arrival for all blocks in the cluster. Again, `PipelineB` "is-not-a" `PipelineA`. The child class doesn't just add behavior onto the parent class; it has completely different behavior. Thus, it fails to satisfy behavioral subtyping: invariants of the parent class's member functions are not satisfied by the child class. Behavioral subtyping is especially important when reasoning about already difficult things like parallel synchronization. The inheritance design would give developers the false impression that `PipelineB` just adds behavior atop `PipelineA`, whereas in fact developers would need to understand both pipeline classes completely to build a correct mental model of their behavior.

The fix is the same: use composition, not inheritance. As [C++ Core Guidelines C.120](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c120-use-class-hierarchies-to-represent-concepts-with-inherent-hierarchical-structure-only) explains: "Use class hierarchies to represent concepts with inherent hierarchical structure (only)."

1. "Make sure the idea represented in the base class exactly matches all derived types and there is not a better way to express it than using the tight coupling of inheritance."
2. "Do not use inheritance when simply having a data member will do."
#### Use scoped enums

Use scoped enums (a C++11 feature) for enumerated types.
@@ -765,18 +987,119 @@ Use `#pragma once` to guard all headers.

### CuTe Layout Comments

-* Right align CuTe layout comments at column 120.
+* Right-align tensor shape layout comments at column 120.
-* If layout comment is too long do your best to align it.
+* If the layout comment is too long and there are many related tensors
+  that the reader should read together,
+  try to align the layout comments of related tensors.

Here are a couple examples.
```c++
Tensor my_tensor = make_tensor<Type>(Layout<Shape<_2,_2>, Stride<_1,_2>>{});                  // (2,2):(1,2)

// Related tensors
Tensor my_tensor1         = make_tensor<Type>(ThisIsAVeryComplicatedLayoutWithAVeryLongName); // ((Mode0_0,Mode0_1,Mode0_2),Mode1,Mode2,Mode3)
Tensor my_tensor2_related = make_tensor<Type>(ThisIsAVeryComplicatedLayoutWithAVeryLongName); // ((Mode0_0,Mode0_1,Mode0_2),Mode1,Mode2,Mode3)

Tensor mC   = make_tensor(make_gmem_ptr(params.ptr_C),   make_shape(M,N), params.dC);         // (M,N)
Tensor mD   = make_tensor(make_gmem_ptr(params.ptr_D),   make_shape(M,N), params.dD);         // (M,N)
Tensor mAux = make_tensor(make_gmem_ptr(params.ptr_Aux), make_shape(M,N), params.dAux);       // (M,N)

auto thr_mma = tiled_mma.get_thread_slice(thread_idx);
Tensor tCgD   = thr_mma.partition_C(gD);                                                      // (VEC,THR_M,THR_N)
Tensor tCgC   = thr_mma.partition_C(gC);                                                      // (VEC,THR_M,THR_N)
Tensor tCgAux = thr_mma.partition_C(gAux);                                                    // (VEC,THR_M,THR_N)
```
### Warnings

CUTLASS code aims to build free of warnings.

#### Spurious warnings

Some compilers, or some versions of a compiler, emit spurious warnings, that is, "false positives" for perfectly fine code. While such code is correct, the warnings can obscure errors. Users also may report warnings as bugs, and processing those bugs takes developer time away from other tasks. Thus, it's good to try to "fix" the warnings, if doing so wouldn't make the code worse.

#### Missing return statement

GCC 10 (but not 7.5, 9.4.0, or 11) has trouble deducing that a function with `auto` return type, whose returns all live in an `if constexpr` ... `else` statement, must actually return. As a result, GCC emits spurious "missing return statement" build warnings. Such functions have one of two forms: `if constexpr` ... `else` where `else` returns, and `if constexpr` ... `else` where `else` is meant to fail at compile time. Here is an example of the first form.
```c++
|
||||
template<class T>
|
||||
constexpr auto first_form(T t) {
|
||||
if constexpr (some_condition_v<T>) {
|
||||
return some_function(t);
|
||||
}
|
||||
else if constexpr (another_condition_v<T>) {
|
||||
return another_function(t);
|
||||
}
|
||||
else {
|
||||
return yet_another_function(t);
|
||||
}
|
||||
}
|
||||
```

In this form, the `if constexpr` ... `else` sequence of branches covers all possibilities. Here is an example of the second form.

```c++
template<class T>
constexpr auto second_form(T t) {
  if constexpr (some_condition_v<T>) {
    return some_function(t);
  }
  else if constexpr (another_condition_v<T>) {
    return another_function(t);
  }
  else {
    static_assert(sizeof(T) < 0, "This branch always fails");
  }
}
```

In this form, the `else` branch has a `static_assert` that is meant always to fail if the `else` branch is taken, such as `static_assert(sizeof(T) < 0)`. (Note that we cannot use `static_assert(false)` here, because it would ALWAYS fail at compile time, even if the `else` branch is not taken. C++23 fixes this behavior, but CUTLASS currently requires that its code be compatible with C++17. As a result, CUTLASS includes a `dependent_false<T>` library facility that you can use in place of the always-`false` test `sizeof(T) < 0`.)
One can suppress "missing return statement" warnings for both forms by invoking CUTLASS' function-like macro `CUTE_GCC_UNREACHABLE()`. When building with GCC, this invokes the GCC-specific built-in function `__builtin_unreachable()`. Actually calling this function is undefined behavior, so using this lets the programmer declare that the code path calling that function will never be taken. (C++23 introduces the `std::unreachable()` function, which achieves the same goal. Again, though, CUTLASS cannot currently use C++23 library functions.) Here is an example of how to use `CUTE_GCC_UNREACHABLE()`.

```c++
template<class T>
constexpr auto second_form(T t) {
  if constexpr (some_condition_v<T>) {
    return some_function(t);
  }
  else if constexpr (another_condition_v<T>) {
    return another_function(t);
  }
  else {
    static_assert(sizeof(T) < 0, "This branch always fails");
  }
  CUTE_GCC_UNREACHABLE();
}
```

This macro should only be used if it is needed to suppress spurious warnings. Also, this function should not be used if the developer is not sure whether the code exhaustively tests all possibilities. For example, some functions may look like this.

```c++
template<class T>
constexpr auto possibly_nonexhaustive(T t) {
  if constexpr (some_condition_v<T>) {
    return some_function(t);
  }
  else if constexpr (another_condition_v<T>) {
    return another_function(t);
  }

  // NOTE lack of unadorned "else" here
}
```

This is a good opportunity to review the function. If the branches are obviously meant to be exhaustive, you can add an `else` branch with a `static_assert` (see above for how to express this). If you're not sure, leave it alone and let the compiler issue warnings.
#### Unused variable

Some compilers may emit spurious "unused variable" warnings for some variable declarations, where the variable is only used inside a `decltype` expression in an `if constexpr` test. Marking such variables as `[[maybe_unused]]` (a standard C++17 attribute) suppresses these warnings. Again, please only do this if you're sure that the code is right.
### CUDA C++ style

#### CUDA Built-in Variables
## Launching a GEMM kernel in CUDA

**Example:** launch a mixed-precision GEMM targeting Turing Tensor Cores.

_Note, this example uses CUTLASS Utilities. Be sure `tools/util/include` is listed as an include path._

```c++
// ...

//
// Launch GEMM on the device
//

status = gemm_op({
  {M, N, K},
  {ptrA, lda},            // TensorRef to A device tensor
  // ...
});
```

Note, the above could be simplified as follows using helper methods defined in `…`.

```c++
//
// Use the TensorRef returned by HostTensor::device_ref().
//

status = gemm_op({
  {M, N, K},
  // ...
});
```
## Launching a GEMM kernel using CUTLASS 3.0 or newer

**Example:** launch a mixed-precision GEMM targeting Hopper Tensor Cores.

```c++
#include "cutlass/cutlass.h"
// ...

  using TilesShape = Shape<_128,_128,_64>;                              // Threadblock-level tile size
  using ClusterShape = Shape<_1,_2,_1>;                                 // Shape of the threadblocks in a cluster
  using StageCountType = cutlass::gemm::collective::StageCountAuto;     // Stage count maximized based on the tile size
  using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto; // Kernel to launch based on the default setting in the Collective Builder

  using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
      ArchTag, OperatorClass,
      /* ... */

  // ...

  StrideC stride_C;
  StrideD stride_D;

  stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1});
  stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1});
  stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1});
  stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1});

  block_A.reset(M * K);
  block_B.reset(K * N);
  // ...

  //
  // Launch GEMM on the device
  //

  status = gemm_op({
    cutlass::gemm::GemmUniversalMode::kGemm,
    {M, N, K},
    // ...
  });
  // ...
```
The [CUTLASS Library](/tools/library) defines an API for managing and executing collections of compiled
kernel instances and launching them from host code without template instantiations in client code.

The host-side launch API is designed to be analogous to BLAS implementations for convenience, though its
kernel selection procedure is intended only to be functionally sufficient. It may not launch the
optimal tile size for a given problem. It chooses the first available kernel whose data types,
layouts, and alignment constraints satisfy the given problem. Kernel instances and a data structure
describing them are completely available to client applications which may choose to implement their
own selection logic.

… by several SDK examples.

* [11_planar_complex_array](/examples/11_planar_complex_array/planar_complex_array.cu)

The CUTLASS Library defines enumerated types describing numeric data types, matrix and tensor
layouts, math operation classes, complex transformations, and more.

Client applications should specify [`tools/library/include`](/tools/library/include) in their
include paths and link against libcutlass_lib.so.

The CUTLASS SDK example [10_planar_complex](/examples/10_planar_complex/CMakeLists.txt) specifies
its dependency on the CUTLASS Library with the following CMake command.
```
target_link_libraries(
  # ...
)
```
```c++
// ...

//
// CUTLASS Library call to execute device GEMM
//

cutlass::library::Handle handle;

// ...

  ptrD,     // pointer to D matrix in device memory
  ldd       // leading dimension of D matrix
);

if (status != cutlass::Status::kSuccess) {
  return -1;
}

// ...
}
```

# Example CMake Commands

To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
```
The above command line generates about twenty thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
Compiling thousands of kernels for three different architectures is time-consuming. Additionally, this would also result
in a large binary size and, on some platforms, cause the linker to fail when building the library.

Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size
and avoiding linker limitations on some platforms.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
```

It is advised to only compile CUTLASS kernels for the NVIDIA architectures one plans on running. Furthermore, kernels
can be selectively included in the CUTLASS Library by specifying filter strings and wildcard characters when executing CMake.

Several examples are defined below for convenience. They may be combined as a comma-delimited list.
Compiling only the kernels desired reduces compilation time.
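For instance, two filter strings can be passed as one comma-delimited list. This is a hypothetical combination that simply reuses filter strings from the examples in this document; adjust the filters and architectures to your own needs.

```bash
# Hypothetical combination of two filters shown elsewhere in this document:
# FP16 forward-propagation kernels plus optimized FP16 wgrad kernels.
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS='s16816fprop_*_f16,tensorop*s*wgrad_optimized_f16'
```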
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
```

**Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
```