@@ -199,9 +199,9 @@ which produces the following output for the above examples.
4 6 5 7
```

-The multi-indices within the `layout_4x4` example are handled as expected and interpreted as a rank-2 layout.
+The multi-indices within the `layout_2x4` example are handled as expected and interpreted as a rank-2 layout.

-Note that for `layout_1x4`, we're using a 1-D coordinate for a 2-D multi-index in the second mode. In fact, we can generalize this and treat all of the above layouts as 1-D layouts. For instance, the following `print1D` function
+Note that for `layout_2x4`, we're using a 1-D coordinate for a 2-D multi-index in the second mode. In fact, we can generalize this and treat all of the above layouts as 1-D layouts. For instance, the following `print1D` function

```c++
template <class Shape, class Stride>
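// NOTE (editorial): the hunk above is cut off right after the template header.
// As a hedged sketch -- not the original file's code -- print1D could walk the
// layout with a single linear index, relying on the layout's 1-D operator() to
// map index -> offset. Assumes <cstdio> and <cute/layout.hpp> are available.
void print1D(cute::Layout<Shape, Stride> const& layout)
{
  for (int i = 0; i < cute::size(layout); ++i) {
    printf("%3d  ", int(layout(i)));   // offset of the i-th linear index
  }
  printf("\n");
}
```

Called on `layout_2x4` above, such a helper would print `size(layout)` offsets in the colexicographic order of its 1-D index.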
@@ -52,12 +52,6 @@ For example, `print_layout` can display a rank-2 layout in a table
It has an overload taking a rank-2 matrix layout and a thread layout,
that displays a table with the mapping between threads and values.

Some CuTe types might not have overloads for `print`,
but there are other ways to print their contents.
For example, copy atoms and mma atoms
(see elsewhere in this tutorial)
have a `print_all()` member function.

### Printing LaTeX output

The `cute::print_latex` function works like `cute::print`,
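To make the printing utilities concrete, here is a small host-side sketch (assuming `<cute/tensor.hpp>` and a recent CuTe; the exact output formatting may differ between versions):

```c++
#include <cstdio>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  // A 4x8 layout with row-major strides.
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<8>{}, Int<1>{}));

  print(layout);          // prints the shape:stride description
  printf("\n");
  print_layout(layout);   // prints a 2-D table of the coordinate -> index mapping
  // print_latex(layout); // would emit LaTeX source for the same table
  return 0;
}
```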
@@ -261,7 +255,7 @@ The complement B of a layout A with respect to an integer M satisfies the follow

1. $A$ and $B$ are *disjoint*: $A(x) \neq B(x)$ for all $x \neq 0$ in the domain of $A$.

-2. B is *ordered*: $`B(x-1) < B(x)`$ for all $x$ in $\{0, 1, \dots, size(B) - 1\}$.
+2. B is *ordered*: $B(x-1) \lt B(x)$ for all $x$ in $\{0, 1, \dots, size(B) - 1\}$.

3. B is *bounded* by M: $size(B) \geq M / size(A)$, and $cosize(B) \leq floor(M / cosize(A)) * cosize(A)$.
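As a concrete illustration of these three properties, a small host-side check could look like the following (a sketch that assumes `cute::complement` from `<cute/layout.hpp>`; the exact printed form of the result may differ):

```c++
#include <cute/layout.hpp>

int main() {
  using namespace cute;
  auto A = make_layout(make_shape(Int<4>{}), make_stride(Int<2>{}));  // (4):(2), image {0,2,4,6}
  auto B = complement(A, Int<24>{});  // complement of A with respect to M = 24

  print(B);  // expected to be something like (2,3):(1,8): disjoint from A except at 0,
             // strictly increasing, and large enough that size(A) * size(B) >= 24
  return 0;
}
```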
@@ -24,8 +24,8 @@ and an `MMA_Traits` struct templated on the Operation struct type.
An "Operation" struct exposes the PTX instruction
for that specific operation.
It defines the arguments and interface it expects.
Operation structs have minimal software dependencies --
-it does not use layouts, tensors, or non-standard numeric data types.
+they do not use layouts, tensors, or non-standard numeric data types.
Different structs have different names
that describe what the MMA instruction does.
We will explain the naming scheme below.
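For orientation, an Operation struct is essentially a thin wrapper around a single inline-PTX instruction. The sketch below is loosely modeled on the SM80 16x8x8 F16 MMA; treat the struct name and members as illustrative rather than as the library's exact definitions:

```c++
#include <cstdint>

// Hedged sketch of an "Operation" struct: plain register types plus an fma()
// that issues the PTX mma.sync instruction. Requires compiling for sm_80+.
struct SM80_16x8x8_F16F16F16F16_TN_sketch {
  using DRegisters = uint32_t[2];
  using ARegisters = uint32_t[2];
  using BRegisters = uint32_t[1];
  using CRegisters = uint32_t[2];

  __device__ static void
  fma(uint32_t      & d0, uint32_t      & d1,
      uint32_t const& a0, uint32_t const& a1,
      uint32_t const& b0,
      uint32_t const& c0, uint32_t const& c1) {
    asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3}, {%4}, {%5,%6};\n"
      : "=r"(d0), "=r"(d1)
      : "r"(a0), "r"(a1), "r"(b0), "r"(c0), "r"(c1));
  }
};
```

Note that the struct depends only on built-in register types, which is exactly the "minimal software dependencies" property described above.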
@@ -226,13 +226,22 @@ as part of the kernel design. A thread block is partitioned into two sets of warps

[*Producer* warp group (DMA)](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) waits for the shared memory buffers to be signaled as [empty](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) by the *consumer* warp group using the newly added **Async Pipeline class** ([refer](/media/docs/pipeline.md)). Once the data is written into the shared memory, TMA also updates the barrier associated with that stage to notify affected threads that the buffer has been [filled](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp). The [*Consumer* warp group (MMA)](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp), on the other hand, waits for the *producer* warp group to signal that the buffer is [filled](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) and then launches tensor core MMA operations. Finally, the *consumer* warp group [releases](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) the buffers for the next set of TMA loads to happen.
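The hand-off described above can be summarized by the following structural sketch. This is pseudocode rather than a compilable kernel: the pipeline method names follow the Async Pipeline interface referenced above, while `WarpGroupRole`, `tma_load_tile`, and the state variables are placeholders for what the collective actually does.

```c++
// Structural sketch of a warp-specialized mainloop (illustrative, not CUTLASS source).
if (warp_group_role == WarpGroupRole::Producer) {           // DMA warp group
  for (int k_tile = 0; k_tile < k_tile_count; ++k_tile) {
    pipeline.producer_acquire(smem_pipe_write);             // wait until this stage is empty
    tma_load_tile(k_tile, smem_pipe_write);                 // TMA fills smem; the stage barrier
                                                            // is updated when the copy completes
    ++smem_pipe_write;                                      // advance to the next stage
  }
} else {                                                    // Consumer (MMA) warp group
  for (int k_tile = 0; k_tile < k_tile_count; ++k_tile) {
    pipeline.consumer_wait(smem_pipe_read);                 // wait until this stage is filled
    // issue tensor core MMA instructions on the shared memory fragments of this stage
    pipeline.consumer_release(smem_pipe_read);              // signal the stage empty again
    ++smem_pipe_read;
  }
}
```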

-**Warp-Specialized Persistent kernel design**
+**Warp-Specialized Persistent Cooperative kernel design**

-Another flavor of Warp Specialized kernel design being introduced starting with Hopper is the [*Warp-Specialized Persistent*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp) kernel. Like Warp Specialized kernel the concepts of warp groups and barrier synchronization between warp groups remain the same in the persistent design. The distinctive feature of the Warp-Specialized Persistent kernel are the following :
+Another flavor of Warp-Specialized kernel design introduced starting with Hopper is the [*Warp-Specialized Persistent Cooperative*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp) kernel. Like the Warp-Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the cooperative design.
+The distinctive features of the Warp-Specialized Persistent Cooperative kernel are the following:

* Persistent thread blocks launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](/include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
-* Presence of one two *consumer* warp groups which allows for *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization.
+* Presence of two *consumer* warp groups cooperating on the same output tile by splitting the tile in half across the M dimension. This allows larger tile sizes to be enabled - since the register pressure per *consumer* warp group is reduced - and hence improves performance.

-Each *consumer* warp group is assigned a different output tile. The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order. Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks is managed using the new [*Tile Scheduler*](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the available number of available SMs to compute a valid scheduling of the output tiles to launched thread blocks.
+Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks is managed using the new [*Tile Scheduler*](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to launched thread blocks.

**Warp-Specialized Persistent Ping-Pong kernel design**

The third kernel design is the [*Warp-Specialized Persistent Ping-Pong*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp) kernel.
Like the Warp-Specialized Persistent Cooperative kernel, the concepts of warp groups, barrier synchronization between warp groups, and the shape of the grid launch remain the same in the persistent ping-pong design.
The distinctive features of the Warp-Specialized Persistent Ping-Pong kernel are the following:
* The two *consumer* warp groups are assigned a different output tile using the Tile Scheduler. This allows the *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization.
* The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order.

# Resources

@@ -277,7 +277,7 @@ warp-specialized mainloop implementation:
template<
  int Stages_,
  class ClusterShape_ = Shape<_1,_1,_1>,
-  class KernelSchedule = KernelTmaWarpSpecialized
+  class KernelSchedule = KernelTmaWarpSpecializedCooperative
>
struct MainloopSm90TmaGmmaWarpSpecialized {
  constexpr static int Stages = Stages_;
@@ -299,7 +299,8 @@ it needs to be run, or exposes a template API that lets the user pick a subset o
struct KernelMultistage { };
struct KernelTma { };
struct KernelTmaWarpSpecialized { };
-struct KernelTmaWarpSpecializedPersistent { };
+struct KernelTmaWarpSpecializedPingpong { };
+struct KernelTmaWarpSpecializedCooperative { };
```

- A single kernel schedule can support multiple mainloop implementations. For example,
@@ -308,7 +309,7 @@ architectures such as `MainloopSm70TwoStage`, `MainloopSm80CpAsyncUnpredicated`,

- A single mainloop can be composed with multiple
possible kernel schedules. For example, the `MainloopSm90TmaGmmaWarpSpecialized` can be
-composed with either the `KernelTmaWarpSpecialized` or `KernelTmaWarpSpecializedPersistent`
+composed with any of the `KernelTmaWarpSpecialized`, `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative`
kernel schedules.
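In user code, the schedule is typically pinned when building the collective mainloop. The following instantiation is a sketch based on the CUTLASS 3.x `CollectiveBuilder` interface; the element types, alignments, tile shape, and cluster shape are illustrative choices rather than recommendations:

```c++
// Assumes #include "cutlass/gemm/collective/collective_builder.hpp"
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,      // A: element, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,      // B: element, layout, alignment
    float,                                                  // accumulator element
    cute::Shape<cute::_128, cute::_128, cute::_64>,         // threadblock tile (M, N, K)
    cute::Shape<cute::_1, cute::_2, cute::_1>,              // cluster shape
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::KernelTmaWarpSpecializedCooperative      // kernel schedule tag
  >::CollectiveOp;
```

Swapping the last template argument for `KernelTmaWarpSpecializedPingpong` (or `KernelTmaWarpSpecialized`) selects a different specialization of the same mainloop.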
As [discussed in the CUTLASS 3.0 design documentation](cutlass_3x_design.md), adopting tag
@@ -487,7 +488,7 @@ any of various `include/cutlass/gemm/kernel/{arch_tag}*.hpp` files in the direct
Which specialization to dispatch to is decided through the dispatch policy's `Schedule` type.

For example, the header file
-[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp)
+[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp)
has a specialization of `kernel::GemmUniversal` for Hopper
that uses a warp-specialized mainloop with a persistent scheduling algorithm,
while the header file
@@ -19,16 +19,20 @@ This release of CUTLASS contains several artifacts related to convolution.

# Implicit GEMM Algorithm

-2-D convolution may be mapped to matrix multiply by forming a _convolution matrix_ containing
-elements of the activations tensor then multiplying this by a matrix formed from the filters tensor.
-The earliest form of this algorithm construct the convolution matrix explicitly via an operation
+2-D convolution may be mapped to matrix multiply
+by first forming a _convolution matrix_ containing elements of the activations tensor,
+then multiplying this by a matrix formed from the filters tensor.
+The earliest form of this algorithm constructs the convolution matrix explicitly via an operation
conventionally referred to as `im2col`. The resulting matrix replicates each activation element by a factor
equal to the filter size, consuming additional storage capacity and memory bandwidth.

-The _implicit GEMM_ algorithm is a variation on the blocked, hierarchical GEMM computation in CUDA
-that instead forms tiles of the convolution matrix on the fly as data is loaded from global memory
-into Shared Memory by carefully updating pointers and predicates. Once the convolution matrix is
-formed in Shared Memory, the existing components computing warp-level GEMM accumulate the result of
+The _implicit GEMM_ algorithm is a variation on the blocked, hierarchical GEMM computation in CUDA.
+Instead of constructing the convolution matrix explicitly,
+it forms tiles of the convolution matrix on the fly
+as data are loaded from global memory into Shared Memory
+by carefully updating pointers and predicates.
+Once the convolution matrix is formed in Shared Memory,
+the existing warp-level GEMM components accumulate the result of
convolution and update the output tensor.
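For readers who have not seen `im2col`, the following host-side sketch shows the explicit construction that implicit GEMM avoids. It assumes NHWC activations, unit stride, and no padding, and it is illustrative code rather than anything taken from CUTLASS:

```c++
#include <cstddef>
#include <vector>

// Explicit im2col: builds the (N*P*Q) x (R*S*C) convolution matrix that
// implicit GEMM only ever materializes tile-by-tile in shared memory.
std::vector<float> im2col_nhwc(std::vector<float> const& x,   // activations, size N*H*W*C
                               int N, int H, int W, int C,
                               int R, int S) {                 // filter extent
  int P = H - R + 1;                                           // output height (stride 1, no pad)
  int Q = W - S + 1;                                           // output width
  std::vector<float> gemm_a(std::size_t(N) * P * Q * R * S * C);
  for (int n = 0; n < N; ++n)
    for (int p = 0; p < P; ++p)
      for (int q = 0; q < Q; ++q)            // GEMM row    m = (n*P + p)*Q + q
        for (int r = 0; r < R; ++r)
          for (int s = 0; s < S; ++s)
            for (int c = 0; c < C; ++c) {    // GEMM column k = (r*S + s)*C + c
              std::size_t m = (std::size_t(n) * P + p) * Q + q;
              std::size_t k = (std::size_t(r) * S + s) * C + c;
              std::size_t src = ((std::size_t(n) * H + (p + r)) * W + (q + s)) * C + c;
              gemm_a[m * (std::size_t(R) * S * C) + k] = x[src];
            }
  return gemm_a;
}
```

Each activation element is copied roughly R*S times, which is exactly the storage and bandwidth amplification the text above attributes to explicit `im2col`.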
This section describes the structure of an efficient Implicit GEMM Convolution CUDA kernel

@@ -158,7 +162,7 @@ To get the best performance, the following parameters are recommended.
- Channel count (C) is a multiple of 32 elements
- Filter count (K) is a multiple of 32 elements

-This enables 128-bit vector memory acceses which lead to efficient CUDA kernels. Smaller alignment is supported even on tensor cores by setting AlignmentA and AlignmentB in conv::kernel::DefaultConv2dFprop, but the performance is lower than 128-bit aligned tesnors.
+This enables 128-bit vector memory accesses (for example, 32 four-bit or 8 half-precision elements per access), which lead to efficient CUDA kernels. Smaller alignment is supported even on tensor cores by setting AlignmentA and AlignmentB in `conv::kernel::DefaultConv2dFprop`, but the performance is lower than with 128-bit aligned tensors.

# CUTLASS Device-level Convolution Operator

@@ -187,12 +191,12 @@ using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  SwizzleThreadBlock,                      // optional function to reorder threadblocks for locality
  NumStages,                               // number of pipeline stages in threadblock-scoped GEMM
  cutlass::arch::OpMultiplyAddSaturate,    // math operation on data of element a and b
-  cutlass::conv::IteratorAlgorithm::kOptimized  // globabl memory iterator algorithm
+  cutlass::conv::IteratorAlgorithm::kOptimized  // global memory iterator algorithm
>::Kernel
```

This template is intended to be generic and cover all feasible configurations. The example specifies
-the following concrete data types, layouts, and tile sizes.
+the following concrete data types, layouts, and tile shapes.

```c++
/// Define an Implicit GEMM convolution forward propagation (fprop) kernel
@@ -219,7 +223,7 @@ using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  SwizzleThreadBlock,                      // optional function to reorder threadblocks for locality
  2,                                       // number of pipeline stages in threadblock-scoped GEMM
  cutlass::arch::OpMultiplyAddSaturate,    // math operation on data of element a and b
-  cutlass::conv::IteratorAlgorithm::kOptimized  // globabl memory iterator algorithm
+  cutlass::conv::IteratorAlgorithm::kOptimized  // global memory iterator algorithm
>::Kernel
```
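Once the kernel type is defined, the device-level operator wraps it and is invoked along the following lines. This is a sketch: the problem size, tensor references, and epilogue scalars are assumed to be set up elsewhere in the application:

```c++
using ImplicitGemm = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;

ImplicitGemm implicit_gemm_op;

// problem_size, tensor_a/b/c/d, alpha, and beta are assumed to exist already.
typename ImplicitGemm::Arguments arguments{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  tensor_c.device_ref(),
  tensor_d.device_ref(),
  {alpha, beta}
};

cutlass::Status status = implicit_gemm_op.can_implement(arguments);
if (status == cutlass::Status::kSuccess) {
  status = implicit_gemm_op.initialize(arguments);
}
if (status == cutlass::Status::kSuccess) {
  status = implicit_gemm_op();   // launches the Implicit GEMM kernel
}
```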
@@ -227,7 +231,7 @@ That is, this computes 2D convolutional forward propagation with 4-bit integer i
Internal accumulation is performed using 32-bit integers (`int32_t`), and an elementwise linear combination operation
is performed on the output in single-precision floating point (`float`).

-The threadblock and warp-level tile sizes refer to the hierarhically blocked GEMM computation
+The threadblock and warp-level tile shapes refer to the hierarchically blocked GEMM computation
[described here](/media/docs/gemm_api.md). Larger tiles achieve greater reuse of data loaded through shared memory
but launch fewer CTAs and may not fully occupy the GPU for small problem sizes. Smaller tile configurations achieve
lower peak utilizations but may better match the number of SMs within the GPU for real-world workloads.
@@ -344,13 +348,13 @@ creating GEMM-A tile in shared memory.
- [conv2d_fprop_filter_tile_access_iterator_optimized.h](/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h) optimizes iterating over global memory and
creating GEMM-B tile in shared memory.

-The improvements covered by optimized iterators are:
-- (a) Precomputing kernel-invariant pointer deltas on the host
-- (b) Computing cta-invariant mask predicates on device-side iterator ctors
-- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.
-For example, _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ
-for activation iterator
+The improvements covered by optimized iterators are:

+a. Precomputing kernel-invariant pointer deltas on the host
+b. Computing cta-invariant mask predicates on device-side iterator ctors
+c. Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.

+For example, an _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ.
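Concretely, mapping a GEMM _M_ index back to an (n, p, q) output coordinate is a pair of divisions and remainders; the optimized iterators replace these divisions with the precomputed multiply-shift form provided by [fast divmod](/include/cutlass/fast_math.h). A plain-integer sketch of the mapping:

```c++
// Sketch of the GEMM M -> (n, p, q) mapping performed by the activation iterator.
// P and Q are the output height and width; the library performs the same arithmetic
// with fast_divmod rather than hardware division.
void gemm_m_to_npq(int m, int P, int Q, int& n, int& p, int& q) {
  n = m / (P * Q);
  int residual = m % (P * Q);
  p = residual / Q;
  q = residual % Q;
}
```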

**Pipelined mainloop** loads threadblock-scoped tiles from global memory into shared memory and then applies
CUTLASS warp-level GEMM operations to load from Shared Memory and issue instructions to Turing Tensor Cores.
@@ -483,7 +487,7 @@ inc_next[2] = (
}
```

-This allows only a simple lookup from the _delta table_ performed in device code in `Conv2dFpropActivationTileAccessIteratorOptimized::advance()`
+This allows only a simple lookup from the _delta table_ performed in device code in `Conv2dFpropActivationTileAccessIteratorOptimized::advance()`.

```c++
// cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h
@@ -516,17 +520,17 @@ void advance() {

```

-### Utilizing Tensor Cores
+### Making use of Tensor Cores

Turing Tensor Cores compute matrix multiply-accumulate operations efficiently by sharing data among all
threads within a warp. The following operations are supported.

| **Shape** | **A**   | **B**   | **C**   |
|-----------|---------|---------|---------|
| 8x8x32    | int4b_t | int4b_t | int32_t |
| 8x8x16    | int8_t  | int8_t  | int32_t |
| 16x8x8    | half    | half    | half    |
| 16x8x8    | half    | half    | float   |

Functionally, the Turing 8x8x32 matrix multiply operation distributes the _A_, _B_, and _C_ matrix across 32
threads within a warp according to the following illustration.
@@ -551,7 +555,7 @@ asm volatile(
  : "r"(A), "r"(B), "r"(C[0]), "r"(C[1]));
```

-To efficiently load data from Shared Memory into registers with the distribution among
+To load data efficiently from Shared Memory into registers with the distribution among
warps matching the above, the Turing GPU architecture introduces
[`ldmatrix`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix).
`ldmatrix` is the ultimate warp-cooperative instruction, as all threads contribute addresses to up to 32 row vectors of
@@ -652,8 +656,11 @@ CUTLASS structures this as several components:
## Unit Tests

Unit tests verify the functional behavior of each of the above components in a standalone CUDA kernel. This provides a
-convenient environment to (a.) inspect the template definition, (b.) showcase instantiation of use of these templates
-in device code, and (c.) assert functional correctness.
+convenient environment to

+a. inspect the template definition,
+b. showcase instantiation and use of these templates in device code, and
+c. assert functional correctness.

**Convolution unit tests**
- Device-wide convolution operator: [conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu)
@@ -149,7 +149,7 @@ if (thread_idx == 0 or thread_idx == 1) {
  // If any memory operations are involved, then we also need
  // to guarantee that writes are completed and visible to consumer(s).

-  pipeline.producer_commit(smem_pipe_write.index());
+  pipeline.producer_commit(smem_pipe_write);
  ++smem_pipe_write;
}
}
@@ -181,8 +181,7 @@ $ ./tools/profiler/cutlass_profiler --operation=gemm --help

GEMM

-[enum] --gemm_kind Variant of GEMM (gemm, batched, array, universal, planar_complex, planar_complex_array)
-[enum] --split_k_mode Variant of split K mode(serial, parallel)
+[enum] --gemm_kind Variant of GEMM (e.g. universal, gemm, planar_complex, planar_complex_array)
[int] --m,--problem-size::m M dimension of the GEMM problem space
[int] --n,--problem-size::n N dimension of the GEMM problem space
[int] --k,--problem-size::k K dimension of the GEMM problem space
@@ -191,9 +190,10 @@ GEMM
[tensor] --C Tensor storing the C operand
[scalar] --alpha,--epilogue::alpha Epilogue scalar alpha
[scalar] --beta,--epilogue::beta Epilogue scalar beta
[enum] --split_k_mode,--split-k-mode Variant of split K mode(serial, parallel)
[int] --split_k_slices,--split-k-slices Number of partitions of K dimension
[int] --batch_count,--batch-count Number of GEMMs computed in one batch
-[enum] --op_class,--opcode-class Class of math instruction (simt, tensorop, wmmatensorop, wmma)
+[enum] --op_class,--opcode-class Class of math instruction (simt, tensorop, wmmatensorop, wmma).
[enum] --accum,--accumulator-type Math instruction accumulator data type
[int] --cta_m,--threadblock-shape::m Threadblock shape in the M dimension
[int] --cta_n,--threadblock-shape::n Threadblock shape in the N dimension
@@ -225,9 +225,6 @@ Schmoo over accumulator types:

Run when A is f16 with column-major and B is any datatype with row-major (For column major, use column, col, or n. For row major use, row or t):
$ cutlass_profiler --operation=Gemm --A=f16:column --B=*:row

Profile a particular problem size with split K and parallel reduction:
$ cutlass_profiler --operation=Gemm --split_k_mode=parallel --split_k_slices=2 --m=1024 --n=1024 --k=128

Using various input value distribution:
$ cutlass_profiler --operation=Gemm --dist=uniform,min:0,max:3
$ cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
@@ -39,33 +39,33 @@ and function inlining.

### Constant Memory

Several CUTLASS template classes exhibit a pattern in which problem-specific internal state is known at kernel
launch time and remains invariant throughout the execution of a kernel. For example, tile iterators compute several
offsets based on the strides of the input tensor that are added to an internal pointer when loading the elements
of a tile. These are computed from the tensor stride and never updated; the per-thread internal state consists
only of the internal global memory pointer.

CUTLASS can take advantage of this CUDA grid-invariant property by constructing the object in host code and passing
a composed parameters structure to the kernel. This confers two benefits: (1.) invariant state is held in constant
memory, and (2.) there is no overhead to compute the initial state by each thread.

The design pattern in CUTLASS is for classes with nontrivial constructors to define `struct Params` as an inner class
which contains grid-invariant state. These should define a constructor and an `initialize()` method. The `Params`
structure should also include a data member corresponding to each data member in the parent class, so these too can
be properly constructed in host code. The parent class should define a constructor which accepts `Params const &` as
its first argument.
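A minimal sketch of this convention appears below. The class and member names are invented for illustration, and the `initialize()` method is omitted for brevity; only the shape of the pattern (a host-constructed `Params` mirroring the iterator's grid-invariant state) reflects the text above:

```c++
// Hedged illustration of the Params pattern; not an actual CUTLASS class.
class TileIterator {
public:
  // Grid-invariant state, constructed in host code and passed to the kernel,
  // where it typically resides in constant memory with the kernel parameters.
  struct Params {
    int64_t stride;          // tensor stride in elements
    int64_t advance_offset;  // precomputed pointer delta applied per iteration

    Params() = default;
    Params(int64_t stride_, int rows_per_iteration)
      : stride(stride_), advance_offset(stride_ * rows_per_iteration) {}
  };

private:
  Params const& params_;     // per-thread state reduces to a pointer plus the shared Params
  float const*  pointer_;

public:
  // The parent class accepts Params const& as its first constructor argument.
  __device__ TileIterator(Params const& params, float const* pointer, int thread_offset)
    : params_(params), pointer_(pointer + thread_offset) {}

  __device__ void operator++() { pointer_ += params_.advance_offset; }
};
```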

### Composable Shared Memory

Shared memory requires explicit effort by the programmer to allocate and de-allocate. CUTLASS follows the paradigm
introduced by [CUB](https://nvlabs.github.io/cub/) to define composed structures for storing data intended to be held
in shared memory. Any object requiring shared memory storage for itself or its data members should define a child
structure called `SharedStorage`. This holds data needed by the class and also instantiates `SharedStorage`
objects for each data member.

To be consistent, this pattern defines a convention in which classes define internal shared memory storage requirements.
Classes should consider all SharedStorage structures to be opaque other than their own child class. When the lifetimes
of child objects are known to be non-overlapping, `union`s may be used to alias multiple SharedStorage objects to the same
shared memory region and reduce overall shared memory capacity. Developers should carefully note that C++ `union` rules
require that they only access the most recently written ("active") member of the `union`; this differs from C rules.
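The aliasing technique mentioned at the end of this passage could look like the following sketch. The names are illustrative, and the `union` is legitimate only because the mainloop's and epilogue's shared memory lifetimes do not overlap:

```c++
// Hedged sketch of composed SharedStorage structures; not taken from a specific CUTLASS kernel.
struct Mainloop {
  struct SharedStorage {
    float smem_A[128 * 8];
    float smem_B[128 * 8];
  };
};

struct Epilogue {
  struct SharedStorage {
    float smem_accumulators[128 * 32];
  };
};

struct KernelSharedStorage {
  union {
    Mainloop::SharedStorage mainloop;   // "active" during the mainloop
    Epilogue::SharedStorage epilogue;   // "active" during the epilogue
  };
};

// Inside the kernel, the composed storage is typically mapped onto dynamic shared memory:
// extern __shared__ char smem_raw[];
// KernelSharedStorage& storage = *reinterpret_cast<KernelSharedStorage*>(smem_raw);
```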
@@ -80,7 +80,7 @@ Consequently, most loops within the CUTLASS GEMM implementation are specified by
is able to unroll the loop bodies, map array elements to registers, and construct an efficient instruction schedule.

All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
to unroll them.

```c++
int const kN = 8;
@@ -89,7 +89,7 @@ Array<float, kN> x; // Array we would like to store in reg
CUTLASS_PRAGMA_UNROLL                  // Directs the CUDA compiler to unroll this loop.
for (int idx = 0; idx < kN; ++idx) {   // Loop has constant number of iterations.

  x[idx] = float(idx);                 // Indirect access by induction variable results in
                                       // direct register access.
}
```
@@ -159,16 +159,13 @@ void possibly_an_unusually_long_function_name(
  std::uint32_t const* bar,
  TypeA a,
  TypeB b,
-  TypeC c)
-{
+  TypeC c) {
  // ... the function's body ...
}
```

-For function definitions only,
-break the line between the parenthesis
-that closes the function's parameters,
-and the curly bracket
+A newline should not be inserted between the parenthesis
+that closes the function's parameters and the curly bracket
that opens the function's body.

#### If-else brackets and spacing
@@ -302,9 +299,9 @@ struct Bar {
#ifdef BAD_CUTLASS_SWAP
namespace cutlass {

// don't do this
template<class T>
-void swap(T& a, T& b) // don't do this
-{
+void swap(T& a, T& b) {
  T tmp = a;
  a = b;
  b = tmp;
@@ -324,8 +321,7 @@ using cutlass::swap;
// and that T is constrained via
// std::enable_if or a requires clause.
template<class T>
-void foo(T& a, T& b)
-{
+void foo(T& a, T& b) {
  // The usual idiom for using std::swap is the "swap two-step":
  //
  // 1. import std::swap into the current scope, then
@@ -340,8 +336,7 @@ void foo(T& a, T& b)

} // namespace other

-int main()
-{
+int main() {
  int x = 42;
  int y = 43;
  other::foo(x, y);
@@ -415,8 +410,7 @@ struct my_computation_result {

my_computation_result my_computation(float tolerance);

-void foo(float tolerance)
-{
+void foo(float tolerance) {
  // Approach 1: Use structured binding. The names
  // you choose on the left-hand side have nothing
  // to do with the struct, so it's up to you
@@ -523,8 +517,7 @@ struct foo_result {
  bool success = false;
};

-foo_result foo(std::span<const float> input)
-{
+foo_result foo(std::span<const float> input) {
  // ... code ...

  // Prefer this. We know what type the function returns.
@@ -539,8 +532,7 @@ However, note that this won't work if the function returns `auto`.
The general rule is to avoid code duplication.

```c++
-auto foo(std::span<const float> input)
-{
+auto foo(std::span<const float> input) {
  // ... code ...

  if constexpr (some_condition) {
@@ -619,7 +611,7 @@ Members within classes and structures should be organized as follows:

This convention follows the
[CUB library](https://nvlabs.github.io/cub/)
and is also described by
[Howard Hinnant](https://howardhinnant.github.io/classdecl.html).
It also approximates the usual ordering of chapters
in a typical Systems and Controls textbook.
@@ -772,7 +764,7 @@ Use `#pragma once` to guard all headers.

#### CUDA Built-in Variables

Avoid direct access to CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within
CUTLASS components except in special circumstances.

Using built-in global variables directly within reusable components necessitates that all components
use them consistently, which may not be possible if CUTLASS components are used in other contexts.
@@ -587,9 +587,8 @@ To instantiate all operations supporting all tile sizes, data types, and alignme
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
```

The above command line generates about twenty thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
-Compiling thousands of kernels for three different architectures is time consuming. Additionaly, this would also result
+Compiling thousands of kernels for three different architectures is time-consuming. Additionally, this would also result
in a large binary size and, on some platforms, cause the linker to fail when building the library.

Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size