|
||||
This document describes the layout of the CUTLASS repository. The main components are:
|
||||
|
||||
* **CUTLASS Template Library** - CUDA Templates for Linear Algebra Subroutines and Solvers (header only)
|
||||
* **CuTe Template Library** - CUTLASS's core vocabulary layout type and associated algebra (header only)
|
||||
* **CUTLASS Utilities** - Additional templates
|
||||
* **CUTLASS Instance Library** - instantiations of CUTLASS templates covering the design space
|
||||
* **CUTLASS Profiler** - CUTLASS Library, Profiler, and Utilities
|
||||
CUTLASS Templates are implemented by header files in the following directory structure:
|
||||
|
||||
```
|
||||
include/ # Top-level include directory. Client applications should target this path.
|
||||
|
||||
cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
|
||||
|
||||
arch/ # direct exposure of architecture features (including instruction-level GEMMs)
|
||||
  ...
|
||||
gemm/ # code specialized for general matrix product computations
|
||||
thread/ # thread-level operators
|
||||
warp/ # warp-level operators
|
||||
collective/ # 3.x API operators for all threads a tiled mma/copy are built over
|
||||
threadblock/ # CTA-level operators
|
||||
kernel/ # CUDA kernel entry points
|
||||
device/ # launches kernel(s) over a full device
|
||||
|
||||
* # scope-agnostic components and basic vocabulary type definitions for GEMM
|
||||
|
||||
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
|
||||
*
|
||||
  ...
|
||||
threadblock/ # CTA-level operators
|
||||
kernel/ # CUDA kernel entry points
|
||||
device/ # launches kernel(s) over a full device
|
||||
|
||||
* # scope-agnostic components and basic vocabulary type definitions
|
||||
|
||||
transform/ # code specialized for layout, type, and domain transformations
|
||||
thread/ # thread-level operators
|
||||
  ...
|
||||
util/ # miscellaneous CUTLASS components
|
||||
*
|
||||
* # core vocabulary types and fundamental arithmetic operators
|
||||
|
||||
cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
|
||||
algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples
|
||||
arch/ # Bare bones PTX wrapper structs for copy and math instructions
|
||||
atom/ # Meta-information either linked to or built from arch/ operators
|
||||
mma_atom.hpp # cute::Mma_Atom and cute::TiledMma
|
||||
copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy
|
||||
*sm*.hpp # Arch specific meta-information for copy and math operations
|
||||
container/ # Core container types used across CuTe, namely, cute::tuple
|
||||
numeric/ # CuTe's internal numerics implementation
|
||||
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
|
||||
```
|
||||
|
||||
See [Programming Guidelines](/media/docs/programming_guidelines.md) for further details about
|
||||
conventions and design patterns used throughout CUTLASS.
|
||||
|
||||
## CuTe
|
||||
|
||||
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations. More documentation
|
||||
for CuTe can be found in [`/media/docs/cute/`](/media/docs/cute/).
|
||||
|
||||
## Tools
|
||||
|
||||
The `tools/` directory contains clients of the CUTLASS Template library and includes the following.
|
||||
```
examples/
  ...
|
||||
|
||||
11_planar_complex_array/ # example demonstrating planar complex kernels with batch-specific problem sizes
|
||||
|
||||
|
||||
12_gemm_bias_relu/ # example demonstrating GEMM fused with bias and relu activation function
|
||||
|
||||
|
||||
13_fused_two_gemms/ # example demonstrating two GEMMs fused into one kernel
|
||||
```
|
||||
|
||||
## Media
|
||||
|
||||
media/docs/cute/00_quickstart.md
|
||||
# Getting Started With CuTe
|
||||
|
||||
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.
|
||||
|
||||
The core abstractions of CuTe are hierarchically multidimensional layouts, which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
|
||||
|
||||
## System Requirements
|
||||
|
||||
CuTe shares CUTLASS 3.0's software requirements,
|
||||
including NVCC with a C++17 host compiler.
|
||||
|
||||
## Knowledge prerequisites
|
||||
|
||||
CuTe is a CUDA C++ library. It requires C++17
|
||||
(the revision of the C++ Standard that was released in 2017).
|
||||
|
||||
Throughout this tutorial, we assume intermediate C++ experience.
|
||||
For example, we assume that readers know
|
||||
how to read and write templated functions and classes, and
|
||||
how to use the `auto` keyword to deduce a function's return type.
|
||||
We will be gentle with C++ and explain some things
|
||||
that you might already know.
|
||||
|
||||
We also assume intermediate CUDA experience.
|
||||
For example, readers must know
|
||||
the difference between device and host code,
|
||||
and how to launch kernels.
|
||||
|
||||
## Building Tests and Examples
|
||||
|
||||
CuTe's tests and examples build and run as part of CUTLASS's normal build process.
|
||||
CuTe's unit tests live in the [`test/unit/cute`](../../../test/unit/cute) subdirectory.
|
||||
Its examples live in the [`examples/cute`](../../../examples/cute) subdirectory.
|
||||
|
||||
## Library Organization
|
||||
|
||||
CuTe is a header-only C++ library, so there is no source code that needs building. Library headers are contained within the top level [`include/cute`](../../../include/cute) directory, with components of the library grouped by directories that represent their semantics.
|
||||
|
||||
| Directory | Contents |
|
||||
|------------------------|------------------------|
|
||||
| [`include/cute`](../../../include/cute) | Each header in the top level corresponds to one of the fundamental building blocks of CuTe, such as [`Layout`](../../../include/cute/layout.hpp) or [`Tensor`](../../../include/cute/tensor.hpp). |
|
||||
| [`include/cute/container`](../../../include/cute/container) | Implementations of STL-like container objects, such as tuple, array, aligned array, and array views. |
|
||||
| [`include/cute/numeric`](../../../include/cute/numeric) | Templates for fundamental numeric data types, such as nonstandard floating-point types, unsigned integers, complex numbers, and integer sequences. |
|
||||
| [`include/cute/algorithm`](../../../include/cute/algorithm) | Implementations of utility algorithms such as copy, fill, and clear that automatically leverage architecture-specific features if available. |
|
||||
| [`include/cute/arch`](../../../include/cute/arch) | Wrappers for architecture-specific matrix-matrix multiply and copy instructions. |
|
||||
| [`include/cute/atom`](../../../include/cute/atom) | Meta-information for instructions in `arch` and utilities like partitioning and tiling. |
|
||||
|
||||
## Tutorial
|
||||
|
||||
This directory contains a CuTe tutorial in Markdown format.
|
||||
The file
|
||||
[`0x_gemm_tutorial.md`](./0x_gemm_tutorial.md)
|
||||
explains how to implement dense matrix-matrix multiply using CuTe components.
|
||||
It gives a broad overview of CuTe and thus would be a good place to start.
|
||||
|
||||
Other files in this directory discuss specific parts of CuTe.
|
||||
|
||||
* [`01_layout.md`](./01_layout.md) describes `Layout`, CuTe's core abstraction.
|
||||
|
||||
* [`02_layout_operations.md`](./02_layout_operations.md) describes more advanced `Layout` operations and the CuTe layout algebra.
|
||||
|
||||
* [`03_tensor.md`](./03_tensor.md) describes `Tensor`,
|
||||
a multidimensional array abstraction which composes `Layout`
|
||||
with an array of data.
|
||||
|
||||
* [`04_algorithms.md`](./04_algorithms.md) summarizes CuTe's
|
||||
generic algorithms that operate on `Tensor`s.
|
||||
|
||||
* [`0t_mma_atom.md`](./0t_mma_atom.md) demonstrates CuTe's meta-information and interface to our GPUs'
|
||||
architecture-specific Matrix Multiply-Accumulate (MMA) instructions.
|
||||
|
||||
* [`0x_gemm_tutorial.md`](./0x_gemm_tutorial.md) provides a walkthrough of building a GEMM from scratch using CuTe.
|
||||
|
||||
* [`0y_predication.md`](./0y_predication.md) explains what to do
|
||||
if a tiling doesn't fit evenly into a matrix.
|
||||
media/docs/cute/01_layout.md
|
||||
# CuTe Layouts
|
||||
|
||||
## Layout
|
||||
|
||||
This document describes `Layout`, CuTe's core abstraction.
|
||||
A `Layout` maps from (a) logical coordinate space(s)
|
||||
to a physical index space.
|
||||
|
||||
`Layout`s present a common interface to multidimensional array access
|
||||
that abstracts away the details of how the array's elements are organized in memory.
|
||||
This lets users write algorithms that access multidimensional arrays generically,
|
||||
so that layouts can change, without users' code needing to change.
|
||||
|
||||
CuTe also provides an "algebra of `Layout`s."
|
||||
`Layout`s can be combined and manipulated
|
||||
to construct more complicated layouts
|
||||
and to partition them across other layouts.
|
||||
This can help users do things like partition layouts of data over layouts of threads.
|
||||
|
||||
## Layouts and Tensors
|
||||
|
||||
Any of the `Layout`s discussed in this section can be composed with data -- a pointer or an array -- to create a `Tensor`. The responsibility of the `Layout` is to define valid coordinate space(s) and, therefore, the logical shape of the data and map those into an index space. The index space is precisely the offset that would be used to index into the array of data.
|
||||
|
||||
For details on `Tensor`, please refer to the
|
||||
[`Tensor` section of the tutorial](./03_tensor.md).
|
||||
|
||||
## Shapes and Strides
|
||||
|
||||
A `Layout` is a pair of `Shape` and `Stride`.
|
||||
Both `Shape` and `Stride` are `IntTuple` types.
|
||||
|
||||
### IntTuple
|
||||
|
||||
An `IntTuple` is an integer or a tuple of `IntTuple`s.
|
||||
This means that `IntTuple`s can be arbitrarily nested.
|
||||
Operations defined on `IntTuple`s include the following.
|
||||
|
||||
* `get<I>(IntTuple)`: The `I`th element of the `IntTuple`. Note that `get<0>` is defined for integer `IntTuples`.
|
||||
|
||||
* `rank(IntTuple)`: The number of elements in an `IntTuple`. An int has rank 1, a tuple has rank `tuple_size`.
|
||||
|
||||
* `depth(IntTuple)`: The number of hierarchical `IntTuple`s. An int has depth 0, a tuple has depth 1, a tuple that contains a tuple has depth 2, etc.
|
||||
|
||||
* `size(IntTuple)`: The product of all elements of the IntTuple.
|
||||
|
||||
We write `IntTuple`s with parentheses to denote the hierarchy. E.g. `6`, `(2)`, `(4,3)`, and `(3,(6,2),8)` are all `IntTuple`s.
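
To make the operations above concrete, here is a small sketch (not part of the original text) that assumes `#include <cute/tensor.hpp>` and `using namespace cute;` are in scope:

```c++
// The IntTuple (3,(6,2),8): two static modes and one run-time integer.
auto t = make_shape(Int<3>{}, make_shape(Int<6>{}, Int<2>{}), 8);
print(rank(t));   // prints _3  -- three top-level elements
print(depth(t));  // prints _2  -- a tuple that itself contains a tuple
print(size(t));   // prints 288 -- the product 3 * 6 * 2 * 8
print(get<1>(t)); // prints (_6,_2), the nested IntTuple
```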
|
||||
|
||||
## Layout
|
||||
|
||||
A `Layout` is then a pair of `IntTuple`s. The first defines the abstract *shape* of the layout and the second defines the *strides*, which map from coordinates within the shape to the index space.
|
||||
|
||||
As a pair of `IntTuple`s, we can define many similar operations on `Layout`s including
|
||||
|
||||
* `get<I>(Layout)`: The `I`th sub-layout of the `Layout`.
|
||||
|
||||
* `rank(Layout)`: The number of modes in a `Layout`.
|
||||
|
||||
* `depth(Layout)`: The number of hierarchical `Layout`s. An int has depth 0, a tuple has depth 1, a tuple that contains a tuple has depth 2, etc.
|
||||
|
||||
* `shape(Layout)`: The shape of the `Layout`.
|
||||
|
||||
* `stride(Layout)`: The stride of the `Layout`.
|
||||
|
||||
* `size(Layout)`: The logical extent of the `Layout`. Equivalent to `size(shape(Layout))`.
|
||||
|
||||
### Hierarchical access functions
|
||||
|
||||
`IntTuple`s and thus `Layout`s can be arbitrarily nested.
|
||||
For convenience, we define versions of some of the above functions
|
||||
that take a sequence of integers, instead of just one integer.
|
||||
This makes it possible to access elements
|
||||
inside of nested `IntTuple` or `Layout`.
|
||||
For example, we permit `get<I...>(x)`, where `I...` here
|
||||
and throughout this section is a "C++ parameter pack"
|
||||
that denotes zero or more (integer) template arguments.
|
||||
That is, `get<I0,I1,...,IN>(x)` is equivalent to
|
||||
`get<IN>(` $\dots$ `(get<I1>(get<I0>(x)))` $\dots$ `))`,
|
||||
where the ellipses are pseudocode and not actual C++ syntax.
|
||||
These hierarchical access functions include the following.
|
||||
|
||||
* `rank<I...>(x) := rank(get<I...>(x))`. The rank of the `I...`th element of `x`.
|
||||
|
||||
* `depth<I...>(x) := depth(get<I...>(x))`. The depth of the `I...`th element of `x`.
|
||||
|
||||
* `size<I...>(x) := size(get<I...>(x))`. The size of the `I...`th element of `x`.
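
For instance, on a nested layout (an illustrative sketch, again assuming `using namespace cute;`):

```c++
// Compact column-major layout with shape (2,(4,3)).
auto layout = make_layout(make_shape(Int<2>{}, make_shape(Int<4>{}, Int<3>{})));
print(rank<1>(layout));    // prints _2  -- rank of the nested mode (4,3)
print(size<1>(layout));    // prints _12 -- 4 * 3
print(size<1,0>(layout));  // prints _4  -- extent of the first sub-mode of mode 1
```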
|
||||
|
||||
### Vector examples
|
||||
|
||||
Then, we can define a vector as any `Shape` and `Stride` pair with `rank == 1`.
|
||||
For example, the `Layout`
|
||||
|
||||
```
|
||||
Shape: (8)
|
||||
Stride: (1)
|
||||
```
|
||||
|
||||
defines a contiguous 8-element vector.
|
||||
Similarly, with a stride of `(2)`,
|
||||
the interpretation is that the eight elements
|
||||
are stored at positions 0, 2, 4, $\dots$.
|
||||
|
||||
By the above definition, we *also* interpret
|
||||
|
||||
```
|
||||
Shape: ((4,2))
|
||||
Stride: ((1,4))
|
||||
```
|
||||
|
||||
as a vector, since its shape is rank 1. The inner shape describes a 4x2 layout of data in column-major order, but the extra pair of parentheses suggests we can interpret those two modes as a single 1-D 8-element vector instead. Due to the strides, the elements are also contiguous.
|
||||
|
||||
### Matrix examples
|
||||
|
||||
Generalizing, we define a matrix as any `Shape` and `Stride` pair with rank 2. For example,
|
||||
|
||||
```
|
||||
Shape: (4,2)
|
||||
Stride: (1,4)
|
||||
0 4
|
||||
1 5
|
||||
2 6
|
||||
3 7
|
||||
```
|
||||
|
||||
is a 4x2 column-major matrix, and
|
||||
|
||||
```
|
||||
Shape: (4,2)
|
||||
Stride: (2,1)
|
||||
0 1
|
||||
2 3
|
||||
4 5
|
||||
6 7
|
||||
```
|
||||
|
||||
is a 4x2 row-major matrix.
|
||||
|
||||
Each of the modes of the matrix can also be split into *multi-indices* like the vector example.
|
||||
This lets us express more layouts beyond just row major and column major. For example,
|
||||
|
||||
```
|
||||
Shape: ((2,2),2)
|
||||
Stride: ((4,1),2)
|
||||
0 2
|
||||
4 6
|
||||
1 3
|
||||
5 7
|
||||
```
|
||||
|
||||
is also logically 4x2, with a stride of 2 across the rows but a multi-stride down the columns.
|
||||
Since this layout is logically 4x2,
|
||||
like the column-major and row-major examples above,
|
||||
we can _still_ use 2-D coordinates to index into it.
|
||||
|
||||
## Constructing a `Layout`
|
||||
|
||||
A `Layout` can be constructed in many different ways.
|
||||
It can include any combination of compile-time (static) integers
|
||||
or run-time (dynamic) integers.
|
||||
|
||||
```c++
|
||||
auto layout_8s = make_layout(Int<8>{});
|
||||
auto layout_8d = make_layout(8);
|
||||
|
||||
auto layout_2sx4s = make_layout(make_shape(Int<2>{},Int<4>{}));
|
||||
auto layout_2sx4d = make_layout(make_shape(Int<2>{},4));
|
||||
|
||||
auto layout_2x4 = make_layout(make_shape (2, make_shape (2,2)),
|
||||
make_stride(4, make_stride(2,1)));
|
||||
```
|
||||
|
||||
## Using a `Layout`
|
||||
|
||||
The fundamental use of a `Layout` is to map between logical coordinate space(s) and index space. For example, to print an arbitrary rank-2 layout, we can write the function
|
||||
|
||||
```c++
|
||||
template <class Shape, class Stride>
|
||||
void print2D(Layout<Shape,Stride> const& layout)
|
||||
{
|
||||
for (int m = 0; m < size<0>(layout); ++m) {
|
||||
for (int n = 0; n < size<1>(layout); ++n) {
|
||||
printf("%3d ", layout(m,n));
|
||||
}
|
||||
printf("\n");
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
which produces the following output for the above examples.
|
||||
|
||||
```
|
||||
> print2D(layout_2sx4s)
|
||||
0 2 4 6
|
||||
1 3 5 7
|
||||
> print2D(layout_2sx4d)
|
||||
0 2 4 6
|
||||
1 3 5 7
|
||||
> print2D(layout_2x4)
|
||||
0 2 1 3
|
||||
4 6 5 7
|
||||
```
|
||||
|
||||
The multi-indices within the `layout_2x4` example are handled as expected and interpreted as a rank-2 layout.
|
||||
|
||||
Note that for `layout_2x4`, we're using a 1-D coordinate for a 2-D multi-index in the second mode. In fact, we can generalize this and treat all of the above layouts as 1-D layouts. For instance, the following `print1D` function
|
||||
|
||||
```c++
|
||||
template <class Shape, class Stride>
|
||||
void print1D(Layout<Shape,Stride> const& layout)
|
||||
{
|
||||
for (int i = 0; i < size(layout); ++i) {
|
||||
printf("%3d ", layout(i));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
produces the following output for the above examples.
|
||||
|
||||
```
|
||||
> print1D(layout_8s)
|
||||
0 1 2 3 4 5 6 7
|
||||
> print1D(layout_8d)
|
||||
0 1 2 3 4 5 6 7
|
||||
> print1D(layout_2sx4s)
|
||||
0 1 2 3 4 5 6 7
|
||||
> print1D(layout_2sx4d)
|
||||
0 1 2 3 4 5 6 7
|
||||
> print1D(layout_2x4)
|
||||
0 4 2 6 1 5 3 7
|
||||
```
|
||||
|
||||
This shows explicitly that all of the layouts are simply folded views of an 8-element array.
|
||||
|
||||
## Summary
|
||||
|
||||
* The `Shape` of a `Layout` defines its coordinate space(s).
|
||||
|
||||
* Every `Layout` has a 1-D coordinate space.
|
||||
This can be used to iterate in a "generalized-column-major" order.
|
||||
|
||||
* Every `Layout` has an R-D coordinate space,
|
||||
where R is the rank of the layout.
|
||||
These spaces are ordered _colexicographically_
|
||||
(reading right to left, instead of "lexicographically,"
|
||||
which reads left to right).
|
||||
The enumeration of that order
|
||||
corresponds to the 1-D coordinates above.
|
||||
|
||||
* Every `Layout` has an h-D coordinate space where h is "hierarchical." These are ordered colexicographically and the enumeration of that order corresponds to the 1-D coordinates above. An h-D coordinate is congruent to the `Shape` so that each element of the coordinate has a corresponding element of the `Shape`.
|
||||
|
||||
* The `Stride` of a `Layout` maps coordinates to indices.
|
||||
|
||||
* In general, this could be any function from 1-D coordinates (integers) to indices (integers).
|
||||
|
||||
* In `CuTe` we use an inner product of the h-D coordinates with the `Stride` elements.
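
As a small worked example of that inner product (illustrative, not from the original text): for shape (3,(2,4)) and stride (2,(1,6)), the h-D coordinate (i,(j,k)) maps to the index i*2 + j*1 + k*6.

```c++
auto layout = make_layout(make_shape (3, make_shape (2,4)),
                          make_stride(2, make_stride(1,6)));
print(layout(make_coord(1, make_coord(1,2)))); // 1*2 + 1*1 + 2*6 = 15
```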
|
||||
media/docs/cute/02_layout_operations.md
|
||||
# CuTe Layout Operations
|
||||
|
||||
CuTe provides an "algebra of `Layout`s."
|
||||
`Layout`s can be combined and manipulated
|
||||
to construct more complicated `Layout`s.
|
||||
This includes tiling and partitioning `Layout`s across other `Layout`s.
|
||||
In this section, we explain some of these core operations in detail.
|
||||
|
||||
## How do I print CuTe objects on host or device?
|
||||
|
||||
CuTe comes with different ways to print CuTe objects.
|
||||
You can print human-readable text,
|
||||
or you can print LaTeX commands for generating
|
||||
a beautifully formatted and colored table
|
||||
describing the CuTe object.
|
||||
Both of these can be helpful for reasoning about or debugging
|
||||
layouts, copy atoms, or matrix multiply atoms
|
||||
(don't worry, we'll explain all of these things in this tutorial).
|
||||
|
||||
CuTe's print functions work on either host or device.
|
||||
Note that on device, printing is expensive.
|
||||
Even just leaving print code in place on device,
|
||||
even if it is never called
|
||||
(e.g., printing in an `if` branch that is not taken at run time),
|
||||
may generate slower code.
|
||||
Thus, be sure to remove code that prints on device after debugging.
|
||||
|
||||
The following code examples assume that you have a
|
||||
`using namespace cute;` statement in scope.
|
||||
|
||||
### Printing human-readable text
|
||||
|
||||
The `cute::print` function has overloads for almost all CuTe types, including Pointers, Layouts, Shapes, Strides, and Tensors. When in doubt, try calling `print` on it. You might also want to print only on thread 0 of each thread block, or only on block 0 of the grid. The `thread0()` function returns true only for global thread 0 of the kernel. A typical idiom for printing CuTe objects is to print only on thread 0 of block 0:
|
||||
|
||||
```c++
|
||||
if (thread0()) {
|
||||
print(some_cute_object);
|
||||
}
|
||||
```
|
||||
|
||||
Some algorithms do different things on different threads or blocks,
|
||||
so you might sometimes need to print on threads or blocks other than zero.
|
||||
The header file
|
||||
[`cute/util/debug.hpp`](../../../include/cute/util/debug.hpp),
|
||||
among other utilities,
|
||||
includes the function `bool thread(int tid, int bid)`
|
||||
that returns `true` if running on thread `tid` and block `bid`.
|
||||
|
||||
Some CuTe types have special printing functions that use a different output format.
|
||||
For example, `print_layout` can display a rank-2 layout in a table
|
||||
(using plain text formatting).
|
||||
It has an overload taking a rank-2 matrix layout and a thread layout,
|
||||
that displays a table with the mapping between threads and values.
|
||||
|
||||
Some CuTe types might not have overloads for `print`,
|
||||
but there are other ways to print their contents.
|
||||
For example, copy atoms and mma atoms
|
||||
(see elsewhere in this tutorial)
|
||||
have a `print_all()` member function.
|
||||
|
||||
### Printing LaTeX output
|
||||
|
||||
The `cute::print_latex` function works like `cute::print`,
|
||||
but prints LaTeX commands that you can use
|
||||
to generate a nicely formatted and colored table.
|
||||
|
||||
## Fundamental types
|
||||
|
||||
### Layout and its components
|
||||
|
||||
This directory includes
|
||||
[an overview of CuTe's fundamental types for describing layouts](./01_layout.md).
|
||||
|
||||
#### Tuple
|
||||
|
||||
CuTe starts with a Tuple, which is a finite ordered list of zero or more elements.
|
||||
In C++, we identify a Tuple with the
|
||||
[`cute::tuple` class](../../../include/cute/container/tuple.hpp).
|
||||
`cute::tuple` behaves like `std::tuple`, but it works on device or host,
|
||||
and it imposes restrictions on its template arguments for performance and simplicity.
|
||||
|
||||
#### IntTuple
|
||||
|
||||
CuTe then defines an IntTuple as either an integer, or a Tuple of IntTuple.
|
||||
This recursive definition lets us build arbitrarily nested layouts.
|
||||
In C++, we identify an IntTuple with [`IntTuple`](../../../include/cute/int_tuple.hpp),
|
||||
which is just an alias of `cute::tuple`.
|
||||
Any of the following are thus valid template arguments of IntTuple.
|
||||
|
||||
1. "Run-time integers" (or "static integers")
|
||||
are just ordinary integral types like `int` or `size_t`.
|
||||
|
||||
2. "Compile-time integers" include `std::integral_constant`
|
||||
or subclasses of it that CuTe defines,
|
||||
such as `Int<Value>` (see below).
|
||||
These types all have in common
|
||||
that the value is encoded in the type itself
|
||||
(as a public `static constexpr` member named `value`).
|
||||
CuTe defines aliases `_1`, `_2`, `_3` etc.
|
||||
to the types `Int<1>`, `Int<2>`, `Int<3>` etc.
|
||||
|
||||
3. `IntTuple` with any valid template arguments.
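
For example (a sketch, assuming `using namespace cute;`), a Shape can freely mix all three kinds:

```c++
auto shape = make_shape(8, Int<4>{}, make_shape(_2{}, _2{}));  // prints as (8,_4,(_2,_2))
```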
|
||||
|
||||
CuTe reuses IntTuple for many different things,
|
||||
including Shape, Stride, Step, and Coord
|
||||
(see [`include/cute/layout.hpp`](../../../include/cute/layout.hpp)).
|
||||
In C++, Shape, Stride, Step, and Coord are all aliases for IntTuple.
|
||||
|
||||
### Layout
|
||||
|
||||
A Layout is a tuple of (Shape, Stride).
|
||||
Semantically, it implements a mapping from
|
||||
a "logical" Shape-shaped (multidimensional) index,
|
||||
to a "physical" 1-D index into an array.
|
||||
Here is an example of a 2 x 3 array with static strides (3, 1).
|
||||
|
||||
```c++
|
||||
Layout layout = make_layout(make_shape (_2{}, _3{}),
|
||||
make_stride(_3{}, _1{}));
|
||||
print_layout(layout);
|
||||
for (int i = 0; i < size(layout); ++i) {
|
||||
print(layout(i));
|
||||
print(", ");
|
||||
}
|
||||
print("\n");
|
||||
print(layout(1, 1));
|
||||
print("\n");
|
||||
```
|
||||
|
||||
This code produces the following text output.
|
||||
|
||||
```text
|
||||
(_2,_3):(_3,_1)
|
||||
0 1 2
|
||||
+---+---+---+
|
||||
0 | 0 | 1 | 2 |
|
||||
+---+---+---+
|
||||
1 | 3 | 4 | 5 |
|
||||
+---+---+---+
|
||||
0, 3, 1, 4, 2, 5,
|
||||
4
|
||||
```
|
||||
|
||||
`print(layout(1, 1))` prints the mapping of
|
||||
the logical 2-D coordinate (1,1) to the 1-D index 4.
|
||||
You can see that from the table,
|
||||
which shows the left logical index as the "row,"
|
||||
and the right logical index as the "column."
|
||||
|
||||
### Underscore (`_`)
|
||||
|
||||
An Underscore is a special type used for array slices. The underscore punctuation `_` is a constant instance of Underscore. It acts like `:` (the colon punctuation) in Python or Fortran array slices. See [`include/cute/underscore.hpp`](../../../include/cute/underscore.hpp).
|
||||
|
||||
### Tile
|
||||
|
||||
"A Tile is not a Layout, it's a tuple of Layouts or Tiles or Underscores."
|
||||
See [`include/cute/tile.hpp`](../../../include/cute/tile.hpp).
|
||||
|
||||
The algebraic layout operations discussed below are defined on `Layout`s, but `Tile` allows these operations to recurse and to be applied to sublayouts or particular modes of a given Layout. These are referred to as by-mode operations.
|
||||
|
||||
See the section on "Logical Divide" to see an example of using `Tile` to extract portions of a row-mode and portions of a column-mode independently.
|
||||
|
||||
## Layout definitions and operations
|
||||
|
||||
### Layouts are functions from integers (logical 1-D coordinate) to integers (1-D index)
|
||||
|
||||
The `for` loop in the above print example shows how CuTe identifies 1-D coordinates with a column-major layout of logical 2-D coordinates. Iterating from `i = 0` to `size(layout)` (which is 6), and indexing into our layout with the single integer coordinate `i`, traverses the layout in column-major fashion, even though this is a row-major layout. You can see this from the output of the `for` loop (0, 3, 1, 4, 2, 5). CuTe calls this index `i` a "1-D coordinate," versus the "natural coordinate," which would be the logical 2-D coordinate.
|
||||
|
||||
If you're familiar with the C++23 feature `mdspan`,
|
||||
this is an important difference between
|
||||
`mdspan` layout mappings and CuTe `Layout`s.
|
||||
`mdspan` layout mappings are *one way*:
|
||||
they always take a multidimensional logical coordinate,
|
||||
and they return an integer offset.
|
||||
Depending on the strides,
|
||||
the offset may skip over elements of the physical 1-D array.
|
||||
Thus, `mdspan`'s offset does NOT mean the same thing as
|
||||
the 1-D logical coordinate `i` in the `for` loop above.
|
||||
You can iterate correctly over any CuTe `Layout`
|
||||
by using the 1-D logical coordinate.
|
||||
`mdspan` doesn't have an idea of a 1-D logical coordinate.
|
||||
|
||||
### Rank, depth, size, cosize
|
||||
|
||||
*Rank*: the tuple size of the layout's shape.
|
||||
|
||||
*Depth*: the depth of the layout's shape. A single integer has depth 0. A tuple has depth 1 + the max depth of its components.
|
||||
|
||||
*Size*: Size of the shape; size of the domain of the function. This is the product of all extents in the layout's shape.
|
||||
|
||||
*Cosize*: Size of the function's codomain (not necessarily the range); for a layout A, A(size(A) - 1) + 1. (Here, we use size(A) - 1 as a 1-D logical coordinate input.)
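
A small sketch (illustrative, assuming `using namespace cute;`) that distinguishes size from cosize:

```c++
auto a = make_layout(make_shape(_2{},_2{}), make_stride(_1{},_4{}));  // (2,2):(1,4)
print(size(a));    // prints _4 -- the domain has four coordinates
print(cosize(a));  // prints _6 -- a(size(a) - 1) + 1 = a((1,1)) + 1 = 5 + 1
```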
|
||||
|
||||
### Layout compatibility
|
||||
|
||||
We say that layouts A and B are *compatible* if their shapes are compatible. Shape A is compatible with shape B if any natural coordinate of A is also a valid coordinate for B.
|
||||
|
||||
### Flatten
|
||||
|
||||
The `flatten` operation "un-nests" a potentially nested Layout. For example,
|
||||
|
||||
```c++
|
||||
Layout layout = Layout<Shape <Shape <_4, _3>, _1>,
|
||||
Stride<Stride<_3, _1>, _0>>{};
|
||||
Layout flat_layout = flatten(layout);
|
||||
```
|
||||
|
||||
results in `flat_layout` having the following type
|
||||
|
||||
```text
|
||||
Layout<Shape<_4, _3, _1>, Stride<_3, _1, _0>>
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```c++
|
||||
Layout layout = Layout<Shape <_4, Shape <_4, _2>>,
|
||||
Stride<_4, Stride<_1, _16>>>{};
|
||||
Layout flat_layout = flatten(layout);
|
||||
```
|
||||
|
||||
results in `flat_layout` having the following type
|
||||
|
||||
```text
|
||||
Layout<Shape<_4, _4, _2>, Stride<_4, _1, _16>>
|
||||
```
|
||||
|
||||
Hierarchical Layouts and flattening let us reinterpret tensors in place as matrices, matrices as vectors, vectors as matrices, etc. This lets us implement arbitrary tensor contractions as batched matrix multiply, by combining the contraction modes into a single mode, and combining the A, B, C, and "batch" modes as needed to reach the desired form.
|
||||
|
||||
### Coalesce
|
||||
|
||||
The `coalesce` operation first flattens the layout, then combines all the modes that are possible to combine, starting with mode 0 (the leftmost mode) and moving right. If all the modes can be combined, then this results in a 1-D layout expressing what array elements the original layout accesses.
|
||||
|
||||
For example,
|
||||
|
||||
```text
|
||||
layout: (_2,(_1,_6)):(_1,(_6,_2))
|
||||
coalesce(layout): _12:_1
|
||||
```
|
||||
|
||||
What does it mean to "combine" modes? In the above example, the flattened layout is (2, 1, 6) : (1, 6, 2).
|
||||
|
||||
1. If we look at the leftmost two modes, this is just a vector of length 2 and stride 1. The middle mode has extent 1, so the corresponding stride 6 would not be observed anyway. This leaves us with (2, 6) : (1, 2).
|
||||
|
||||
2. The intermediate result (2, 6) : (1, 2) is just a 2 x 6 column-major matrix, which can be coalesced into a vector of length 12 and stride 1.
|
||||
|
||||
More formally, "combining all the modes" means a left fold, where the binary operation that combines two modes has three cases.
|
||||
|
||||
1. If the leftmost layout is s1:d1, and the next layout is 1:d0, then combine into s1:d1. This generalizes Step 1 above. If a mode has extent 1, we can't observe its stride, so we can skip the mode.
|
||||
|
||||
2. If the leftmost layout is 1:d1, and the next layout is s0:d0, then combine into s0:d0. Again, if a mode has extent 1, we can't observe its stride, so we can skip the mode.
|
||||
|
||||
3. If the leftmost layout is s1:d1, and the next layout is s0 : s1*d1, then combine into s0 * s1 : d1. This generalizes Step 2 above. One can call this "noticing a column-major layout sequence."
|
||||
|
||||
That's it! For example, the result of coalescing the row-major layout (2, 2) : (2, 1) is (2, 2) : (2, 1), the same layout, because none of the above three cases applies.
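
In code, the two cases above might look like this sketch (assuming `using namespace cute;`):

```c++
auto layout = make_layout(make_shape (_2{}, make_shape (_1{},_6{})),
                          make_stride(_1{}, make_stride(_6{},_2{})));
print(coalesce(layout));    // prints _12:_1

auto row_major = make_layout(make_shape(_2{},_2{}), make_stride(_2{},_1{}));
print(coalesce(row_major)); // prints (_2,_2):(_2,_1) -- unchanged, no case applies
```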
|
||||
|
||||
### Complement
|
||||
|
||||
#### Definition
|
||||
|
||||
The complement B of a layout A with respect to an integer M satisfies the following properties.
|
||||
|
||||
1. $A$ and $B$ are *disjoint*: $A(x) \neq B(x)$ for all $x \neq 0$ in the domain of $A$.
|
||||
|
||||
2. B is *ordered*: $B(x-1) < B(x)$ for all $x$ in $\{1, \dots, size(B) - 1\}$.
|
||||
|
||||
3. B is *bounded* by M: $size(B) \geq M / size(A)$, and $cosize(B) \leq floor(M / cosize(A)) * cosize(A)$.
|
||||
|
||||
Regarding disjointness: we need to specify $x \neq 0$ because CuTe layouts are linear. That is, if the domain is nonempty, the range always contains zero.
|
||||
|
||||
Regarding the ordered property: CuTe layouts are hierarchically strided, so this implies that if size(B) is nonzero, then the strides of B are all positive.
|
||||
|
||||
#### Examples
|
||||
|
||||
complement(4:1, 24) is 6:4.
|
||||
|
||||
1. The result is disjoint of 4:1, so it must have a stride of at least 4 (since it includes 0, but must skip over 1, 2, 3).
|
||||
|
||||
2. The size of the result is $\geq 24 / 4 = 6$.

3. The cosize of the result is $\leq (24 / 4) * 4 = 24$.

4. With a stride of at least 4 (Step 1) and a size of at least 6 (Step 2), the cosize is at least $5 * 4 + 1 = 21$; the only ordered layout that satisfies all of these bounds is 6:4 (whose cosize is 21).
|
||||
|
||||
complement(6:4, 24) is 4:1.
|
||||
|
||||
1. 4:1 is disjoint of 6:4, but so is s:d
|
||||
for any s > 0 and d > 20.
|
||||
|
||||
2. The size of the result is $\geq 24 / 6 = 4$.
|
||||
|
||||
3. The cosize of the result is $\leq (24 / 21) * 21 = 21$.
|
||||
|
||||
4. The stride cannot be greater than 20
|
||||
(else (2) would contradict (3)),
|
||||
so it must be less than 4.
|
||||
|
||||
5. This leaves 4:1 by elimination.
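
Both examples can be checked directly (an illustrative sketch, assuming `using namespace cute;`):

```c++
print(complement(Layout<_4,_1>{}, Int<24>{}));  // prints _6:_4
print(complement(Layout<_6,_4>{}, Int<24>{}));  // prints _4:_1
```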
|
||||
|
||||
### Composition
|
||||
|
||||
Layouts are functions, so composition of layouts is just composition of functions. The composition $A \circ B$ means "apply the layout B first, then treat the result as a 1-D logical coordinate input to the layout A, and apply A to it." Very often, this composition can be represented as another Layout.
|
||||
|
||||
#### Rules for computing composition
|
||||
|
||||
Both humans and CuTe compute composition using the following rules.
|
||||
|
||||
1. $A \circ B$ has a shape that is compatible with B. In function composition, the rightmost function defines the domain. For `Layout`s this means that any valid coordinate for $B$ can also be used as a coordinate for $A \circ B$.
|
||||
|
||||
2. Concatenation: A layout can be expressed as the concatenation of its sublayouts. We denote concatenation with parentheses: $B = (B_0,B_1,...)$. The CuTe function `make_layout`, when given zero or more `Layout`s, concatenates them.
|
||||
|
||||
3. Composition is (left-)distributive with concatenation: $A \circ B = A \circ (B0, B1, ...) = (A \circ B0, A \circ B1, ...)$.
|
||||
|
||||
4. "Base case": For layouts $A = a : b$ and $B = c : d$ with integral shape and stride, $A \circ B = R = c : (b * d)$.
|
||||
|
||||
5. By-mode composition: Let $\langle B, C \rangle$ (angle brackets, not parentheses)
|
||||
denote a tuple of two layouts B and C, not their concatenation. Let A = (A0, A1).
|
||||
Then, $A \circ \langle B, C \rangle = (A0, A1) \circ \langle B, C \rangle = (A0 \circ B, A1 \circ C)$.
|
||||
This allows the application of composition independently to sublayouts of $A$.
|
||||
|
||||
#### Examples: Reshape a vector into a matrix
|
||||
|
||||
This section gives two composition examples. Both start with a vector with layout $20:2$ (that is, the vector has 20 elements, and the stride between each is 2). They compose this vector with a 4 x 5 matrix layout. This effectively "reshapes" the vector in place into a matrix.
|
||||
|
||||
##### Example 1
|
||||
|
||||
$20:2 \circ (4,5) : (1,4)$.
|
||||
|
||||
This describes interpreting the vector $20:2$
|
||||
as a 4 x 5 column-major matrix.
|
||||
|
||||
The resulting layout has shape $(4,5)$,
|
||||
because in function composition,
|
||||
the rightmost function defines the domain.
|
||||
What are the strides?
|
||||
|
||||
1. A layout can be expressed as the concatenation of its sublayouts,
|
||||
so $(4,5) : (1,4)$ is $(4:1, 5:4)$.
|
||||
|
||||
2. Composition is distributive, so
|
||||
$20:2 \circ (4:1, 5:4)$ is $(20:2 \circ 4:1, 20:2 \circ 5:4)$.
|
||||
|
||||
3. $20:2 \circ 4:1$ has shape 4 (rightmost function defines the domain)
|
||||
and stride $2 = 2 \cdot 1$.
|
||||
|
||||
4. $20:2 \circ 5:4$ has shape 5 and stride $8 = 2 \cdot 4$.
|
||||
|
||||
5. Result: (4:2, 5:8), which by concatenation is (4,5) : (2,8).
|
||||
|
||||
##### Example 2
|
||||
|
||||
$20:2 \circ (4,5) : (5,1)$.
|
||||
|
||||
This describes interpreting the vector 20:2
|
||||
as a 4 x 5 row-major matrix.
|
||||
|
||||
The resulting layout has shape $(4,5)$, just as before. What are the strides?
|
||||
|
||||
1. By deconcatenation, $(4,5) : (5,1)$ is $(4:5, 5:1)$.
|
||||
|
||||
2. Composition is distributive, so $20:2 \circ (4:5, 5:1)$ is $(20:2 \circ 4:5, 20:2 \circ 5:1)$.
|
||||
|
||||
3. $20:2 \circ 4:5$ has shape $4$ and stride $10 = 2 \cdot 5$.
|
||||
|
||||
4. $20:2 \circ 5:1$ has shape $5$ and stride $2 = 2 \cdot 1$.
|
||||
|
||||
5. Result: (4:10, 5:2), which by concatenation is (4,5) : (10,2).
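
Both reshape examples can be reproduced with `composition` (a sketch, assuming `using namespace cute;`):

```c++
Layout vec  = Layout<_20,_2>{};                                         // 20:2
Layout colM = composition(vec, Layout<Shape<_4,_5>, Stride<_1,_4>>{});  // (4,5):(2,8)
Layout rowM = composition(vec, Layout<Shape<_4,_5>, Stride<_5,_1>>{});  // (4,5):(10,2)
```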
|
||||
|
||||
### Product
|
||||
|
||||
CuTe includes four different kinds of layout products.
|
||||
|
||||
1. `logical_product`
|
||||
|
||||
2. `blocked_product`
|
||||
|
||||
3. `raked_product`
|
||||
|
||||
4. `tiled_product`
|
||||
|
||||
`logical_product(A, B)` results in a layout where each element of layout B
|
||||
has been replaced by a "copy" of layout A.
|
||||
The other three products offer variations of this idea.
|
||||
|
||||
#### Example: Tiled matrix
|
||||
|
||||
Suppose that I want to make a matrix consisting of 3 x 4 tiles
|
||||
in a row-major arrangement,
|
||||
where each tile is a 2 x 2 column-major matrix.
|
||||
|
||||
The Layout of each tile (tile) has Shape (2,2) and Stride (1,2).
|
||||
|
||||
The Layout of the "matrix of tiles" (`matrix_of_tiles`)
|
||||
has Shape (3,4) and Stride (4,1).
|
||||
|
||||
##### Blocked product: the intuitive tiling
|
||||
|
||||
If I were to deduce by hand what the layout of the tiled matrix should be,
|
||||
it would look like this.
|
||||
|
||||
| | (0,0) | (1,0) | (0,1) | (1,1) | (0,2) | (1,2) | (0,3) | (1,3) |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| (0,0) | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 |
|
||||
| (1,0) | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 |
|
||||
| (0,1) | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
|
||||
| (1,1) | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 |
|
||||
| (0,2) | 32 | 34 | 36 | 38 | 40 | 42 | 44 | 46 |
|
||||
| (1,2) | 33 | 35 | 37 | 39 | 41 | 43 | 45 | 47 |
|
||||
|
||||
The row and column labels use the equivalence of 1-D logical coordinates and 2-D column-major coordinates. The left index in each pair is the row resp. column coordinate of the tile, while the right index in each pair is the row resp. column coordinate of the matrix-of-tiles. The resulting layout has Shape ((2, 3), (2, 4)), and Stride ((1, 16), (2, 4)), and the second mode can be coalesced. The Shape ((2, 3), (2, 4)) is hierarchical, but it is still rank-2 and can be drawn in 2D as above. Note how the row mode of the tile remains part of the row mode of the product, and the column mode of the tile remains a column mode of the product.
|
||||
|
||||
The above layout is what `blocked_product(tile, matrix_of_tiles)` produces.
|
||||
A critical use case for blocked product is "tiling" an "atom"
|
||||
(some tile that relates to a hardware feature) over a matrix.
|
||||
|
||||
```c++
|
||||
Layout tile = Layout<Shape <_2,_2>,
|
||||
Stride<_1,_2>>{};
|
||||
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
|
||||
Stride<_4,_1>>{};
|
||||
|
||||
print_layout(blocked_product(tile, matrix_of_tiles));
|
||||
```
|
||||
|
||||
##### Logical product
|
||||
|
||||
The logical product `logical_product(tile, matrix_of_tiles)`
|
||||
results in Shape ((2, 2), (3, 4)) and Stride ((1, 2), (16, 4)).
|
||||
|
||||
| | (0,0) | (1,0) | (2,0) | (0,1) | (1,1) | (2,1) | (0,2) | (1,2) | (2,2) | (0,3) | (1,3) | (2,3) |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| (0,0) | 0 | 16 | 32 | 4 | 20 | 36 | 8 | 24 | 40 | 12 | 28 | 44 |
|
||||
| (1,0) | 1 | 17 | 33 | 5 | 21 | 37 | 9 | 25 | 41 | 13 | 29 | 45 |
|
||||
| (0,1) | 2 | 18 | 34 | 6 | 22 | 38 | 10 | 26 | 42 | 14 | 30 | 46 |
|
||||
| (1,1) | 3 | 19 | 35 | 7 | 23 | 39 | 11 | 27 | 43 | 15 | 31 | 47 |
|
||||
|
||||
Note how the tile appears in the leftmost column and is reproduced
|
||||
in each column in the same order as the matrix-of-tiles. That is,
|
||||
the tile can be indexed through the first mode of the result and the
|
||||
matrix-of-tiles can be indexed through the second mode.
|
||||
|
||||
```c++
|
||||
Layout tile = Layout<Shape <_2,_2>,
|
||||
Stride<_1,_2>>{};
|
||||
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
|
||||
Stride<_4,_1>>{};
|
||||
|
||||
print_layout(logical_product(tile, matrix_of_tiles));
|
||||
```
|
||||
|
||||
##### Raked product
|
||||
|
||||
The raked product `raked_product(tile, matrix_of_tiles)` results in
|
||||
Shape ((3, 2), (4, 2)) and Stride ((16, 1), (4, 2)).
|
||||
|
||||
| | (0,0) | (1,0) | (2,0) | (3,0) | (0,1) | (1,1) | (2,1) | (3,1) |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| (0,0) | 0 | 4 | 8 | 12 | 2 | 6 | 10 | 14 |
|
||||
| (1,0) | 16 | 20 | 24 | 28 | 18 | 22 | 26 | 30 |
|
||||
| (2,0) | 32 | 36 | 40 | 44 | 34 | 38 | 42 | 46 |
|
||||
| (0,1) | 1 | 5 | 9 | 13 | 3 | 7 | 11 | 15 |
|
||||
| (1,1) | 17 | 21 | 25 | 29 | 19 | 23 | 27 | 31 |
|
||||
| (2,1) | 33 | 37 | 41 | 45 | 35 | 39 | 43 | 47 |
|
||||
|
||||
The tile is now interleaved or "raked" with the other 3x4 matrix-of-tiles
|
||||
instead of appearing as blocks. Other references call this a cyclic
|
||||
distribution.
|
||||
|
||||
This might look familiar if you have ever used ScaLAPACK.
|
||||
It expresses a 2-D block cyclic distribution of a 6 x 8 matrix
|
||||
over 4 processes in a 2 x 2 "process grid." See
|
||||
["The Two-dimensional Block-Cyclic Distribution"](https://netlib.org/scalapack/slug/node75.html#sec2dbcd)
|
||||
and
|
||||
["Local Storage Scheme and Block-Cyclic Mapping"](https://netlib.org/scalapack/slug/node76.html#seclocalstorage)
|
||||
in the ScaLAPACK Users' Guide.
|
||||
|
||||
In general, `logical_product` and these variations can produce any interleaving,
|
||||
including blocked, cyclic, by-mode blocked/cyclic, and intermediate interleavings
|
||||
that don't have common names.
|
||||
|
||||
```c++
|
||||
Layout tile = Layout<Shape <_2,_2>,
|
||||
Stride<_1,_2>>{};
|
||||
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
|
||||
Stride<_4,_1>>{};
|
||||
|
||||
print_layout(raked_product(tile, matrix_of_tiles));
|
||||
```
|
||||
|
||||
### Division
|
||||
|
||||
The previous section covered layout products,
|
||||
that reproduce one layout over another.
|
||||
This section covers layout *division*.
|
||||
Functions that divide a layout into components are useful
|
||||
as a basis for tiling and partitioning layouts.
|
||||
|
||||
For example, consider folding a vector into a matrix.
|
||||
We could imagine an operation, called `logical_divide`,
|
||||
|
||||
```c++
|
||||
Layout vec = Layout<_16,_3>{}; // 16 : 3
|
||||
Layout col = Layout< _4,_1>{}; // 4 : 1
|
||||
Layout mat = logical_divide(vec, col); // (4,4) : (3,12)
|
||||
```
|
||||
|
||||
that "takes" the first 4 elements of the vector into the first mode
|
||||
and leaves the "rest" in the second mode. This is a column-major matrix
|
||||
view of the data in `vec`.
|
||||
What if we want a row-major matrix view?
|
||||
|
||||
```c++
|
||||
Layout vec = Layout<_16,_3>{}; // 16 : 3
|
||||
Layout col = Layout< _4,_4>{}; // 4 : 4
|
||||
Layout mat = logical_divide(vec, col); // (4,4) : (12,3)
|
||||
```
|
||||
|
||||
Now, every fourth element of the vector is in the first mode and
|
||||
the "rest" are in the second mode.
|
||||
Multidimensional, hierarchical indices let us extend this operation
|
||||
to any layout that "divides" the vector.
|
||||
|
||||
```c++
|
||||
Layout vec = Layout<_16,_3>{}; // 16 : 3
|
||||
Layout col = Layout< _4,_2>{}; // 4 : 2
|
||||
Layout mat = logical_divide(vec, col); // (4,(2,2)) : (6,(3,24))
|
||||
```
|
||||
|
||||
```c++
|
||||
Layout vec = Layout<_16,_3>{}; // 16 : 3
|
||||
|
||||
Layout col = Layout<Shape <_2,_2>,
|
||||
Stride<_4,_1>>{}; // (2,2) : (4,1)
|
||||
Layout mat = logical_divide(vec, col); // ((2,2),(2,2)) : ((12,3),(6,24))
|
||||
```
|
||||
|
||||
All of the above examples produce a 4x4 matrix
|
||||
that can be indexed and treated like a normal 4x4 matrix,
|
||||
but each has a different underlying layout.
|
||||
Thus, our algorithms can be written using logical coordinates,
|
||||
without needing to address the detailed indexing that each layout requires.
|
||||
|
||||
CuTe includes 3 different kinds of layout division operations.
|
||||
|
||||
1. `logical_divide`
|
||||
|
||||
2. `zipped_divide`
|
||||
|
||||
3. `tiled_divide`
|
||||
|
||||
We will summarize these in the sections that follow.
|
||||
|
||||
#### Logical divide : the intuitive tiling
|
||||
|
||||
Suppose I have the 6 x 8 matrix from the Raked Product section
|
||||
and want to "collect" the `tile`, turning the Raked Product into
|
||||
the Blocked Product.
|
||||
|
||||
To do this, we would like to gather two elements from the column
|
||||
and leave the rest, then gather two elements from the row and leave the rest.
|
||||
Thus, we want to apply `logical_divide` independently to the rows and cols
|
||||
in order to retrieve the appropriate elements.
|
||||
|
||||
In code, we copy the Layout from the result of the Raked Product section, then
|
||||
specify the elements in the rows and cols we would like to gather.
|
||||
|
||||
```c++
|
||||
Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
|
||||
Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
|
||||
Tile subtile = make_tile(Layout<_2,_3>{}, // Gather elements 2 : 3 from mode 0
|
||||
Layout<_2,_4>{}); // Gather elements 2 : 4 from mode 1
|
||||
|
||||
print_layout(logical_divide(raked_prod, subtile));
|
||||
```
|
||||
|
||||
Indeed, this does produce the result from the Blocked Product section.
|
||||
|
||||
| | (0,0) | (1,0) | (0,1) | (1,1) | (0,2) | (1,2) | (0,3) | (1,3) |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| (0,0) | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 |
|
||||
| (1,0) | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 |
|
||||
| (0,1) | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
|
||||
| (1,1) | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 |
|
||||
| (0,2) | 32 | 34 | 36 | 38 | 40 | 42 | 44 | 46 |
|
||||
| (1,2) | 33 | 35 | 37 | 39 | 41 | 43 | 45 | 47 |
|
||||
|
||||
Of course, any other rearrangement of the rows and cols is also valid.
|
||||
|
||||
#### Zipped divide
|
||||
|
||||
The `zipped_divide` function applies `logical_divide`, and then gathers the
|
||||
"subtiles" into a single mode and the "rest" into a single mode.
|
||||
|
||||
For example, if we apply `zipped_divide` instead of `logical_divide` in the example above,
|
||||
|
||||
```c++
|
||||
Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
|
||||
Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
|
||||
Tile subtile = make_tile(Layout<_2,_3>{}, // Gather elements 2 : 3 from mode 0
|
||||
Layout<_2,_4>{}); // Gather elements 2 : 4 from mode 1
|
||||
|
||||
print_layout(zipped_divide(raked_prod, subtile));
|
||||
```
|
||||
|
||||
then we get the result
|
||||
|
||||
| | (0,0) | (1,0) | (2,0) | (0,1) | (1,1) | (2,1) | (0,2) | (1,2) | (2,2) | (0,3) | (1,3) | (2,3) |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| (0,0) | 0 | 16 | 32 | 4 | 20 | 36 | 8 | 24 | 40 | 12 | 28 | 44 |
|
||||
| (1,0) | 1 | 17 | 33 | 5 | 21 | 37 | 9 | 25 | 41 | 13 | 29 | 45 |
|
||||
| (0,1) | 2 | 18 | 34 | 6 | 22 | 38 | 10 | 26 | 42 | 14 | 30 | 46 |
|
||||
| (1,1) | 3 | 19 | 35 | 7 | 23 | 39 | 11 | 27 | 43 | 15 | 31 | 47 |
|
||||
|
||||
Note that this is the same layout as the result in the Logical Product section.
|
||||
That is, the first mode is our original tile (and can be interpreted as a 2x2 matrix itself)
|
||||
and the second mode is its logical layout within the raked layout.
|
||||
|
||||
##### More Examples of Divide
|
||||
|
||||
For brevity, shapes can be used with `logical_divide` and `tiled_divide` to quickly split and tile modes of a tensor. For example, this C++ code
|
||||
|
||||
```c++
|
||||
Layout layout = Layout<Shape <_12, _32,_6>,
|
||||
Stride< _1,_128,_0>>{};
|
||||
Shape tile_shape = make_shape(_4{},_8{});
|
||||
Layout logical_divided_tile = logical_divide(layout, tile_shape);
|
||||
Layout zipped_divided_tile = zipped_divide(layout, tile_shape);
|
||||
|
||||
print("layout : "); print(layout); print("\n");
|
||||
print("tile_shape : "); print(tile_shape); print("\n");
|
||||
print("logical_divided_tile : "); print(logical_divided_tile); print("\n");
|
||||
print("zipped_divided_tile : "); print(zipped_divided_tile); print("\n\n");
|
||||
```
|
||||
|
||||
produces the following output when we vary `layout`.
|
||||
|
||||
```text
|
||||
full_layout : (_12,_32,_6):(_1,_128,_0)
|
||||
tile_shape : (_4,_8)
|
||||
logical_divided_tile : ((_4,_3),(_8,_4),_6):((_1,_4),(_128,_1024),_0)
|
||||
zipped_divided_tile : ((_4,_8),(_3,_4,_6)):((_1,_128),(_4,_1024,_0))
|
||||
|
||||
full_layout : (_12,(_4,_8),_6):(_1,(_32,_512),_0)
|
||||
tile_shape : (_4,_8)
|
||||
logical_divided_tile : ((_4,_3),((_4,_2),_4),_6):((_1,_4),((_32,_512),_1024),_0)
|
||||
zipped_divided_tile : ((_4,(_4,_2)),(_3,_4,_6)):((_1,(_32,_512)),(_4,_1024,_0))
|
||||
```
|
||||
|
||||
This code
|
||||
|
||||
```c++
|
||||
Layout layout = make_layout(Shape<_8,_8>{},
|
||||
Stride<_8,_1>{});
|
||||
Layout tile = make_tile(make_layout(Shape<_4>{}),
|
||||
make_layout(Shape<_2>{}));
|
||||
print("layout: ");
|
||||
print_layout(layout);
|
||||
print("\n");
|
||||
print("tile: ");
|
||||
print(tile);
|
||||
print("\n");
|
||||
print("logical_divide: ");
|
||||
print_layout(logical_divide(layout, tile));
|
||||
print("zipped_divide: ");
|
||||
print_layout(zipped_divide(layout, tile));
|
||||
```
|
||||
|
||||
results in the following layouts.
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/logical_divide-and-zipped_divide.png" alt="logical_divide-and-zipped_divide" height="400"/>
|
||||
</p>
|
||||
|
||||
This code
|
||||
|
||||
```c++
|
||||
Layout layout = make_layout(Shape<_8,_8>{},
|
||||
Stride<_8,_1>{});
|
||||
Layout tile = make_tile(make_layout(Shape<_2>{}),
|
||||
make_layout(Shape<_4>{}));
|
||||
print("layout: ");
|
||||
print_layout(layout);
|
||||
print("\n");
|
||||
print("tile: ");
|
||||
print(tile);
|
||||
print("\n");
|
||||
print("logical_divide: ");
|
||||
print_layout(logical_divide(layout, tile));
|
||||
print("zipped_divide: ");
|
||||
print_layout(zipped_divide(layout, tile));
|
||||
```
|
||||
|
||||
results in the following layouts.
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/logical_divide-and-zipped_divide-2.png" alt="logical_divide-and-zipped_divide-2" height="400"/>
|
||||
</p>
|
||||
|
||||
#### Tiled divide
|
||||
|
||||
The `tiled_divide` function works like `zipped_divide`,
|
||||
except that it unpacks the second mode. This is useful when you have a `Tile` that describes all of the elements for a particular operation, for example, and want to gather those together but retain the logical shape of those tiles within the original layout. That is,
|
||||
|
||||
```text
|
||||
Layout Shape : (M, N, L, ...)
|
||||
Tile Shape : <M', N'>
|
||||
Tiled Result : ((M', N'), m, n, L, ...)
|
||||
```
|
||||
|
||||
where `m` is `M / M'` and `n` is `N / N'`.
|
||||
We can consider `m` as the "number of `Tile`s in `M`" and `n` as the "number of `Tile`s in `N`". This style of operation is common when applying MMA Atoms and Copy Atoms.
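
For instance (an illustrative sketch, assuming `using namespace cute;`; the printed result follows from the division rules above):

```c++
Layout layout = Layout<Shape<_8,_8>, Stride<_8,_1>>{};  // 8x8 row-major
auto   tiler  = make_shape(_4{}, _2{});                 // 4x2 tiles
print(tiled_divide(layout, tiler));  // ((_4,_2),_2,_4):((_8,_1),_32,_2)
```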
|
||||
media/docs/cute/03_tensor.md
|
||||
# CuTe Tensors
|
||||
|
||||
## A Tensor is a multidimensional array
|
||||
|
||||
CuTe's `Tensor` class represents a multidimensional array.
|
||||
The array's elements can live in any kind of memory,
|
||||
including global memory, shared memory, and register memory.
|
||||
|
||||
### Array access
|
||||
|
||||
Users access a `Tensor`'s elements in one of three ways:
|
||||
|
||||
* `operator()`, taking as many integral arguments as the number of modes,
|
||||
corresponding to the element's (possibly) multidimensional logical index;
|
||||
|
||||
* `operator()`, taking a `Coord` (an `IntTuple` of the logical indices); or
|
||||
|
||||
* `operator[]`, taking a `Coord` (an `IntTuple` of the logical indices).
|
||||
|
||||
### Slices: Get a Tensor accessing a subset of elements
|
||||
|
||||
Users can get a "slice" of a `Tensor`,
|
||||
that is, a `Tensor` that accesses a subset of elements
|
||||
of the original `Tensor`.
|
||||
|
||||
Slices happen through the same `operator()`
|
||||
that is used for accessing an individual element.
|
||||
Passing in `_` (the underscore character, an instance of `Underscore`)
|
||||
has the same effect as `:` (the colon character) in Fortran or Matlab:
|
||||
the resulting slice accesses all indices in that mode of the `Tensor`.
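
For instance (a sketch; the tensor name and extents are illustrative):

```c++
Tensor A = make_tensor<float>(make_shape(Int<4>{}, Int<8>{}));  // owning 4x8 tensor
float x = A(3, 5);             // element access via a 2-D logical coordinate
float y = A[make_coord(3, 5)]; // the same element via a Coord
Tensor col5 = A(_, 5);         // slice: a 4-element view of column 5
```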
|
||||
|
||||
### Tensor's behavior determined by its Layout and Engine
|
||||
|
||||
A `Tensor`'s behavior is entirely determined by its two components,
|
||||
which correspond to its two template parameters: `Engine` and `Layout`.
|
||||
|
||||
For a description of `Layout`,
|
||||
please refer to [the `Layout` section](./01_layout.md)
|
||||
of this tutorial, or the [GEMM overview](./0x_gemm_tutorial.md).
|
||||
|
||||
An `Engine` represents a one-dimensional array of elements.
|
||||
When users perform array access on a `Tensor`,
|
||||
the `Tensor` uses its `Layout` to map from a logical coordinate
|
||||
to a one-dimensional index.
|
||||
Then, the `Tensor` uses its `Engine`
|
||||
to map the one-dimensional index
|
||||
to a reference to the element.
|
||||
You can see this in `Tensor`'s implementation of array access.
|
||||
|
||||
```c++
|
||||
decltype(auto) operator[](Coord const& coord) {
|
||||
return engine().begin()[layout()(coord)];
|
||||
}
|
||||
```
|
||||
|
||||
One could summarize almost all CuTe use cases as follows:
|
||||
|
||||
* create `Layout`s,
|
||||
|
||||
* create `Tensor`s with those `Layout`s, and
|
||||
|
||||
* invoke (either CuTe's, or custom) algorithms on those `Tensor`s.
|
||||
|
||||
### Ownership of the elements
|
||||
|
||||
`Tensor`s can be owning or nonowning.
|
||||
|
||||
"Owning" `Tensor`s behave like `std::array`.
|
||||
When you copy the `Tensor`, you (deep-)copy its elements,
|
||||
and the `Tensor`'s destructor deallocates the array of elements.
|
||||
|
||||
"Nonowning" `Tensor`'s behave like a (raw) pointer to the elements.
|
||||
Copying the `Tensor` doesn't copy the elements,
|
||||
and destroying the `Tensor` doesn't deallocate the array of elements.
|
||||
|
||||
Whether a `Tensor` is "owning" or "nonowning" depends entirely on its `Engine`.
|
||||
This has implications for developers of generic `Tensor` algorithms.
|
||||
For example, input `Tensor` parameters of a function
|
||||
should be passed by const reference,
|
||||
because passing the `Tensor`s by value
|
||||
might make a deep copy of the `Tensor`'s elements.
|
||||
It might also *not* make a deep copy of the elements;
|
||||
there's no way to know without specializing the algorithm
|
||||
on the `Tensor`'s `Engine` type.
|
||||
Similarly, output or input/output `Tensor` parameters of a function
|
||||
should be passed by (nonconst) reference.
|
||||
Returning a `Tensor` might (or might not)
|
||||
make a deep copy of the elements.
|
||||
|
||||
The various overloads of the `copy_if` algorithm in
|
||||
[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp)
|
||||
take their `src` (input, source of the copy) parameter
|
||||
as `Tensor<SrcEngine, SrcLayout> const&`,
|
||||
and take their `dst` (output, destination of the copy) parameter
|
||||
as `Tensor<DstEngine, DstLayout>&`.
|
||||
Additionally, there are overloads for mutable temporaries like
|
||||
`Tensor<DstEngine, DstLayout>&&`
|
||||
so that these algorithms can be applied directly to slices,
|
||||
as in the following example.
|
||||
|
||||
```c++
|
||||
copy(src_tensor(_,3), dst_tensor(2,_));
|
||||
```
|
||||
|
||||
In C++ terms, each of the expressions
|
||||
`src_tensor(_,3)`, and `dst_tensor(2,_)`
|
||||
is in the "prvalue"
|
||||
[value category](https://en.cppreference.com/w/cpp/language/value_category),
|
||||
because it is a function call expression
|
||||
whose return type is nonreference.
|
||||
(In this case, calling `Tensor::operator()`
|
||||
with at least one `_` (`Underscore`) argument
|
||||
returns a `Tensor`.)
|
||||
The prvalue `dst_tensor(2,_)` won't match
|
||||
the `copy` overload taking
|
||||
`Tensor<DstEngine, DstLayout>&`,
|
||||
because prvalues can't be bound to
|
||||
nonconst lvalue references (single `&`).
|
||||
However, it will match the `copy` overload taking
|
||||
`Tensor<DstEngine, DstLayout>&&`
|
||||
(note the two `&&` instead of one `&`).
|
||||
Calling the latter overload binds the reference
|
||||
to the prvalue `dst_tensor(2,_)`.
|
||||
This results in
|
||||
[creation of a temporary](https://en.cppreference.com/w/cpp/language/implicit_conversion#Temporary_materialization)
|
||||
`Tensor` that is passed into `copy`.
|
||||
|
||||
### CuTe's provided `Engine` types
|
||||
|
||||
CuTe comes with three `Engine` types.
|
||||
|
||||
* `ArrayEngine<class T, int N>`: an owning `Engine`,
|
||||
representing an array of `N` elements of type `T`
|
||||
|
||||
* `ViewEngine<Iterator>`: a nonowning `Engine`,
|
||||
where `Iterator` is a random access iterator
|
||||
(either a pointer to an array, or something that acts like one)
|
||||
|
||||
* `ConstViewEngine<Iterator>`: a nonowning `Engine`,
|
||||
which is the view-of-const-elements version of `ViewEngine`
|
||||
|
||||
### "Tags" for different kinds of memory
|
||||
|
||||
`ViewEngine` and `ConstViewEngine` wrap pointers to various kinds of memory.
|
||||
Users can "tag" the memory with its space -- e.g., global or shared --
|
||||
by calling `make_gmem_ptr(g)` when `g` is a pointer to global memory,
|
||||
or `make_smem_ptr(s)` when `s` is a pointer to shared memory.
|
||||
|
||||
Tagging memory makes it possible for CuTe's `Tensor` algorithms
|
||||
to use the fastest implementation for the specific kind of memory.
|
||||
It also avoids incorrect memory access.
|
||||
For example, some kinds of optimized copy operations require
|
||||
that the source of the copy be in global memory,
|
||||
and the destination of the copy be in shared memory.
|
||||
Tagging makes it possible for CuTe to dispatch
|
||||
to those optimized copy operations where possible.
|
||||
CuTe does this by specializing `Tensor` algorithms
|
||||
on the `Tensor`'s `Engine` type.
|
||||
|
||||
### Engine members
|
||||
|
||||
In order for a type to be valid for use as an `Engine`,
|
||||
it must have the following public members.
|
||||
|
||||
```c++
|
||||
using value_type = /* ... the value type ... */;
|
||||
using iterator = /* ... the iterator type ... */;
|
||||
iterator begin() /* sometimes const */;
|
||||
```
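
As an illustration only (not a CuTe type), a minimal nonowning engine
could look like the following hypothetical sketch;
CuTe's own `ViewEngine` is the real implementation.

```c++
// Hypothetical minimal nonowning engine, for illustration only.
template <class T>
struct MyViewEngine {
  using value_type = T;
  using iterator   = T*;

  iterator begin() const { return ptr_; }   // iterator to the first element

  T* ptr_;   // the engine merely wraps a pointer; it owns nothing
};
```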
|
||||
|
||||
## Constructing a Tensor
|
||||
|
||||
### Nonowning view of existing memory
|
||||
|
||||
A `Tensor` can be a nonowning view of existing memory.
|
||||
For this use case, users can create the `Tensor` by calling `make_tensor`
|
||||
with two arguments: a wrapped pointer to the memory to view, and the `Layout`.
|
||||
Users wrap the pointer by identifying its memory space:
|
||||
e.g., global memory (via `make_gmem_ptr`) or shared memory (via `make_smem_ptr`).
|
||||
`Tensor`s that view existing memory can have either static or dynamic `Layout`s.
|
||||
|
||||
Here are some examples of creating `Tensor`s
|
||||
that are nonowning views of existing memory.
|
||||
|
||||
```c++
|
||||
// Global memory (static or dynamic layouts)
|
||||
Tensor gmem_8s = make_tensor(make_gmem_ptr(A), Int<8>{});
|
||||
Tensor gmem_8d = make_tensor(make_gmem_ptr(A), 8);
|
||||
Tensor gmem_8sx16d = make_tensor(make_gmem_ptr(A), make_shape(Int<8>{},16));
|
||||
Tensor gmem_8dx16s = make_tensor(make_gmem_ptr(A), make_shape ( 8 ,Int<16>{}),
|
||||
make_stride(Int<16>{},Int< 1>{}));
|
||||
|
||||
// Shared memory (static or dynamic layouts)
|
||||
Shape smem_shape = make_shape(Int<4>{},Int<8>{});
|
||||
__shared__ T smem[decltype(size(smem_shape))::value]; // (static-only allocation)
|
||||
Tensor smem_4x8_col = make_tensor(make_smem_ptr(&smem[0]), smem_shape);
|
||||
Tensor smem_4x8_row = make_tensor(make_smem_ptr(&smem[0]), smem_shape, GenRowMajor{});
|
||||
```
|
||||
|
||||
### Owning array of register memory
|
||||
|
||||
A `Tensor` can also be an owning array of register memory.
|
||||
For this use case, users can create the `Tensor`
|
||||
by calling `make_tensor<T>(layout)`,
|
||||
where `T` is the type of each element of the array,
|
||||
and `layout` is the `Tensor`'s `Layout`.
|
||||
Owning `Tensor`s must have a static `Layout`,
|
||||
as CuTe does not perform dynamic memory allocation in `Tensor`s.
|
||||
|
||||
Here are some examples of creating owning `Tensor`s.
|
||||
|
||||
```c++
|
||||
// Register memory (static layouts only)
|
||||
Tensor rmem_4x8_col = make_tensor<float>(make_shape(Int<4>{},Int<8>{}));
|
||||
Tensor rmem_4x8_row = make_tensor<float>(make_shape(Int<4>{},Int<8>{}), GenRowMajor{});
|
||||
Tensor rmem_4x8_mix = make_tensor<float>(make_shape (Int<4>{},Int< 8>{}),
|
||||
make_stride(Int<2>{},Int<32>{}));
|
||||
Tensor rmem_8 = make_fragment_like(gmem_8sx16d(_,0));
|
||||
```
|
||||
|
||||
The `make_fragment_like` function makes an owning Tensor of register memory,
|
||||
with the same shape as its input `Tensor` argument.
|
||||
|
||||
## Tensor use examples
|
||||
|
||||
### Copy rows of a matrix from global memory to registers
|
||||
|
||||
The following example copies rows of a matrix (with any `Layout`)
|
||||
from global memory to register memory,
|
||||
then executes some algorithm `do_something`
|
||||
on the row that lives in register memory.
|
||||
|
||||
```c++
|
||||
Tensor gmem = make_tensor(make_gmem_ptr(A), make_shape(Int<8>{}, 16));
|
||||
Tensor rmem = make_fragment_like(gmem(_, 0));
|
||||
for (int j = 0; j < size<1>(gmem); ++j) {
|
||||
copy(gmem(_, j), rmem);
|
||||
do_something(rmem);
|
||||
}
|
||||
```
|
||||
|
||||
This code does not need to know anything about the `Layout` of `gmem`
|
||||
other than that it is rank-2 and that the first mode is a compile-time value.
|
||||
The following code checks both of those conditions at compile time.
|
||||
|
||||
```c++
|
||||
CUTE_STATIC_ASSERT_V(rank(gmem) == Int<2>{});
|
||||
CUTE_STATIC_ASSERT_V(is_static<decltype(shape<0>(gmem))>{});
|
||||
```
|
||||
|
||||
A `Tensor` encapsulates the data type, data location,
|
||||
and possibly also the shape and stride of the tensor at compile time.
|
||||
As a result, `copy` can dispatch, based on the types and Layouts of its arguments,
|
||||
to use any of various synchronous or asynchronous hardware copy instructions
|
||||
and can auto-vectorize the copy instructions in many cases as well.
|
||||
CuTe's `copy` algorithm lives in
|
||||
[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp).
|
||||
For more details on the algorithms that CuTe provides,
|
||||
please refer to [the algorithms section](./04_algorithms.md)
|
||||
of the tutorial, or the
|
||||
[CuTe overview in the GEMM tutorial](./0x_gemm_tutorial.md).
|
||||
|
||||
# CuTe Tensor algorithms
|
||||
|
||||
This section summarizes the interfaces and implementations
|
||||
of common numerical algorithms performed on `Tensor`s.
|
||||
|
||||
The implementation of these algorithms may be found in the
|
||||
[include/cute/algorithm/](../../../include/cute/algorithm/)
|
||||
directory.
|
||||
|
||||
## `copy`
|
||||
|
||||
CuTe's `copy` algorithm copies the elements of a source `Tensor`
|
||||
into the elements of a destination `Tensor`.
|
||||
The various overloads of `copy` can be found in
|
||||
[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp).
|
||||
|
||||
### Interface and specialization opportunities
|
||||
|
||||
A `Tensor` encapsulates the data type, data location,
|
||||
and possibly also the shape and stride of the tensor at compile time.
|
||||
As a result, `copy` can and does dispatch,
|
||||
based on the types of its arguments,
|
||||
to use any of various synchronous or asynchronous hardware copy instructions.
|
||||
|
||||
The `copy` algorithm has two main overloads.
|
||||
The first just takes the source `Tensor` and the destination `Tensor`.
|
||||
|
||||
```c++
|
||||
template <class SrcEngine, class SrcLayout,
|
||||
class DstEngine, class DstLayout>
|
||||
CUTE_HOST_DEVICE
|
||||
void
|
||||
copy(Tensor<SrcEngine, SrcLayout> const& src,
|
||||
Tensor<DstEngine, DstLayout> & dst);
|
||||
```
|
||||
|
||||
The second takes those two parameters, plus a `Copy_Atom`.
|
||||
|
||||
```c++
|
||||
template <class... CopyArgs,
|
||||
class SrcEngine, class SrcLayout,
|
||||
class DstEngine, class DstLayout>
|
||||
CUTE_HOST_DEVICE
|
||||
void
|
||||
copy(Copy_Atom<CopyArgs...> const& copy_atom,
|
||||
Tensor<SrcEngine, SrcLayout> const& src,
|
||||
Tensor<DstEngine, DstLayout> & dst);
|
||||
```
|
||||
|
||||
The two-parameter `copy` overload picks a default implementation
|
||||
based only on the types of the two `Tensor` parameters.
|
||||
The `Copy_Atom` overload lets callers override that default
|
||||
by specifying a nondefault `copy` implementation.
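
For example, a caller might pin the copy Operation explicitly.
(This is a sketch; `UniversalCopy` is CuTe's simple element-wise copy Operation,
and `src_tensor` and `dst_tensor` are assumed to be compatible `Tensor`s of `float`.)

```c++
Copy_Atom<UniversalCopy<float>, float> atom;
copy(atom, src_tensor, dst_tensor);   // same effect as copy(src_tensor, dst_tensor),
                                      // but with the copy Operation chosen explicitly
```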
|
||||
|
||||
### Parallelism and synchronization depend on parameter types
|
||||
|
||||
Either the default implementation or
|
||||
the implementation selected by a `Copy_Atom` overload
|
||||
may use none or all available parallelism,
|
||||
and may have a variety of synchronization semantics.
|
||||
The behavior depends on `copy`'s parameter types.
|
||||
Users are expected to figure this out based on their knowledge
|
||||
of the architecture on which they are running.
|
||||
(Developers often write a custom optimized kernel
|
||||
for each GPU architecture.)
|
||||
|
||||
The `copy` algorithm may be sequential per thread,
|
||||
or it may be parallel across some collection of threads
|
||||
(e.g., a block or cluster).
|
||||
|
||||
If `copy` is parallel,
|
||||
then the collection of participating threads
|
||||
may need synchronization before any thread in the collection
|
||||
may assume that the copy operation has completed.
|
||||
For example, if the participating threads form a thread block,
|
||||
then users must invoke `__syncthreads()`
|
||||
or the Cooperative Groups equivalent
|
||||
before they may use the results of `copy`.
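
A sketch of that pattern, assuming each thread has already been given
its own `thread_gmem_slice` and `thread_smem_slice` partitions, looks like this.

```c++
copy(thread_gmem_slice, thread_smem_slice);  // each thread copies its partition into shared memory
__syncthreads();                             // the whole block waits before reading the shared tile
```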
|
||||
|
||||
The `copy` algorithm may use asynchronous copy instructions,
|
||||
such as `cp.async`, or its C++ interface `memcpy_async`.
|
||||
In that case, users will need to perform
|
||||
the additional synchronization appropriate to that underlying implementation
|
||||
before they may use the results of the `copy` algorithm.
|
||||
[The CuTe GEMM tutorial example](../../../examples/cute/tutorial/sgemm_nt_1.cu)
|
||||
shows one such synchronization method.
|
||||
More optimized GEMM implementations use pipelining techniques
|
||||
to overlap asynchronous `copy` operations with other useful work.
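
For example, assuming the `cp.async` path is selected,
one possible synchronization pattern uses CuTe's `cp_async_fence` and `cp_async_wait`
(the tensor names here are placeholders).

```c++
copy(gmem_slice, smem_slice);  // may lower to cp.async on architectures that support it
cp_async_fence();              // commit the issued asynchronous copies
cp_async_wait<0>();            // wait until all committed copies have completed
__syncthreads();               // then make the shared-memory tile visible to the block
```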
|
||||
|
||||
### A generic copy implementation
|
||||
|
||||
A simple example of a generic `copy` implementation
|
||||
for any two `Tensor`s looks like this.
|
||||
|
||||
```c++
|
||||
template <class TA, class ALayout,
|
||||
class TB, class BLayout>
|
||||
CUTE_HOST_DEVICE
|
||||
void
|
||||
copy(Tensor<TA, ALayout> const& src, // Any logical shape
|
||||
Tensor<TB, BLayout> & dst) // Any logical shape
|
||||
{
|
||||
for (int i = 0; i < size(src); ++i) {
|
||||
dst(i) = src(i);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This generic `copy` algorithm addresses both `Tensor`s
|
||||
with 1-D logical coordinates, thus traversing both `Tensor`s
|
||||
in a logical column-major order.
|
||||
Some reasonable architecture-independent optimizations
|
||||
would include the following.
|
||||
|
||||
1. If the two `Tensor`s have known memory spaces with optimized
|
||||
access instructions (like `cp.async`), then dispatch to the
|
||||
custom instruction.
|
||||
|
||||
2. If the two `Tensor`s have static layouts and it can be proven
|
||||
that element vectorization is valid -- for example, four `LDS.32`s
|
||||
can be combined into a single `LDS.128` -- then vectorize the source
|
||||
and destination tensors.
|
||||
|
||||
3. If possible, validate that the copy instruction to be used is
|
||||
appropriate for the source and destination tensors.
|
||||
|
||||
CuTe's optimized copy implementations can do all of these.
|
||||
|
||||
## `copy_if`
|
||||
|
||||
CuTe's `copy_if` algorithm lives in the same header as `copy`,
|
||||
[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp).
|
||||
The algorithm takes source and destination `Tensor` parameters like `copy`,
|
||||
but it also takes a "predication `Tensor`"
|
||||
with the same shape as the input and output.
|
||||
Elements of the source `Tensor` are only copied
|
||||
if the corresponding predication `Tensor` element is nonzero.
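
A minimal sketch of using `copy_if` for bounds checking might look like this,
assuming `src` has a fully static shape so the predication `Tensor`
can live in registers (the in-bounds test is a placeholder).

```c++
Tensor pred = make_tensor<bool>(shape(src));   // predication Tensor, same shape as src and dst
for (int i = 0; i < size(pred); ++i) {
  pred(i) = true;                              // placeholder: set to "element i is in bounds"
}
copy_if(pred, src, dst);                       // copy only where pred is nonzero
```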
|
||||
|
||||
For details on why and how to use `copy_if`,
|
||||
please refer to the
|
||||
["predication" section of the tutorial](./0y_predication.md).
|
||||
|
||||
## `gemm`
|
||||
|
||||
### What `gemm` computes
|
||||
|
||||
The `gemm` algorithm takes three `Tensor`s, A, B, and C.
|
||||
What it does depends on the number of modes
|
||||
that its `Tensor` parameters have.
|
||||
We express these modes using letters.
|
||||
|
||||
* V indicates a "vector," a mode of independent elements.
|
||||
|
||||
* M and N indicate the number of rows and columns, respectively,
|
||||
of the matrix result C of the BLAS's GEMM routine.
|
||||
|
||||
* K indicates the "reduction mode" of GEMM,
|
||||
that is, the mode along which GEMM sums.
|
||||
Please see the [GEMM tutorial](./0x_gemm_tutorial.md) for details.
|
||||
|
||||
We list the modes of the input `Tensor`s A and B,
|
||||
and the output `Tensor` C,
|
||||
using a notation `(...) x (...) => (...)`.
|
||||
The two leftmost `(...)` describe A and B (in that order),
|
||||
and the `(...)` to the right of the `=>` describes C.
|
||||
|
||||
1. `(V) x (V) => (V)`. The element-wise product of vectors: C<sub>v</sub> += A<sub>v</sub> B<sub>v</sub>. Dispatches to FMA or MMA.
|
||||
|
||||
2. `(M) x (N) => (M,N)`. The outer product of vectors: C<sub>mn</sub> += A<sub>m</sub> B<sub>n</sub>. Dispatches to (4) with V=1.
|
||||
|
||||
3. `(M,K) x (N,K) => (M,N)`. The product of matrices: C<sub>mn</sub> += A<sub>mk</sub> B<sub>nk</sub>. Dispatches to (2) for each K.
|
||||
|
||||
4. `(V,M) x (V,N) => (V,M,N)`. The batched outer product of vectors: C<sub>vmn</sub> += A<sub>vm</sub> B<sub>vn</sub>. Optimizes for register reuse and dispatches to (1) for each M, N.
|
||||
|
||||
5. `(V,M,K) x (V,N,K) => (V,M,N)`. The batched product of matrices: C<sub>vmn</sub> += A<sub>vmk</sub> B<sub>vnk</sub>. Dispatches to (4) for each K.
|
||||
|
||||
Please refer to the [GEMM tutorial](./0x_gemm_tutorial.md)
|
||||
for an overview of CuTe's convention for ordering the modes.
|
||||
For example, if K appears, it always appears rightmost ("outermost").
|
||||
If V appears, it always appears leftmost ("innermost").
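
As a brief sketch of case (3) above, the three-`Tensor` overload
accumulates into C; all `Tensor`s here are register-backed for simplicity.

```c++
Tensor A = make_tensor<float>(make_shape(Int<8>{}, Int<4>{}));  // (M,K)
Tensor B = make_tensor<float>(make_shape(Int<8>{}, Int<4>{}));  // (N,K)
Tensor C = make_tensor<float>(make_shape(Int<8>{}, Int<8>{}));  // (M,N)
gemm(A, B, C);   // C(m,n) += sum over k of A(m,k) * B(n,k)
```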
|
||||
|
||||
### Dispatch to optimized implementations
|
||||
|
||||
Just like with `copy`, CuTe's implementations of `gemm`
|
||||
use its `Tensor` arguments' types to dispatch
|
||||
to an appropriately optimized implementation.
|
||||
Also like `copy`, `gemm` takes an optional `MMA_Atom` parameter
|
||||
that lets callers override the default `FMA` instruction
|
||||
that CuTe would select based on the `Tensor` arguments' types.
|
||||
|
||||
For more information on `MMA_Atom` and on specialization of `gemm`
|
||||
for different architectures, please refer to the
|
||||
[MMA section of the tutorial](./0t_mma_atom.md).
|
||||
|
||||
## `axpby`
|
||||
|
||||
The `axpby` algorithm lives in the header file
|
||||
[`include/cute/algorithm/axpby.hpp`](../../../include/cute/algorithm/axpby.hpp).
|
||||
It assigns to $y$ the result of $\alpha x + \beta y$,
|
||||
where $\alpha$ and $\beta$ are scalars and $x$ and $y$ are `Tensor`s.
|
||||
The name stands for "Alpha times X Plus Beta times Y,"
|
||||
and is a generalization of the original BLAS "AXPY" routine
|
||||
("Alpha times X Plus Y").
|
||||
|
||||
## `fill`
|
||||
|
||||
The `fill` algorithm lives in the header file
|
||||
[`include/cute/algorithm/fill.hpp`](../../../include/cute/algorithm/fill.hpp).
|
||||
It overwrites the elements of its `Tensor` output argument
|
||||
with a given scalar value.
|
||||
|
||||
## `clear`
|
||||
|
||||
The `clear` algorithm lives in the header file
|
||||
[`include/cute/algorithm/clear.hpp`](../../../include/cute/algorithm/clear.hpp).
|
||||
It overwrites the elements of its `Tensor` output argument with zeros.
|
||||
|
||||
## Other algorithms
|
||||
|
||||
CuTe provides other algorithms.
|
||||
Their header files can be found in the
|
||||
[`include/cute/algorithm`](../../../include/cute/algorithm)
|
||||
directory.
|
||||
# CuTe's support for Matrix Multiply-Accumulate instructions
|
||||
|
||||
In this file, we explain in detail how we support our GPUs'
|
||||
Matrix Multiply-Accumulate (MMA) hardware instructions in CuTe.
|
||||
|
||||
MMAs are architecture-specific.
|
||||
Different generations of GPU architectures
|
||||
introduce different sets of MMA instructions.
|
||||
However, CuTe features such as `Layout`
|
||||
make it possible to expose MMAs for use in generic CUDA C++ code.
|
||||
We do this in two steps.
|
||||
|
||||
1. We wrap each MMA's PTX instruction in an "Operation" struct.
|
||||
|
||||
2. For each Operation struct, we define a "Traits" struct
|
||||
that defines all of the meta-information needed to use the Operation.
|
||||
|
||||
## CuTe MMA Atoms
|
||||
|
||||
CuTe exposes each MMA to generic CUDA C++ code as a pair of structs:
|
||||
an "Operation" struct,
|
||||
and an `MMA_Traits` struct templated on the Operation struct type.
|
||||
|
||||
An "Operation" struct exposes the PTX instruction
|
||||
for that specific operation.
|
||||
It defines the arguments and interface it expects.
|
||||
Operation structs have minimal software dependencies --
|
||||
they do not use layouts, tensors, or non-standard numeric data types.
|
||||
Different structs have different names
|
||||
that describe what the MMA instruction does.
|
||||
We will explain the naming scheme below.
|
||||
|
||||
A corresponding `MMA_Traits` struct specialization
|
||||
defines meta-information about the Operation,
|
||||
such as the compute types, the logical shape of the operation,
|
||||
and the `Layout`s of threads and values within the operation.
|
||||
The `MMA_Traits` struct takes the Operation as a template parameter.
|
||||
CuTe specializes `MMA_Traits` for each Operation type that it supports.
|
||||
|
||||
Together, these two types comprise an "Atom" that decouples the complexity of thread and data layouts from the call site of the PTX instruction. The Atom's Traits struct exposes information that is relevant to a single MMA operation, no matter the granularity at which it operates.
|
||||
|
||||
CuTe MMA atoms expose the semantics of a single MMA operation.
|
||||
This is true regardless of the hardware level at which the MMA operates.
|
||||
CuTe supports MMA atoms that operate at a variety of hardware levels,
|
||||
including
|
||||
|
||||
* a single thread (e.g., fused multiply-add (FMA) instruction);
|
||||
|
||||
* a quadpair (Volta);
|
||||
|
||||
* a single warp (Ampere); and
|
||||
|
||||
* a warpgroup (Hopper).
|
||||
|
||||
### Operation structs
|
||||
|
||||
#### Location of files
|
||||
|
||||
CuTe provides its Operations structs in the
|
||||
[`include/cute/arch`](../../../include/cute/arch)
|
||||
directory, in header files starting with `mma`.
|
||||
|
||||
#### Operation struct's name
|
||||
|
||||
A CuTe Operation struct's name encodes information about
|
||||
|
||||
* its first supported architecture,
|
||||
|
||||
* the M, N, and K dimensions that it accepts,
|
||||
|
||||
* the types that it takes, and
|
||||
|
||||
* the expected A and B layouts.
|
||||
|
||||
For example, the Volta section below will refer to the
|
||||
`SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in
|
||||
[`include/cute/arch/mma_sm70.hpp`](../../../include/cute/arch/mma_sm70.hpp).
|
||||
|
||||
* "SM70" refers to Volta.
|
||||
|
||||
* "8x8x4" refers to M = 8, N = 8, and K = 4,
|
||||
the dimensions of the MMA operation that the quadpair performs
|
||||
(see below).
|
||||
|
||||
* "F32F16F16F32" refers to the element types
|
||||
of the four matrix operands A, B, C, and D.
|
||||
An MMA computes D = C + A * B,
|
||||
so we read the types from left to right:
|
||||
D is F32 (`float`), A is F16 (half),
|
||||
B is F16 (half), and C is F32 (`float`).
|
||||
|
||||
* "NT" means that A is M-major (not transposed)
|
||||
and B is N-major (transposed).
|
||||
|
||||
#### Contents
|
||||
|
||||
An Operation struct has the following members.
|
||||
|
||||
##### Type aliases
|
||||
|
||||
An Operation struct has four public type aliases:
|
||||
`DRegisters`, `ARegisters`, `BRegisters`, and `CRegisters`.
|
||||
For example, the `SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in
|
||||
[`include/cute/arch/mma_sm70.hpp`](../../../include/cute/arch/mma_sm70.hpp)
|
||||
defines these as follows.
|
||||
|
||||
```c++
|
||||
using DRegisters = float[8];
|
||||
using ARegisters = uint32_t[2];
|
||||
using BRegisters = uint32_t[2];
|
||||
using CRegisters = float[8];
|
||||
```
|
||||
|
||||
This shows how many values each thread will pass into the PTX instruction
|
||||
for each of the matrices A, B, C, and D. For this Operation,
|
||||
each thread passes 8 F32 values each for C and D (hence `float[8]`),
|
||||
and 4 F16 values each for A and B (hence `uint32_t[2]`;
|
||||
the instruction packs two 16-bit F16 values
|
||||
in each of the two 32-bit `uint32_t` values).
|
||||
|
||||
##### `fma` static member device function
|
||||
|
||||
An Operation struct defines a public `static void fma` function.
|
||||
It is marked with the `CUTE_HOST_DEVICE` macro,
|
||||
which adds the `__host__ __device__` annotations.
|
||||
Different Operations define `fma` to take different numbers of arguments,
|
||||
depending on the PTX MMA instruction.
|
||||
The implementation protects use of the PTX instruction with a macro,
|
||||
and raises an `assert` if `fma` is called when the macro is not defined.
|
||||
This ensures that tests and examples that use this Operation in an Atom
|
||||
can still compile, even if the PTX instruction is not available.
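
For a sense of the pattern, here is a complete scalar example in the same style,
modeled on CuTe's `UniversalFMA` (the struct name here is hypothetical).
Hardware MMA Operations look similar, but their `fma` wraps inline PTX
guarded by an architecture macro, as described above.

```c++
// Hypothetical scalar Operation struct, for illustration only.
struct MyScalarFMA {
  using DRegisters = float[1];
  using ARegisters = float[1];
  using BRegisters = float[1];
  using CRegisters = float[1];

  CUTE_HOST_DEVICE static void
  fma(float& d, float const& a, float const& b, float const& c) {
    d = a * b + c;   // the scalar "MMA": one multiply-accumulate per call
  }
};
```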
|
||||
|
||||
### Traits
|
||||
|
||||
#### Location of files
|
||||
|
||||
CuTe provides its Traits structs in the
|
||||
[`include/cute/atom`](../../../include/cute/atom)
|
||||
directory, in header files starting with `mma_traits`.
|
||||
|
||||
#### Contents
|
||||
|
||||
An `MMA_Traits` specialization defines the following public type aliases.
|
||||
|
||||
* `ElementDVal`: Compute type of the D matrix
|
||||
|
||||
* `ElementAVal`: Compute type of the A matrix
|
||||
|
||||
* `ElementBVal`: Compute type of the B matrix
|
||||
|
||||
* `ElementCVal`: Compute type of the C matrix
|
||||
|
||||
* `Shape_MNK`: Logical MxNxK shape of the MMA operation
|
||||
|
||||
* `ThrID`: Logical thread mapping within the single MMA operation
|
||||
(specifying the quadpair, warp, or warpgroup view)
|
||||
|
||||
* `ALayout`: Mapping of (thread,value) pairs to the logical MxK A matrix
|
||||
|
||||
* `BLayout`: Mapping of (thread,value) pairs to the logical NxK B matrix
|
||||
|
||||
* `CLayout`: Mapping of (thread,value) pairs to the logical MxN C matrix
|
||||
|
||||
#### Example
|
||||
|
||||
The specialization of MMA_Traits for the
|
||||
`SM70_8x8x4_F32F16F16F32_NT` Operation lives in the header file
|
||||
[`include/cute/atom/mma_traits_sm70.hpp`](../../../include/cute/atom/mma_traits_sm70.hpp).
|
||||
It looks like this.
|
||||
|
||||
```c++
|
||||
template <>
|
||||
struct MMA_Traits<SM70_8x8x4_F32F16F16F32_NT>
|
||||
{
|
||||
using ElementDVal = float;
|
||||
using ElementAVal = half_t;
|
||||
using ElementBVal = half_t;
|
||||
using ElementCVal = float;
|
||||
|
||||
using Shape_MNK = Shape<_8,_8,_4>;
|
||||
using ThrID = SM70_QuadPair;
|
||||
using ALayout = SM70_8x4_Col;
|
||||
using BLayout = SM70_8x4_Col;
|
||||
using CLayout = SM70_8x8_32b;
|
||||
};
|
||||
```
|
||||
|
||||
The next section will explain these type aliases in detail.
|
||||
|
||||
## Volta
|
||||
|
||||
This and the following sections show examples of how to construct MMA atoms.
|
||||
We don't try to explain this for all GPU architectures and MMAs.
|
||||
Instead, we use selected examples to illustrate the process
|
||||
of developing new atoms.
|
||||
|
||||
Volta architecture implements an HMMA instruction where a group of 8 threads called a quadpair (QP) collaborate to share data and perform an 8x8x4 (fp32 or fp16) matrix multiply-accumulate. (since a warp is 32 threads wide, it would perform an MMA across 4 QPs for a tile size of 16x16x4).
|
||||
|
||||
We first take a look at how we would take the ISA semantics of thread and data partitioning for the HMMA instruction, and encode it in a Traits struct. The HMMA NT instruction has the thread-data layout:
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
|
||||
</p>
|
||||
|
||||
### Types
|
||||
|
||||
The HMMA NT above uses types:
|
||||
|
||||
```cpp
|
||||
using ElementDVal = float;
|
||||
using ElementAVal = half_t;
|
||||
using ElementBVal = half_t;
|
||||
using ElementCVal = float;
|
||||
```
|
||||
|
||||
The rest of the `MMA_Traits` will be described in units of these types.
|
||||
|
||||
### Shape
|
||||
|
||||
The HMMA NT above has shape 8x8x4:
|
||||
|
||||
```cpp
|
||||
// Logical shape of the MMA
|
||||
using Shape_MNK = Shape <_8,_8,_4>;
|
||||
```
|
||||
|
||||
### Thread ID
|
||||
|
||||
If the 32 threads in a warp are logically indexed by [0 ... 31], then the above image contains threads [0,1,2,3]U[16,17,18,19]. These threads make up the 0th quadpair. We can write a thread mapping that maps eight logical thread ids [0,1,2,3,4,5,6,7] of the MMA to a quadpair thread index [0,1,2,3]U[16,17,18,19] of a warp. The layout function has 4 elements with a stride of 1 and 2 of those with a stride of 16. With this, we write a layout that represents a quadpair:
|
||||
|
||||
```cpp
|
||||
// Mapping from (logical thread id) -> (thread idx)
|
||||
using ThrID = Layout<Shape <_4, _2>,
|
||||
Stride<_1,_16>>;
|
||||
```
|
||||
|
||||
Again, this layout function maps the logical thread id [0,8) of the MMA operation onto the quadpair thread index [0,4)U[16,20) of a warp.
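
A quick check makes this concrete (the printing is only for illustration):

```cpp
auto thr = ThrID{};   // Shape (_4,_2), Stride (_1,_16)
for (int t = 0; t < size(thr); ++t) {
  printf("logical thread %d -> warp thread index %d\n", t, int(thr(t)));
}
// prints: 0 1 2 3 16 17 18 19
```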
|
||||
|
||||
### Accumulator Mapping
|
||||
|
||||
Let us look at exactly how the 8 threads within a QP are mapped to the A, B and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is shown the whole QP level view, and on the right is shown the values owned by just thread 0.
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
|
||||
</p>
|
||||
|
||||
The metainformation of this single instruction level view is what we want to encode in CuTe. Specifically, the QP level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](../../../include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and `(m, n)` coordinate in the C matrix which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.
|
||||
|
||||
We can start constructing a `CLayout` from the picture above. As with any CuTe layout, it is a pair of `Shape` and corresponding `Stride`. Let us just look at the shape for now. We know that the HMMA uses 8 threads each of which own 8 values. Therefore, the shape of our mapping must have a size of 8 along two modes. With this, we have
|
||||
|
||||
```cpp
|
||||
// (T8,V8) -> (m,n)
|
||||
using CLayout = Layout<Shape <_8, _8>,
|
||||
                       Stride<_?, _?>>; // Stride to be filled in below
|
||||
```
|
||||
|
||||
This is not to be confused with the logical 8x8 shape of the C matrix. This is 8-threads by 8-values. We now want to map those to (m,n) coordinates. Since CuTe layouts return indices rather than coordinates, we choose a column-major encoding of the (m,n) coordinates:
|
||||
|
||||
```
|
||||
(logical_thr_id, logical_val_id) -> (m, n) == m + n * M
|
||||
```
|
||||
|
||||
With this in place, we can start thinking about how to construct the strides in `CLayout`. Let's begin by looking at the strides between threads. Note that
|
||||
* `(T0,V0)` is located at `(m,n) = (0,0) = 0`
|
||||
* `(T1,V0)` is located at `(m,n) = (1,0) = 1`
|
||||
* `(T2,V0)` is located at `(m,n) = (0,2) = 16`
|
||||
* `(T3,V0)` is located at `(m,n) = (1,2) = 17`
|
||||
* `(T4,V0)` is located at `(m,n) = (4,0) = 4`
|
||||
* `(T5,V0)` is located at `(m,n) = (5,0) = 5`
|
||||
* `(T6,V0)` is located at `(m,n) = (4,2) = 20`
|
||||
* `(T7,V0)` is located at `(m,n) = (5,2) = 21`
|
||||
|
||||
where `T4`,`T5`,`T6`,`T7` are the 4th,5th,6th,7th logical thread id of the MMA corresponding to thread indices of 16,17,18,19 of the warp (recorded in the `ThrID` mapping!).
|
||||
|
||||
We note that the pattern can be transcribed to a layout. We can find the position of the 8 threads via
|
||||
|
||||
```cpp
|
||||
using CLayout = Layout<Shape <Shape <_2, _2, _2>, _8>,
|
||||
                       Stride<Stride<_1, _16, _4>, _?>>;
|
||||
```
|
||||
|
||||
With the exact same approach, we can construct the stride along the `logical value id` mode.
|
||||
* `(T0,V0)` is located at `(m,n) = (0,0) = 0`
|
||||
* `(T0,V1)` is located at `(m,n) = (0,1) = 8`
|
||||
* `(T0,V2)` is located at `(m,n) = (2,0) = 2`
|
||||
* `(T0,V3)` is located at `(m,n) = (2,1) = 10`
|
||||
* `(T0,V4)` is located at `(m,n) = (0,4) = 32`
|
||||
* `(T0,V5)` is located at `(m,n) = (0,5) = 40`
|
||||
* `(T0,V6)` is located at `(m,n) = (2,4) = 34`
|
||||
* `(T0,V7)` is located at `(m,n) = (2,5) = 42`
|
||||
|
||||
We note that this pattern can also be transcribed to a layout. We can find the position of the 8 values via
|
||||
|
||||
```cpp
|
||||
// (T8,V8) -> (m,n)
|
||||
using CLayout = Layout<Shape <Shape <_2, _2,_2>, Shape <_2,_2, _2>>,
|
||||
Stride<Stride<_1,_16,_4>, Stride<_8,_2,_32>>>;
|
||||
```
|
||||
|
||||
And that's all! We can verify that each `(tid,vid)` coordinate in this layout is reliably mapped to the correct (encoded) `(m,n)` coordinate.
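
One way to do that check is to decode the returned index back into `(m, n)`
(recall the encoding is `m + n * M` with `M = 8`) and print the full map:

```cpp
CLayout c_layout{};
for (int tid = 0; tid < 8; ++tid) {
  for (int vid = 0; vid < 8; ++vid) {
    int idx = c_layout(tid, vid);            // encoded (m,n) coordinate
    printf("(T%d,V%d) -> (m,n) = (%d,%d)\n", tid, vid, idx % 8, idx / 8);
  }
}
```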
|
||||
|
||||
In the case of F16 accumulators, the layout is way less complex. Each row of accumulators `(m, :)` is held by a single thread, which makes the layout:
|
||||
|
||||
```cpp
|
||||
using CLayout = Layout<Shape <_8,_8>,
|
||||
Stride<_1,_8>>;
|
||||
```
|
||||
|
||||
### A and B Layout Mapping
|
||||
|
||||
A and B matrix layouts depend on whether the sources are transposed or not. The diagram below shows the thread ID to data ownership map for A and B matrices in the case of NT and TN transposes.
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
|
||||
</p>
|
||||
|
||||
Let's look at the TN layout for the A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each thread owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping to `(m, k) == m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)`, which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is:
|
||||
|
||||
```cpp
|
||||
// (T8,V4) -> (m,k)
|
||||
using ALayout = Layout<Shape <_8,_4>,
|
||||
Stride<_1,_8>>;
|
||||
```
|
||||
|
||||
The source B layout is constructed similarly for the TN HMMA, except that we want to write it as `(N,K)` rather than `(K,N)` for convenience. For the strides, as we go across the `N` mode, we go from `(T0, V0)` to `(T1, V0)`, making this a stride of 1 for all 8 threads. As we go down the `K` mode, we go from `(T0, V0)` to `(T0, V1)`, which is a stride of 8 for all 4 values. So the B layout is the same as A:
|
||||
|
||||
```cpp
|
||||
// (T8,V4) -> (n,k)
|
||||
using BLayout = Layout<Shape <_8,_4>,
|
||||
Stride<_1,_8>>;
|
||||
```
|
||||
|
||||
The layouts in the case of NT are a bit more complicated (left side of the diagram). Going down the `M` mode of `A`, we see the four values of `T0` first and then we see the four values of `T4`. This means we first have a stride of 1 for 4 values, followed by a stride of 4 from `T0` to `T4`. So we have two sub-strides along the `M` mode. For the `K` mode, as we go across, we simply increment the `thr_id`, keeping `val_id` the same, making the stride 8 for 4 threads. This makes the A layout:
|
||||
|
||||
```cpp
|
||||
// (T8,V4) -> (m,k)
|
||||
using ALayout = Layout<Shape <Shape <_4,_2>,_4>,
|
||||
Stride<Stride<_8,_4>,_1>>;
|
||||
```
|
||||
|
||||
With the `(N,K)` ordering for B, the layout is the same.
|
||||
|
||||
```cpp
|
||||
// (T8,V4) -> (n,k)
|
||||
using BLayout = Layout<Shape <Shape <_4,_2>,_4>,
|
||||
Stride<Stride<_8,_4>,_1>>;
|
||||
```
|
||||
|
||||
For the NN and TT transposes, they are simply combinations of the two layouts we have seen for A and B so far.
|
||||
|
||||
## Hopper
|
||||
|
||||
Now, we are ready to take a look at the much larger GMMA operation (Group MMA) first introduced with Hopper architecture. These MMA instructions operate at the granularity of 128 threads (4 warps), which are collectively referred to as a warpgroup.
|
||||
|
||||
### Thread ID
|
||||
|
||||
In the case of Hopper GMMAs, the thread IDs are assigned based on a simple 1-D contiguous layout, which makes `ThrID` trivial:
|
||||
|
||||
```cpp
|
||||
using ThrID = Layout<_128, _1>;
|
||||
```
|
||||
|
||||
### Accumulator Mapping
|
||||
|
||||
Accumulators are mapped hierarchically in GMMA, starting from the concept of a core matrix and building up to a layout for the whole C matrix tile. Let's look at this core matrix first. We only consider fp16 accumulators here, but the extension to fp32 accumulators is trivial, as we will see later.
|
||||
|
||||
Each core matrix has the layout as shown in the diagram below.
|
||||
<p align="center">
|
||||
<img src="../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
|
||||
</p>
|
||||
|
||||
As in the Volta examples, the thread IDs are logical only, and which of the four warps they belong to in the warpgroup is not important.
|
||||
|
||||
Then GMMA tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below.
|
||||
|
||||
<p align="center">
|
||||
<img src="../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
|
||||
</p>
|
||||
|
||||
With this image, we are again ready to start building the `CLayout` for `SM90_64x128x16_F16F16F16F16_TN` atom. Same as before, we are constructing a mapping between the `(logical_thr_id, logical_val_id) -> (m, n)` coordinate spaces.
|
||||
|
||||
To begin, let's follow the first few threads and values. We immediately see that they are arranged along the `N`-mode with pairs of values and four threads. This gives us
|
||||
|
||||
```cpp
|
||||
// (T128,V4) -> (M64,N8)
|
||||
using CLayout = Layout<Shape <Shape < _4, ...>, Shape < _2, ...>>,
|
||||
Stride<Stride<_128, ...>, Stride<_64, ...>>>;
|
||||
```
|
||||
|
||||
To complete the first 8x8 core matrix, the four threads repeat eight times down the `M`-mode:
|
||||
|
||||
```cpp
|
||||
// (T128,V4) -> (M64,N8)
|
||||
using CLayout = Layout<Shape <Shape < _4, _8, ...>, Shape < _2, ...>>,
|
||||
Stride<Stride<_128, _1, ...>, Stride<_64, ...>>>;
|
||||
```
|
||||
|
||||
Then, as we go to the next core matrix, we wrap back again to `T0`, but this time to `(T0, V2)`.
|
||||
|
||||
```cpp
|
||||
// (T128,V4) -> (M64,N8)
|
||||
using CLayout = Layout<Shape <Shape < _4, _8, ...>, Shape < _2, _2>>,
|
||||
Stride<Stride<_128, _1, ...>, Stride<_64, _8>>>;
|
||||
```
|
||||
|
||||
Finally, we get this entire pattern repeating four times, once for each warp, down the `M`-mode starting at `(m,n) = (16,0) = 16`, where two core matrices that belong to the same warp are stacked on top of each other. This makes the final sub-mode of M have size 4. As for the stride, this time we go to `(T32, V0)`, which makes it a stride of 16.
|
||||
|
||||
```cpp
|
||||
// (T128,V4) -> (M64,N8)
|
||||
using CLayout = Layout<Shape <Shape < _4, _8, _4>, Shape < _2, _2>>,
|
||||
Stride<Stride<_128, _1, _16>, Stride<_64, _8>>>;
|
||||
```
|
||||
|
||||
This is the full `CLayout` for 64x8 accumulators. The GMMA instructions include 64xN variants with `N = [16,32,64,128,256]` where this 64x8 pattern is repeated giving each thread additional values. As this starts at `(m,n) = (0,8) = 512`, this is easy to account for in our `CLayout`. For example, the 64x128 `CLayout` is
|
||||
|
||||
```cpp
|
||||
// (T128,V64) -> (M64,N128)
|
||||
using CLayout = Layout<Shape <Shape < _4, _8, _4>, Shape < _2, _2, _16>>,
|
||||
Stride<Stride<_128, _1, _16>, Stride<_64, _8, _512>>>;
|
||||
```
|
||||
|
||||
where we see 16 copies of the 64x8 tile.
|
||||
|
||||
### A and B Layout Mapping
|
||||
|
||||
GMMA atoms that consume A and B sources directly from shared memory are a bit interesting. The GMMA Descriptor is constructed on an entire tile of A and/or B data in shared memory rather than being partitioned by threads. That is, every thread sees the entire tile of data, and the tile is not reordered so that the descriptor can be constructed on it. In `ALayout` form, this can be expressed as
|
||||
|
||||
```cpp
|
||||
// (T128,V64x8) -> (M64,K16)
|
||||
using ALayout = Layout<Shape <_128, Shape <_64,_16>>,
|
||||
Stride< _0, Stride< _1,_64>>>;
|
||||
```
|
||||
|
||||
That is, all threads are mapped to the `(m,k) = (0,0) = 0` element and the values (and the shape of the values) remain unchanged. The GMMA Descriptor Constructor can then inspect the `(M,K)` layout of this data and create an appropriate GMMA Descriptor, or produce an error message saying the data is in an invalid layout for GMMA.
|
||||
# CuTe dense matrix-matrix multiply tutorial
|
||||
|
||||
This section uses the CuTe functionality to write
|
||||
a dense matrix-matrix multiply implementation.
|
||||
|
||||
## A simple dense matrix-matrix multiply example
|
||||
|
||||
In this section, we will go through
|
||||
[this example](../../../examples/cute/tutorial/sgemm_nt_1.cu).
|
||||
It illustrates a blocked GPU implementation of GEMM
|
||||
that uses the building blocks of CuTe
|
||||
to construct global and shared memory layout mappings
|
||||
and partition threads among them.
|
||||
This example is closest to the blocked GEMM
|
||||
that a computer science student might be asked to implement
|
||||
in a first-year graduate school
|
||||
or upper-division undergraduate scientific computing course.
|
||||
|
||||
Readers who understand this section may also wish to study
|
||||
CUTLASS's implementation of the stream-K GEMM algorithm,
|
||||
which uses many features of CuTe.
|
||||
|
||||
### Filename and high-level interface
|
||||
|
||||
First, let's look at the example's filename `sgemm_nt_1.cu`.
|
||||
"SGEMM" is the BLAS (Basic Linear Algebra Subroutines) abbreviation
|
||||
for "Single-precision real, GEneral, Matrix-matrix Multiply."
|
||||
(If we want to refer to matrix-matrix multiply for all data types,
|
||||
we say "GEMM.")
|
||||
The BLAS project started in the 1970s.
|
||||
You can learn more about its history in Turing Award winner Jack Dongarra's
|
||||
2004 Oral History interview by SIAM
|
||||
(the Society for Industrial and Applied Mathematics),
|
||||
and also in the C++ Standard document [P1417](https://wg21.link/p1417).
|
||||
The abbreviation SGEMM unpacks as follows.
|
||||
|
||||
* "Single-precision" is Fortran-speak for float.
|
||||
The BLAS supports four different matrix or vector element types:
|
||||
|
||||
* S for single precision (`float`),
|
||||
|
||||
* D for double precision (`double`),
|
||||
|
||||
* C for complex float (like C++'s `std::complex<float>`,
|
||||
where each of the real and imaginary components has type `float`),
|
||||
and
|
||||
|
||||
* Z for complex double (like C++'s `std::complex<double>`).
|
||||
|
||||
* "GEneral" means that the matrix is represented
|
||||
as a two-dimensional dense array
|
||||
and not assumed to have any kind of symmetry.
|
||||
The BLAS supports a variety of matrix representations,
|
||||
including
|
||||
|
||||
* SY: SYmmetric,
|
||||
|
||||
* HE: HErmitian,
|
||||
|
||||
* TR: TRiangular,
|
||||
|
||||
* GB: General Banded,
|
||||
|
||||
* SB: Symmetric Banded,
|
||||
|
||||
* SP: Symmetric Packed, and
|
||||
|
||||
* TP: Triangular Packed.
|
||||
|
||||
* MM means "Matrix-matrix multiply," as opposed to other operations,
|
||||
like MV (Matrix-Vector multiply).
|
||||
|
||||
The string "nt" in the filename means that
|
||||
the first input matrix A is "Not transposed,"
|
||||
while the second input matrix B is "Transposed."
|
||||
That is, the function computes `C := beta * C + alpha * A * B^T`,
|
||||
where the superscript T denotes the transpose of the matrix.
|
||||
(We never change the input matrix in place or
|
||||
store its entire transpose explicitly.
|
||||
Instead, we reinterpret its data in place.)
|
||||
|
||||
GEMM's TRANSA and TRANSB arguments let users specify
|
||||
the transpose or Hermitian transpose (if complex)
|
||||
of either or both input matrices A or B.
|
||||
It turns out that implementations favor this "NT" case,
|
||||
along with "TN" (A is Transposed, B is Not transposed).
|
||||
We will explain why below.
|
||||
|
||||
As described, the original BLAS GEMM specifies
|
||||
the dimensions of its matrices
|
||||
as A is M x K, B is K x N, and C is M x N.
|
||||
Out of convenience, CuTe interprets A
|
||||
as M x K, B as N x K, and C as M x N. Instead of row-major or column-major (or Transposed
|
||||
and Not-Transposed like above), we like to be more specific with M-major, N-major, or K-major.
|
||||
Regardless, we'll still use the BLAS "NT" notation for high-level descriptions
|
||||
of kernels when it's appropriate.
|
||||
|
||||
Now, let's look at the code.
|
||||
We'll start with the kernel entry point `gemm_device`
|
||||
at the top of the file.
|
||||
|
||||
```c++
|
||||
template <class MShape, class NShape, class KShape,
|
||||
class TA, class AStride, class ABlockLayout, class AThreadLayout,
|
||||
class TB, class BStride, class BBlockLayout, class BThreadLayout,
|
||||
class TC, class CStride, class CBlockLayout, class CThreadLayout,
|
||||
class Alpha, class Beta>
|
||||
__global__ static
|
||||
__launch_bounds__(decltype(size(CThreadLayout{}))::value)
|
||||
void
|
||||
gemm_device(MShape M, NShape N, KShape K,
|
||||
TA const* A, AStride dA, ABlockLayout blockA, AThreadLayout tA,
|
||||
TB const* B, BStride dB, BBlockLayout blockB, BThreadLayout tB,
|
||||
TC * C, CStride dC, CBlockLayout , CThreadLayout tC,
|
||||
Alpha alpha, Beta beta);
|
||||
```
|
||||
|
||||
There are many template parameters;
|
||||
we'll explain them all in due time.
|
||||
|
||||
`TA`, `TB`, and `TC` are the element types
|
||||
of the matrices `A`, `B`, and `C`, respectively.
|
||||
The two scalar constants `alpha` and `beta`
|
||||
are part of what GEMM computes: `C = beta * C + alpha * A * B`.
|
||||
Unlike the (traditional Fortran and C) BLAS,
|
||||
CuTe lets you mix different matrix element types and/or scalar types.
|
||||
The compiler will help, but it's somewhat up to you
|
||||
to use types that are safe and efficient on the GPU.
|
||||
For example, a custom arbitrary-precision real type
|
||||
that does dynamic allocation inside may not work on the GPU at all.
|
||||
Even if it does, it may not perform well.
|
||||
|
||||
This leaves five kinds of things to explain:
|
||||
|
||||
1. Shapes
|
||||
|
||||
2. Strides
|
||||
|
||||
3. Block layouts
|
||||
|
||||
4. Thread layouts
|
||||
|
||||
5. Launch bounds
|
||||
|
||||
### Shapes
|
||||
|
||||
The original Fortran BLAS GEMM lists the matrices' dimensions
|
||||
in the order M, N, K. CuTe also uses this convention.
|
||||
The "MShape" is just M,
|
||||
the NShape is just N,
|
||||
and the KShape is just K.
|
||||
In this example, they are dynamic (run-time) values
|
||||
defined at the top of the `gemm` host function
|
||||
that invokes the device kernel.
|
||||
|
||||
```c++
|
||||
// Define shapes (dynamic)
|
||||
auto M = int(m);
|
||||
auto N = int(n);
|
||||
auto K = int(k);
|
||||
```
|
||||
|
||||
Note that the function takes M, N, and K.
|
||||
It doesn't take the shapes of the three matrices separately,
|
||||
as (say) three different `Shape<int, int>` objects.
|
||||
This is because matrix-matrix multiply constrains the shapes.
|
||||
|
||||
There's nothing mysterious about `int` here;
|
||||
it's the usual C++ built-in integral type.
|
||||
`auto M = int(m)` is a way to say
|
||||
"convert `m` to an `int` if it's not already an `int`,
|
||||
and assign it to the freshly declared variable `M`."
|
||||
CuTe also has a capitalized `Int<Value>` templated type
|
||||
for representing values as compile-time constants.
|
||||
For example, `Int<5>` represents a compile-time `int` value 5.
|
||||
(CuTe implements these as subclasses
|
||||
of the C++ Standard Library class `std::integral_constant`.)
|
||||
The above `gemm_device` function is templated on the types
|
||||
of M, N, and K; this shows that CuTe can represent dimensions
|
||||
as either run-time or compile-time values.
|
||||
|
||||
If you're familiar with the mdspan class going into C++23,
|
||||
you might notice that CuTe represents shapes
|
||||
a bit differently from mdspan.
|
||||
mdspan uses `extents<class IndexType, size_t ... Extents>`
|
||||
to represent a shape.
|
||||
The `Extents` are zero or more compile-time values
|
||||
(see below) representing the dimensions in the shape.
|
||||
The `Extents...` are "non-type template parameters" (NTTPs) --
|
||||
that is, they are not types, but compile-time values of type `size_t`.
|
||||
If you use the special reserved `size_t` value `std::dynamic_extent`
|
||||
as an extent value,
|
||||
the resulting dimension is a run-time value
|
||||
and is stored in the `extents` instance.
|
||||
Any other extent value is a compile-time value
|
||||
that is encoded in the extents type itself.
|
||||
In contrast, CuTe represents a shape as `Shape<class ... Types>`.
|
||||
The `Types...` are actual types, not NTTPs.
|
||||
A built-in integral type like `int` or `uint64_t`
|
||||
denotes a run-time dimension that is stored in the `Shape` instance,
|
||||
while a compile-time value like `Int<5>`
|
||||
encodes a compile-time dimension.
|
||||
For example, the CuTe equivalent of
|
||||
`extents<int, 3, dynamic_extent, 5>`
|
||||
is `Shape<Int<3>, int, Int<5>>`.
|
||||
|
||||
#### Compile-time-ness of values
|
||||
|
||||
C++ values have three levels of "compile-time-ness":
|
||||
|
||||
1. dynamic (run-time) values,
|
||||
|
||||
2. constexpr values, and
|
||||
|
||||
3. static (compile-time) values.
|
||||
|
||||
(Rather than saying "C++ has,"
|
||||
it's more accurate to say "C++17 has."
|
||||
C++20 introduces `consteval` or "immediate" functions,
|
||||
which make attempting to evaluate the function at run time
|
||||
(any call not in an unevaluated context) a compiler error.
|
||||
We'll ignore those for this tutorial,
|
||||
since CuTe only requires C++17.)
|
||||
|
||||
The `constexpr` keyword was introduced in C++11.
|
||||
It means something like
|
||||
"the compiler can evaluate this expression at compile time."
|
||||
It does NOT mean "the compiler MUST evaluate this at compile time."
|
||||
If you use a `constexpr` expression in a `static_assert`
|
||||
or as a non-type template argument,
|
||||
then the compiler must evaluate the expression at compile time.
|
||||
However, for `constexpr` occurring in other places,
|
||||
the compiler may choose to store the value in registers or memory,
|
||||
and/or do computations with the value at run time.
|
||||
In some cases, the compiler must do that.
|
||||
The following example shows that the compiler
|
||||
might need to store `constexpr` values in memory sometimes.
|
||||
|
||||
```c++
|
||||
// Some function defined in a different compilation unit.
|
||||
extern int foo(int* x);
|
||||
|
||||
int bar()
|
||||
{
|
||||
constexpr int value = 42; // a compile-time constant
|
||||
|
||||
// Even constexpr variables have a sizeof,
|
||||
// because we still might need to take their address.
|
||||
static_assert(sizeof(value) == 4);
|
||||
|
||||
// Compiler can't inspect foo to see how it uses the value,
|
||||
// so it has to store the value in some memory location
|
||||
// so that we can pass its address to the function.
|
||||
return foo(&value);
|
||||
}
|
||||
```
|
||||
|
||||
"Static" is an unfortunately overloaded term in C++. Sometimes it means "the opposite of instance," like a "static function" or "static member" of a class. (Some programming languages, like Java, say "class method" to refer to a "static function of a class.") That's not what we mean here. Instead, we mean "part of a compile-time type." For example, `Int<1>` encodes the value 1 at compile time, as part of the type of a templated class `Int<Value>`. `Int<3>` and `Int<4>` have different types. You can get the value of of the type like this: `Int<3>::value`. (The `value` is a `static constexpr` member of the class, where "static" means "opposite of instance.") As soon as you go from `Int<3>` to `Int<3>::value`, you've gone from (3) above (a compile-time value) to (2) above (a `constexpr` value). In some situations, this may mean that the compiler treats it as a run-time value.
|
||||
|
||||
#### Strides
|
||||
|
||||
We define a layout using both shapes and strides.
|
||||
The shape just tells you the dimensions (modes, etc.) of the array.
|
||||
The strides tell you the mapping from a multidimensional index
|
||||
into a one-dimensional offset.
|
||||
Here, we're describing the shapes and strides
|
||||
of the "global" matrices A, B, and C.
|
||||
The example defines the global matrices' strides
|
||||
near the top of the `gemm` function.
|
||||
|
||||
```c++
|
||||
// Define strides (mixed)
|
||||
auto dA = make_stride(Int<1>{}, ldA); // (dM,dK)
|
||||
auto dB = make_stride(Int<1>{}, ldB); // (dN,dK)
|
||||
auto dC = make_stride(Int<1>{}, ldC); // (dM,dN)
|
||||
```
|
||||
|
||||
To evaluate this mapping for a given multidimensional index, take the dot product of the indices with the strides. For example, the offset of `A(index_m, index_k)` is `index_m * 1 + index_k * ldA`. Note the implications for the compile-time-ness of the offset. Any run-time value among either the shape or the strides makes the offset a run-time value. Of course, if a particular stride is a compile-time constant (especially 1), it's easier for the compiler to optimize the arithmetic and result.
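
A small sketch makes the dot product concrete
(`index_m` and `index_k` are placeholder indices):

```c++
auto layoutA = make_layout(make_shape(M, K), dA);  // shape (M,K), strides (1, ldA)
int  offset  = layoutA(index_m, index_k);          // == index_m * 1 + index_k * ldA
```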
|
||||
|
||||
Note that in the original source code,
|
||||
this example is missing the comments after each line.
|
||||
We've added them in here,
|
||||
as they invite a brief digression about shapes and modes.
|
||||
The comment after B says (dN, dK), not (dK, dN).
|
||||
This means that B is treated as an N x K matrix
|
||||
instead of a K x N matrix.
|
||||
As mentioned, CuTe follows the convention
|
||||
that the meaning of matrix modes is
|
||||
(M,K) for A, (N,K) for B, and (M,N) for C.
|
||||
In particular, CuTe's convention is that
|
||||
"the reduction mode is outermost."
|
||||
The "reduction mode" of `Shape<M, N, K>` is K.
|
||||
That's the mode over which we do a reduction,
|
||||
that is, sum up products of matrix entries.
|
||||
The K mode disappears in the output C.
|
||||
"Outermost" here means "rightmost"
|
||||
(literally, appearing rightmost in the list M, N, K).
|
||||
Note that the shapes form a kind of Einstein tensor notation.
|
||||
GEMM does Shape<M, N> = Shape<M, K> * Shape<K, N>.
|
||||
In Einstein notation, the repeated index indicates
|
||||
a sum of that term over all values of K.
|
||||
|
||||
We say in general that the leftmost mode is the "inner(most)" mode,
|
||||
and the rightmost mode is the "outer(most)" mode.
|
||||
This is because,
|
||||
along with CuTe's convention of thinking of arrays as logically column major,
|
||||
the leftmost mode is most commonly the mode with the most spatial locality.
|
||||
It's very often the "most contiguous" mode.
|
||||
For this reason, it's "the mode that we want in the innermost loop"
|
||||
(in the nesting of loops that implements GEMM).
|
||||
This is why we call it the "innermost" mode.
|
||||
Its contiguity means that we also call the innermost mode the "vector mode."
|
||||
|
||||
The vector mode also has special meaning:
|
||||
it contains all of the information needed
|
||||
to execute the smallest possible computation or communication operations
|
||||
on hardware, that is, what CuTe calls the "atoms."
|
||||
|
||||
Modes are like units conceptually.
|
||||
For example, you shouldn't mix M-mode indices with K-mode indices.
|
||||
However, CuTe does nothing to enforce this.
|
||||
(For example, CuTe does not require use of "tagged" index types.
|
||||
Indexing works with the usual integer types.)
|
||||
|
||||
The previous paragraph relates to shapes, not strides.
|
||||
Returning to the strides, the above code describes these strides as "mixed."
|
||||
This means that they include both run-time and compile-time values.
|
||||
For example, the stride between A(m, k) and A(m+1, k) is `Int<1>`,
|
||||
a compile-time value 1. The stride between A(m, k) and A(m, k+1),
|
||||
however, is `ldA`, the "leading dimension of A," a run-time value.
|
||||
The "leading dimension" of a matrix
|
||||
refers to the stride between consecutive columns of a column-major matrix
|
||||
(where the stride between consecutive rows is 1),
|
||||
or the stride between consecutive rows of a row-major matrix
|
||||
(where the stride between consecutive columns is 1).
|
||||
This is a naming convention from the BLAS
|
||||
and libraries that use it, like LAPACK.
|
||||
For the purpose of this tutorial, it's just a naming convention
|
||||
for "the stride that isn't the compile-time constant 1."
|
||||
|
||||
#### M-major, N-major, K-major
|
||||
|
||||
Note that we haven't uttered the phrases "column-major" or "row-major" here. This is where the experience of a BLAS user diverges from the experience of a BLAS implementer. BLAS users speak of "column-major" and "row-major" layouts. C++23's `mdspan` class encodes these as `layout_left` resp. `layout_right`. However, we don't speak of "column-major" or "row-major" in our GEMM implementations.
|
||||
|
||||
We say that a matrix is "M-major" if it is stride 1 in the M-mode, "N-major" if it is stride 1 in the N-mode, or "K-major" if it is stride 1 in the K-mode. In the above code, A has shape (M, K) and strides (1, ldA). Since A has stride 1 in the M mode, we say that A is "M major." B has shape (N, K) and strides (1, ldB), so B is "N-major." Similarly, C has shape (M, N) and strides (1, ldC), so C is "M major."
|
||||
|
||||
How do we translate this into the BLAS user's experience?
|
||||
The following table illustrates for B and C.
|
||||
(Throughout the table, "Impl" stands for "implementation.")
|
||||
|
||||
Note that the implementation reverses the order of B's modes,
|
||||
and flips B's strides.
|
||||
Recall that one evaluates a layout
|
||||
by taking the dot product of the indices and strides.
|
||||
Thus, reversing the order of both the modes and the strides
|
||||
does not change this evaluation.
|
||||
|
||||
| Matrix | User's shape | User's layout | User's strides | Impl layout | Impl shape | Impl strides |
|
||||
| --- | --- | --- | --- | --- | --- | --- |
|
||||
| C | M x N | Column major | (1, LDC) | M-major | (M, N) | (1, LDC) |
|
||||
| A | M x K | Column major | (1, LDA) | M-major | (M, K) | (1, LDA) |
|
||||
|
||||
What about the matrix B? We explained above that B is N-major. How would that translate back into the BLAS user's experience? We take a hint here from the filename including "nt." The "nt" part of the name means that A is not transposed, while B is transposed. The BLAS convention (see e.g., [the documentation for DGEMM](https://netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaeda3cbd99c8fb834a60a6412878226e1.html)) is that if you take the transpose, then the dimensions refer to the transpose ("with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix"). Thus, this example actually computes `C = beta * C + alpha * A * B^T`, where `B^T` is a K x N matrix with strides (LDB, 1). The user's "original" matrix B is thus N x K, with strides (1, LDB) -- that's a column-major layout. (Reversing the modes and the strides preserves the layout, since evaluating the layout mapping just takes the dot product of indices and strides.) This lets us expand the above table to include B.
|
||||
|
||||
| Matrix | Transposed? | User's shape | User's layout | User's strides | Impl layout | Impl shape | Impl strides |
|
||||
| --- | --- | --- | --- | --- | --- | --- | --- |
|
||||
| C | No | M x N | Column major | (1, LDC) | M-major | (M, N) | (1, LDC) |
|
||||
| A | No | M x K | Column major | (1, LDA) | M-major | (M, K) | (1, LDA) |
|
||||
| B | Yes | N x K | Column major | (1, LDB) | N-major | (N, K) | (1, LDB) |
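To make the mode/stride reversal for B concrete, here is a hedged sketch (`K`, `N`, and `ldB` are run-time values): both views below address exactly the same elements, because evaluating a layout is just the dot product of a coordinate with the strides.

```c++
// The BLAS call's op(B) = B^T: a K x N view with strides (ldB, 1).
auto opB   = make_layout(make_shape(K, N), make_stride(ldB, Int<1>{}));
// The implementation's N x K view with strides (1, ldB).
auto implB = make_layout(make_shape(N, K), make_stride(Int<1>{}, ldB));
// opB(k, n) == implB(n, k) for all valid (n, k): reversing modes and strides
// together leaves the computed offset unchanged.
```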
|
||||
|
||||
CuTe developers say: "In CuTe, you can't tell transposed
|
||||
apart from non-transposed, MN-major from K-major, etc.
|
||||
without inspecting the strides."
|
||||
It's now a bit more clear what that means.
|
||||
CuTe doesn't see whether A or B are transposed.
|
||||
Instead, CuTe sees shapes and strides.
|
||||
A CuTe developer must reason backwards from the shapes and strides
|
||||
in order to see what the BLAS user sees.
|
||||
|
||||
Why does CuTe do this? Consider that matrix multiply performs a reduction in the K-mode. From the user's perspective, it's reducing across rows of the first input matrix, but across columns of the second input matrix. If we instead mentally flip the modes of the first input matrix, then the implementation reduces over columns (the K mode) of both input matrices. This leads to two cases in which the implementation can effectively treat both input matrices in the same way. (If you call it with A and B reversed, it should even give the same results for these cases.)
|
||||
|
||||
| Case | User asks for A | User asks for B | Abbreviation |
|
||||
| --- | --- | --- | --- |
|
||||
| A is M major, B is N major | Not transposed | Transposed | NT |
|
||||
| A and B are both K major | Transposed | Not transposed | TN |
|
||||
|
||||
This is why an introductory example starts with NT or TN.
|
||||
For a summary of the four different transpose options for A and B,
|
||||
and their corresponding implementation layouts,
|
||||
please see the table below.
|
||||
|
||||
| Transpose abbreviation | User sees A transposed? | User sees B transposed? | A's impl layout | B's impl layout |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| NT | No | Yes | M major | N major |
|
||||
| TN | Yes | No | K major | K major |
|
||||
| NN | No | No | M major | K major |
|
||||
| TT | Yes | Yes | K major | N major |
|
||||
|
||||
#### MN-major and K-major
|
||||
|
||||
As we mentioned above, there are two "preferred arrangements," TN and NT. In the TN arrangement, both A and B are K-major. In the NT arrangement, A is M-major and B is N-major. Even though the two stride-1 modes in NT have different names, it's still the leftmost mode for both A and B that has stride 1. Thus, we can think of the NT arrangement as "MN-major," analogous to how the TN arrangement is "K-major."
|
||||
|
||||
The two preferred arrangements tend to work themselves into implementations, particularly when they use hardware instructions for accelerating matrix multiplies of blocks. In some cases, the hardware instruction may require NT (MN-major) or TN (K-major). For NN or TT, such instructions would require an intermediate transpose -- for example, when loading from global memory to shared memory.
|
||||
|
||||
### Block layouts
|
||||
|
||||
Efficient matrix multiply implementations loop over blocks.
|
||||
For example, a typical GPU implementation strategy
|
||||
is for each thread block to iterate over some number of blocks.
|
||||
In the example, this loop occurs near the end of `gemm_device`.
|
||||
|
||||
```c++
|
||||
// TUTORIAL: Example of a very simple compute loop
|
||||
// Data is read from global to shared memory via the tA|tB partitioning
|
||||
// gemm(.) operates on the shared memory directly via the tC partitioning
|
||||
|
||||
auto k_max = size<2>(tAgA);
|
||||
|
||||
for (int k = 0; k < k_max; ++k)
|
||||
{
|
||||
// Copy A and B blocks from global memory to shared memory.
|
||||
copy(tAgA(_,_,k), tAsA);
|
||||
copy(tBgB(_,_,k), tBsB);
|
||||
|
||||
// On some architectures, copy may be asynchronous.
|
||||
// This may call for extra synchronization instructions
|
||||
// beyond just __syncthreads().
|
||||
|
||||
__syncthreads();
|
||||
|
||||
// Compute gemm on shared memory input and register accumulator.
|
||||
// The "epilogue" after this loop will copy the accumulator
|
||||
// from the register file into global memory.
|
||||
gemm(tCsA, tCsB, tCrC);
|
||||
|
||||
__syncthreads();
|
||||
}
|
||||
```
|
||||
|
||||
We will explain the notation in this loop below. The important things to remember are that the coordinate `k` loops over the blocks which the calling thread block is supposed to compute, the `copy` functions copy A resp. B blocks from global memory (the first argument) to shared memory (the second argument -- same as C++'s `std::copy`, but the opposite of `memcpy`), and the `gemm` function computes C += A * B on the shared memory blocks.
|
||||
|
||||
It turns out that `copy` takes an optional first argument, the "atom," as in the following.
|
||||
|
||||
```c++
|
||||
copy(atom, source, destination);
|
||||
```
|
||||
|
||||
The "atom" is metadata that explains how to do the copy operation.
|
||||
|
||||
There are a few topics to push onto the stack.
|
||||
|
||||
The copy function call shows a notation for taking slices of a tensor. A CuTe `Tensor` is a multidimensional array view. It consists of a pointer and a `Layout`. You can learn more about `Tensor`s elsewhere in CuTe's documentation, but for now, please note that `tAgA(_,_,k)` means "create a Tensor that views (i, j, k) for all valid i, all valid j, and a specific value of k." The result has rank one less than the original Tensor. CuTe's underscore means the same thing as a single stand-alone colon in Fortran or Matlab. Note also that CuTe uses the same notation for slices as for tensor indexing. The implementation can distinguish the two cases by checking whether any of the arguments is an underscore. In contrast, the C++23 class mdspan uses a separate function, `submdspan` (not in C++23, and proposed for C++26; see [P2630](https://wg21.link/p2630)), for slicing.
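Here is a small, self-contained sketch of that slicing notation (the shapes are arbitrary assumptions):

```c++
// A rank-3 owning tensor with a fully static shape.
auto t = make_tensor<float>(make_shape(Int<4>{}, Int<8>{}, Int<2>{}));
// A rank-2 view: all i, all j, and k == 1.
auto s = t(_, _, 1);
```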
|
||||
|
||||
Fully understanding what `copy` and `gemm` do calls for learning about thread layouts as well, so we will wait to explain them completely. For now, note that these functions are implicitly parallel, as they are called collectively by all threads in a thread block.
|
||||
|
||||
The block dimensions are defined near the top of the host function `gemm`.
|
||||
|
||||
```c++
|
||||
// Define block sizes (static)
|
||||
auto bM = Int<128>{};
|
||||
auto bN = Int<128>{};
|
||||
auto bK = Int< 8>{};
|
||||
```
|
||||
|
||||
We see that these are fully compile-time dimensions. This is often the case, especially when we use hardware instructions that only work for certain problem dimensions. Three lines of code immediately below these construct the block layouts.
|
||||
|
||||
```c++
|
||||
// Define the block layouts (static)
|
||||
auto sA = make_layout(make_shape(bM,bK));
|
||||
auto sB = make_layout(make_shape(bN,bK));
|
||||
auto sC = make_layout(make_shape(bM,bN));
|
||||
```
|
||||
|
||||
Here, the block layouts just come from the block dimensions. A Layout has two things: a Shape, and Strides. If the caller does not provide Strides, then CuTe computes Strides corresponding to the default "column-major" arrangement of data. This just happens to match the global matrices' layouts, but in general doesn't have to. For example, in the NN or TT cases, we may want to transpose one of the input matrices when copying from global memory to shared memory.
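For illustration, a hedged sketch of the same block shape with explicit, non-default strides (here K-major, i.e., stride 1 in the K mode), reusing the `bM` and `bK` sizes defined above:

```c++
// (bM, bK) with strides (bK, 1): element (m, k) lives at offset m * bK + k.
auto sA_kmajor = make_layout(make_shape(bM, bK), make_stride(bK, Int<1>{}));
```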
|
||||
|
||||
The example "comments out" some code that prints all the layouts on "thread 0" of each thread block. If you enable the printing code and run the example, it will print all the layouts. For example, sA prints as
|
||||
|
||||
```
|
||||
sA
|
||||
(_128,_8)
|
||||
(_1,_128)
|
||||
```
|
||||
|
||||
and sB prints as
|
||||
|
||||
```
|
||||
sB
|
||||
(_128,_8)
|
||||
(_1,_128)
|
||||
```
|
||||
|
||||
consistent with the definitions above.
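The printing code might look like the following hedged sketch (`thread0()` and `print()` are CuTe utilities; the exact formatting of the output may differ from what is shown above):

```c++
if (thread0()) {
  print("sA\n"); print(sA); print("\n");
  print("sB\n"); print(sB); print("\n");
}
```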
|
||||
|
||||
If you have looked at other GEMM examples in CuTe, you might be wondering about hardware matrix-matrix multiply instructions. Those instructions tend to require certain values for shapes and strides, that may be a function of the matrix's element type. CuTe knows about these instructions and their required shapes and strides. We will go into more detail about that elsewhere.
|
||||
|
||||
The `gemm_device` top-level kernel uses these block layouts to allocate shared memory buffers for A and B tiles.
|
||||
|
||||
```c++
|
||||
// Shared memory buffers
|
||||
__shared__ TA smemA[cosize_v<ABlockLayout>];
|
||||
__shared__ TB smemB[cosize_v<BBlockLayout>];
|
||||
```
|
||||
|
||||
Note how the shared memory buffers' sizes depend only on the A resp. B layouts (and element sizes). What's a `cosize_v`? The "`_v`" is a C++ naming convention that specifies a function from one or more template argument(s), to a value. In this case, it's a number of elements. A layout is a function from a set of multidimensional coordinates to a set of one-dimensional array offsets. It's a function, so we can speak of its domain and codomain. The "cosize" of a layout is the size of its codomain. (See e.g., CuTe's implementation of `Layout`.) If we want to allocate a linear array, for which all the offsets produced by a layout are valid, then we can use the cosize of the layout as the length of the array (in terms of number of elements, not in terms of number of bytes).
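As a small check (a sketch using the block layout defined above): for the 128 x 8 column-major block layout, the largest offset produced is 127 * 1 + 7 * 128 = 1023, so its cosize is 1024 elements, which for a compact layout equals its size.

```c++
using SABlockLayout = decltype(make_layout(make_shape(Int<128>{}, Int<8>{})));
static_assert(cosize_v<SABlockLayout> == 1024, "compact column-major layout: cosize == size");
```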
|
||||
|
||||
### Thread layouts
|
||||
|
||||
CuTe uses a `Layout` to describe the assignment of threads to work items.
|
||||
In this example, the host function `gemm` constructs the thread layouts
|
||||
for A, B, and C.
|
||||
|
||||
```c++
|
||||
// Define the thread layouts (static)
|
||||
auto tA = make_layout(make_shape(Int<32>{}, Int< 8>{}));
|
||||
auto tB = make_layout(make_shape(Int<32>{}, Int< 8>{}));
|
||||
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{}));
|
||||
```
|
||||
|
||||
That is, the thread layout for the A read is M-major 32x8, for the B read is N-major 32x8, and for the C compute/write is M-major 16x16. These thread layouts will partition the data for their respective stages.
|
||||
|
||||
#### The example uses compile-time thread and block layouts
|
||||
|
||||
Note that the device function `gemm_device` insists that all the thread and block layouts are static -- that is, known at compile time. You can see this from the `CUTE_STATIC_ASSERT` statements near the top of `gemm_device`. `CUTE_STATIC_ASSERT` is a wrapper for `static_assert`, which fails at compile time if its condition is `false`.
|
||||
|
||||
```c++
|
||||
// Preconditions
|
||||
CUTE_STATIC_ASSERT(is_static<ABlockLayout>::value);
|
||||
CUTE_STATIC_ASSERT(is_static<BBlockLayout>::value);
|
||||
CUTE_STATIC_ASSERT(is_static<CBlockLayout>::value);
|
||||
|
||||
CUTE_STATIC_ASSERT(is_static<AThreadLayout>::value);
|
||||
CUTE_STATIC_ASSERT(is_static<BThreadLayout>::value);
|
||||
CUTE_STATIC_ASSERT(is_static<CThreadLayout>::value);
|
||||
```
|
||||
|
||||
Use of static layouts has two advantages. First, it makes it easier to prove correctness of the algorithm. If the code compiles, it's likely correct. (On the other hand, new CuTe users may find themselves doing more debugging at compile time than they have before.) Second, it makes it easier and faster for CuTe to dispatch to the correct optimized implementations (called "atoms" -- see below) for copying blocks and performing matrix multiplies.
|
||||
|
||||
#### The example's block gemm is parallel over elements of C
|
||||
|
||||
In the actual device function, `tC` has layout `CThreadLayout`. You might recall that the kernel function `gemm_device` uses `CThreadLayout` to derive the launch bounds, specifically the maximum number of threads per block. The launch bounds show up in the declaration of `gemm_device`.
|
||||
|
||||
```c++
|
||||
template <class MShape, class NShape, class KShape,
|
||||
class TA, class AStride, class ABlockLayout, class AThreadLayout,
|
||||
class TB, class BStride, class BBlockLayout, class BThreadLayout,
|
||||
class TC, class CStride, class CBlockLayout, class CThreadLayout,
|
||||
class Alpha, class Beta>
|
||||
__global__ static
|
||||
__launch_bounds__(decltype(size(CThreadLayout{}))::value)
|
||||
void
|
||||
gemm_device(MShape M, NShape N, KShape K,
|
||||
TA const* A, AStride dA, ABlockLayout blockA, AThreadLayout tA,
|
||||
TB const* B, BStride dB, BBlockLayout blockB, BThreadLayout tB,
|
||||
TC * C, CStride dC, CBlockLayout , CThreadLayout tC,
|
||||
Alpha alpha, Beta beta);
|
||||
```
|
||||
|
||||
The "size" of `CThreadLayout` is the total number of threads, 16 * 16 = 256. (We take `::value` because the size is actually `Int<256>`, a compile-time constant with a `static constexpr int value = 256` member.) This suggests that the block gemm function (in the loop over blocks) parallelizes over elements of the C block. We can see this as well from the kernel launch (at the end of the `gemm` host function), which uses the size of `CThreadLayout` as the block dimension.
|
||||
|
||||
```c++
|
||||
// Define the thread layouts (static)
|
||||
auto tA = make_layout(make_shape(Int<32>{}, Int< 8>{}));
|
||||
auto tB = make_layout(make_shape(Int<32>{}, Int< 8>{}));
|
||||
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{}));
|
||||
|
||||
dim3 dimBlock(size(tC));
|
||||
dim3 dimGrid(ceil_div(size(M), size(bM)),
|
||||
ceil_div(size(N), size(bN)));
|
||||
gemm_device
|
||||
<<< dimGrid, dimBlock, 0, stream >>>
|
||||
(M, N, K,
|
||||
A, dA, sA, tA,
|
||||
B, dB, sB, tB,
|
||||
C, dC, sC, tC,
|
||||
alpha, beta);
|
||||
```
|
||||
|
||||
Note that dimBlock is single-dimensional (despite being a dim3), as the size of a layout is a single value. We can see this also because the example only ever uses `threadIdx.x`, not `threadIdx.y`. Yet, C's thread layout has shape (16, 16). What's with that? Recall that a thread layout maps from a "logical" coordinate space (possibly multidimensional tuples of indices) to (one-dimensional) integer indices. In this case, `CThreadLayout` maps from pairs of indices in the Cartesian product space {0, 1, 2, ..., 15} x {0, 1, 2, ..., 15}, to one-dimensional indices 0, 1, 2, ..., 255. The latter, the output of `CThreadLayout`, is the actual thread index `threadIdx.x` in this case. `CThreadLayout` has only a shape (16, 16) and no nondefault strides, so it uses CuTe's default column-major arrangement (with strides (1, 16) in this case).
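As a sketch (this arithmetic is implied by the layout; it is not code from the example):

```c++
// With the default column-major (16,16):(1,16) thread layout,
// threadIdx.x decomposes into the logical (m, n) thread coordinate as
int thr_m = threadIdx.x % 16;  // coordinate along the stride-1  (M) mode
int thr_n = threadIdx.x / 16;  // coordinate along the stride-16 (N) mode
// so that threadIdx.x == thr_m * 1 + thr_n * 16, matching layout evaluation.
```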
|
||||
|
||||
#### What does `local_tile` do?
|
||||
|
||||
The following code near the top of `gemm_device`
|
||||
operates on the "global" (input and output) matrices A, B, and C
|
||||
(where mA, mB, and mC are their Tensor representations).
|
||||
|
||||
```c++
|
||||
// Get the appropriate blocks for this thread block --
|
||||
// potential for thread block locality
|
||||
auto blk_shape = make_shape(size<0>(sA), size<0>(sB), size<1>(sB)); // (BLK_M,BLK_N,BLK_K)
|
||||
auto blk_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k)
|
||||
|
||||
Tensor gA = local_tile(mA, blk_shape, blk_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k)
|
||||
Tensor gB = local_tile(mB, blk_shape, blk_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k)
|
||||
Tensor gC = local_tile(mC, blk_shape, blk_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N)
|
||||
```
|
||||
|
||||
There are two new features here:
|
||||
|
||||
* `make_coord`, which returns a `Coord`, a multidimensional index which can be used as the input of a `Layout`; and
|
||||
|
||||
* `local_tile`, which we will explain below.
|
||||
|
||||
The `Coord`(inate) `blk_coord` refers to the set of blocks (indexed by k -- the underscore here indicating a free parameter) our thread block will access. (The index k here doesn't mean the K mode; it's the same index as in the loop over blocks that does the computation.)
|
||||
|
||||
If we print out the `gA`, `gB`, and `gC` layouts, we get the following.
|
||||
|
||||
```
|
||||
gA
|
||||
(_128,_8,512)
|
||||
(_1,5120,40960)
|
||||
|
||||
gB
|
||||
(_128,_8,512)
|
||||
(_1,5120,40960)
|
||||
|
||||
gC
|
||||
(_128,_128)
|
||||
(_1,5120)
|
||||
```
|
||||
|
||||
All of these layouts come from the original input or output matrices A, B, and C. Thus, they preserve the original strides, which are all the same in this example (when using default problem dimensions), 5120. This is most easily seen in the gC layout. For the other layouts, there is a clue in 5120 * 8 = 40960. That is, every time we increase k by one, we "skip over 8 columns" of the global matrix, over to the next block of data. This illustrates an important feature of CuTe, that it can view the same data with different modes and/or strides, as a way to identify parallelism or locality.
|
||||
|
||||
## Next steps
|
||||
|
||||
The above "simple GEMM" example's performance on many problems
|
||||
is asymptotically optimal
|
||||
with respect to the GPU's floating-point throughput.
|
||||
Getting nearly peak performance
|
||||
relative to the GPU's floating-point throughput,
|
||||
for a wider variety of problem dimensions,
|
||||
calls for more advanced techniques.
|
||||
Please refer to other examples in this repository
|
||||
to learn more about those techniques.
|
||||
For example, the
|
||||
[predication section of the tutorial](./0y_predication.md)
|
||||
explains what to do if a matrix tiling
|
||||
doesn't perfectly divide the matrix.
|
||||
|
||||
### Implement GEMM as generalized tensor contraction (GETT)
|
||||
|
||||
"GETT" here stands for "general(ized) tensor times tensor,"
|
||||
a tensor contraction.
|
||||
|
||||
CuTe permits matrices to have nested `Layout`s.
|
||||
For example, a matrix A can have a nested `Layout` for its M and N modes.
|
||||
This means that we can use a "matrix" (`Tensor` with two modes)
|
||||
to represent any `Tensor`.
|
||||
This amounts to a "native hierarchical representation."
|
||||
|
||||
As a result, we can implement GETT by using
|
||||
our existing GEMM implementation layers,
|
||||
with a little bit of fancy custom predication for the K mode.
|
||||
This is because the stride type of A
|
||||
and the problem shape itself
|
||||
are CuTe Shapes and Strides.
|
||||
This lets us represent the hierarchical modes
|
||||
of a tensor contraction problem
|
||||
(which still fundamentally only have 4 modes --
|
||||
batch mode,
|
||||
two outer modes (one for A and one for B),
|
||||
and one reduction mode --
|
||||
each of which can now have as many nested modes as you want
|
||||
for the contraction's inputs).
|
||||
We thus implement GETT as contraction just in one mode -- the K mode.
|
||||
However, K itself can be hierarchical and can have noncontiguous strides.
|
||||
We can reorder the modes such that all contraction modes
|
||||
become a single, possibly hierarchical K mode in the kernel.
|
||||
This is how we would encode a contraction in multiple modes at once.
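As a hedged illustration (all names below are assumptions), a rank-4 operand A[m, k1, k2, l] can be presented to the GEMM layers as a rank-3 tensor whose K mode is hierarchical:

```c++
// Fold (k1, k2) into a single, hierarchical K mode with arbitrary strides.
auto shape_A  = make_shape (M, make_shape (K1,  K2),  L);
auto stride_A = make_stride(dM, make_stride(dK1, dK2), dL);
Tensor mA = make_tensor(make_gmem_ptr(ptr_A), make_layout(shape_A, stride_A));
// mA(m, make_coord(k1, k2), l) addresses the original 4-D data,
// while the GEMM layers see only an (M, K, L)-shaped tensor.
```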
|
||||
media/docs/cute/0y_predication.md
@ -0,0 +1,217 @@
|
||||
# Predication: What to do when tiling isn't perfect
|
||||
|
||||
The [GEMM tutorial](./0x_gemm_tutorial.md) shows how
|
||||
we compute a matrix-matrix multiply
|
||||
by iterating over tiles of the input matrices and output matrix.
|
||||
The examples all assume that the tiles fit evenly into the matrices,
|
||||
with no remainder.
|
||||
What do we do if this is not the case?
|
||||
For example, we might want to tile a 41 x 55 matrix into 4 x 8 tiles,
|
||||
but 41 / 4 is 10 remainder 1, and 55 / 8 is 6 remainder 7.
|
||||
What do we do with those "leftover" parts of the matrix?
|
||||
|
||||
Another way to say this is that `logical_divide`
|
||||
(CuTe's way of tiling layouts) "rounds up."
|
||||
For example, if `N` is the layout (1000, 1) and `B` is the layout (128, 1),
|
||||
then `logical_divide(N, B)` is the layout ((128, 8), (1, 128)).
|
||||
This effectively rounds up the original shape N = 1000
|
||||
into a 128 x 8 matrix (as if N = 1024).
|
||||
What about those last 24 elements,
|
||||
that aren't part of the original data?
|
||||
|
||||
The idiomatic CuTe way to solve this problem is through "predication."
|
||||
Rather than trying to reason about the "remainder tiles,"
|
||||
CuTe instead rounds up, but only tries to access data in each tile
|
||||
that are part of the matrix.
|
||||
This corresponds well with how our GPUs optimize:
|
||||
branches without warp divergence are relatively fast.
|
||||
It also matches the usual CUDA idiom
|
||||
when dividing N work items in 1-D fashion over B thread blocks:
|
||||
first test if "my thread" is out of bounds before doing work.
|
||||
|
||||
There are a few ways to figure out
|
||||
which elements need to be predicated.
|
||||
In-kernel GEMMs like to do this in the following way.
|
||||
|
||||
```c++
|
||||
// Create the predicate tensor
|
||||
Layout idA = make_layout(shape(A)); // e.g. 1000:1
|
||||
Layout idAB = logical_divide(idA, B); // e.g. (128,8):(1,128)
|
||||
|
||||
Tensor pred = make_tensor<bool>(shape(idAB));
|
||||
for (int i = 0; i < size(pred); ++i) {
|
||||
  pred(i) = idAB(i) < size(A);
|
||||
}
|
||||
|
||||
// ... intervening code ...
|
||||
|
||||
// Use the predicate tensor. c is some coordinate.
|
||||
// This code would likely live inside some algorithm.
|
||||
if (pred(c)) { copy(idAB(c), smem(c)); }
|
||||
```
|
||||
|
||||
The general procedure is that we
|
||||
|
||||
1. create an "identity" layout (`Layout idA = make_layout(shape(A))`,
|
||||
in the above example) with the same shape as our original data;
|
||||
|
||||
2. repeat the same tiling/partitioning/slicing (possibly rounding up)
|
||||
on that identity layout (`Layout idAB = logical_divide(idA, B)`);
|
||||
|
||||
3. create a "predicate tensor" by comparing the coordinates
|
||||
of that reference layout with the bounds of the original layout;
|
||||
and then
|
||||
|
||||
4. use the predicate tensor to mask off accesses to out-of-bounds elements.
|
||||
|
||||
For example, suppose that we've partitioned A and B tiles
|
||||
across threads as follows.
|
||||
|
||||
```c++
|
||||
Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k)
|
||||
Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K,PIPE)
|
||||
|
||||
Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k)
|
||||
Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K,PIPE)
|
||||
```
|
||||
|
||||
`tAgA` and `tBgB` partition the global A resp. B matrices over threads,
|
||||
and `tAsA` and `tBsB` partition the shared memory tiles of A resp. B over threads.
|
||||
|
||||
The following code creates predicate tensors
|
||||
corresponding to `tAgA` and `tBgB`.
|
||||
They will be computed once in the prologue
|
||||
and will be used to mask off instructions in the inner loop.
|
||||
|
||||
```c++
|
||||
Tensor tApA = make_tensor<bool>(make_shape (size<0>(tAgA), size<1>(tAgA)),
|
||||
make_stride( Int<1>{}, Int<0>{}));
|
||||
Tensor tBpB = make_tensor<bool>(make_shape (size<0>(tBgB), size<1>(tBgB)),
|
||||
make_stride( Int<1>{}, Int<0>{}));
|
||||
```
|
||||
|
||||
We're only thread-parallelizing over the leftmost (row) dimension,
|
||||
so we only need to predicate over the leftmost dimension.
|
||||
Thus, we can make the rightmost (column) stride zero,
|
||||
since we will never actually address the rightmost dimension.
|
||||
|
||||
The following code creates "two-dimensional identity tensors"
|
||||
that map coordinates (m,k) -> (m,k)
|
||||
for the tile of data within the thread block.
|
||||
|
||||
```c++
|
||||
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
|
||||
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
|
||||
```
|
||||
|
||||
The following lines then tile and partition
|
||||
the two reference tensors
|
||||
in exactly the same way the data were tiled and partitioned
|
||||
into `tAsA` and `tBsB`.
|
||||
|
||||
```c++
|
||||
Tensor tAcA = local_partition(cA, tA, thread_idx);
|
||||
Tensor tBcB = local_partition(cB, tB, thread_idx);
|
||||
```
|
||||
|
||||
Tiling and partitioning affect the offset and domain,
|
||||
but not the codomain of the tensors,
|
||||
so we're left with tensors that map `(thr_m,thr_k) -> (m,k)`
|
||||
where `(thr_m,thr_k)` is this particular thread's subtensor of the tile
|
||||
and `(m,k)` is the original codomain: a coordinate into the original tile.
|
||||
|
||||
The unrolled loops in the code below then compare
|
||||
the m- and n-coordinates of those tensors with our known maximums
|
||||
to mask off elements we are not allowed to access.
|
||||
|
||||
```c++
|
||||
Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
|
||||
Tensor tAcA = local_partition(cA, tA, thread_idx);
|
||||
|
||||
Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
|
||||
Tensor tBcB = local_partition(cB, tB, thread_idx);
|
||||
|
||||
// Populate
|
||||
CUTE_UNROLL
|
||||
for (int m = 0; m < size<0>(tApA); ++m) {
|
||||
  tApA(m,0) = get<0>(tAcA(m,0)) < m_max_coord;
|
||||
}
|
||||
CUTE_UNROLL
|
||||
for (int n = 0; n < size<0>(tBpB); ++n) {
|
||||
  tBpB(n,0) = get<0>(tBcB(n,0)) < n_max_coord;
|
||||
}
|
||||
```
|
||||
|
||||
Those last `for` loops fill in the two predicate tensors.
|
||||
In this case, we only need to predicate over the leftmost dimension,
|
||||
so we only address `(m,0)` resp. `(n,0)`.
|
||||
|
||||
We can then use the predicate tensors in `copy_if`
|
||||
to copy only the elements for which the corresponding
|
||||
predicate tensor elements are nonzero.
|
||||
|
||||
```c++
|
||||
// Prefetch k_tile=0, gate these on k_residue as well
|
||||
CUTE_UNROLL
|
||||
for (int k = 0; k < size<1>(tAsA); ++k) {
|
||||
  if (get<1>(tAcA(0,k)) >= -k_residue) { // some other condition on the column index
|
||||
    copy_if(tApA, tAgA(_,k,0), tAsA(_,k,0));
|
||||
  }
|
||||
}
|
||||
|
||||
CUTE_UNROLL
|
||||
for (int k = 0; k < size<1>(tBsB); ++k) {
|
||||
  if (get<1>(tBcB(0,k)) >= -k_residue) { // some other condition on the column index
|
||||
    copy_if(tBpB, tBgB(_,k,0), tBsB(_,k,0));
|
||||
  }
|
||||
}
|
||||
```
|
||||
|
||||
Here are some advantages of this "reference tensor" approach.
|
||||
|
||||
1. It doesn't depend on the layout/strides of the tensor
|
||||
being predicated, just the logical bounds being imposed.
|
||||
|
||||
2. The partitioning stage can be anything.
|
||||
|
||||
3. It naturally extends to any-dimensional predication.
|
||||
|
||||
4. It's a natural generalization of a typical CUDA 1-D
|
||||
parallel vector access pattern,
|
||||
which computes an access index `k`
|
||||
(e.g., as `blockDim.x * blockIdx.x + threadIdx.x`)
|
||||
and then predicates access to the vector's `k`-th element
|
||||
on whether `k` is in bounds.
|
||||
|
||||
As an example of (3), the epilogue predication does exactly the same thing,
|
||||
|
||||
```c++
|
||||
// Repeat with a tensor of coordinates for predication
|
||||
Tensor cC = make_identity_tensor(make_shape(size<0>(gC), size<1>(gC)));
|
||||
Tensor tCcC = thr_mma.partition_C(cC);
|
||||
|
||||
const bool isBetaZero = (beta == 0);
|
||||
|
||||
CUTE_UNROLL
|
||||
for (int i = 0; i < size(tCrC); ++i) {
|
||||
  if (elem_less(tCcC(i), make_coord(m_max_coord,n_max_coord))) {
|
||||
    tCgC(i) = isBetaZero ? alpha * tCrC(i) : alpha * tCrC(i) + beta * tCgC(i);
|
||||
  }
|
||||
}
|
||||
```
|
||||
|
||||
but with the mma responsible for the tiling/partitioning of `tCcC`
|
||||
so that the reference subtensor matches the accumulator's subtensor.
|
||||
Then, the reference subtensor is predicated against the `if` bounds
|
||||
(in both m- and n-coordinates) inside the `for` loop.
|
||||
|
||||
Another way to explain this is that we don't modify the tiles
|
||||
to give you the "right" extents so that you never overrun.
|
||||
Instead, we let you query the original coordinate
|
||||
to see if that coordinate overruns.
|
||||
This avoids all branching and variable/dynamic loop bounds
|
||||
(thus maintaining load balance and synchronicity,
|
||||
both very important in-kernel) in favor of predication.
|
||||
It's also general enough to extend to all ranks,
|
||||
all layouts of threads and data,
|
||||
and all tiling/partitioning patterns.
|
||||
media/docs/cutlass_3x_backwards_compatibility.md
@ -0,0 +1,473 @@
|
||||
[README](/README.md#documentation) > **CUTLASS 3.0 GEMM Backwards Compatibility**
|
||||
|
||||
# CUTLASS 3.0 GEMM Backwards Compatibility
|
||||
|
||||
Although CUTLASS 3.0 restructures the GEMM hierarchy and introduces new types for the
|
||||
threadblock layer and below, we intend the entire source code to be usable in user applications.
|
||||
We expect users to be able to `#include` any source file from CUTLASS 3.0, whether
|
||||
they implement the 2.x or the 3.x API, without breaking user builds. This means that a single
|
||||
translation unit should be able to contain any valid kernel regardless of its API version. The
|
||||
sections below discuss how `device` and `kernel` layer type names are made compatible across the
|
||||
two API versions, and what the users can expect out of the `threadblock` layer API going forward.
|
||||
|
||||
## Compatible Device API
|
||||
|
||||
The entry point for CUTLASS's Device GEMM API
|
||||
is the class
|
||||
`cutlass::gemm::device::GemmUniversalAdapter`.
|
||||
This class lives in the header file
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
|
||||
`GemmUniversalAdapter` is a "universal adapter"
|
||||
and serves as a common device interface
|
||||
for both CUTLASS 3.x and CUTLASS 2.x kernels.
|
||||
Its template parameter `GemmKernel`,
|
||||
the GEMM kernel type, can be any of the following:
|
||||
|
||||
* `cutlass::gemm::kernel::GemmUniversal`,
|
||||
implementing CUTLASS 3.x API kernels;
|
||||
* `cutlass::gemm::kernel::GemmUniversal`,
|
||||
implementing CUTLASS 2.x API kernels;
|
||||
* Any valid CUTLASS 2.x `kernel` layer GEMM that
|
||||
was previously composable with `device::GemmUniversalAdapter`
|
||||
|
||||
Users implementing new kernels in either API should prefer
|
||||
using `kernel::GemmUniversal` as the kernel type
|
||||
and compose it with `device::GemmUniversalAdapter`.
|
||||
Users with existing `kernel::Gemm` kernels
|
||||
can continue to use them as template arguments
|
||||
of `device::GemmUniversalAdapter`. They can adopt
|
||||
`GemmUniversal` as a gradual migration path,
|
||||
since `GemmUniversal` accepts either 3.0 or 2.x collectives.
|
||||
Please see the [next section for `kernel::GemmUniversal`](#compatible-kernel-api) for details.
|
||||
|
||||
`GemmUniversalAdapter` presents a single
|
||||
host-side interface to both 3.0 and 2.x kernels.
|
||||
CUTLASS accomplishes this by
|
||||
specializing `GemmUniversalAdapter`'s implementation
|
||||
on either 2.x API implementing kernel layer GEMMs, or 3.x API
|
||||
implementing kernel layer GEMMs (as detected by `gemm::detail::IsCutlass3GemmKernel`
|
||||
discussed below). As a result, `GemmUniversalAdapter`'s behavior
|
||||
might differ between the two specializations.
|
||||
|
||||
### Device API design differences
|
||||
|
||||
In CUTLASS 2.x, the Device API was more closely tied
|
||||
to the Kernel API. In CUTLASS 3.0, the Device API
|
||||
accepts any kernel type that meets the Kernel API
|
||||
interface requirements. CUTLASS 3.0's Device API code is
|
||||
parameterized by the kernel type, but this code
|
||||
is *generic*; the same code works for any kernel type.
|
||||
|
||||
The device layer compatibility interface, `device::GemmUniversalAdapter`,
|
||||
also provides reflective mappings from 3.0-specific types
|
||||
back to the closest possible 2.x equivalent types. This is [discussed further in the section below](#conversions-between-2x-tags-and-30-types).
|
||||
|
||||
CUTLASS 3.0's `device::GemmUniversalAdapter` also exposes some new APIs that the 2.x `device::GemmUniversalAdapter` implementation does not. Most notably, this includes the ability to bypass the `GemmKernel::Arguments` to `GemmKernel::Params` lowering.
|
||||
|
||||
```c++
|
||||
// Primary run() entry point API that is static allowing users to create and manage their own params.
|
||||
static Status
|
||||
run(Params& params, cudaStream_t stream = nullptr);
|
||||
```
|
||||
|
||||
This new API is useful for the following scenarios.
|
||||
|
||||
* Running again does not require reinvoking `GemmKernel::to_underlying_arguments()`
|
||||
* Manual control over construction of `GemmKernel::Params` for custom kernels with custom stride types
|
||||
* Fully static problem shapes and strides for bespoke kernels where no argument mapping needs to take place
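For example, a hypothetical usage sketch (the names `Gemm`, `args`, `workspace`, `stream`, and `num_iterations` are assumptions, not library code):

```c++
using GemmKernel = typename Gemm::GemmKernel;
// Lower Arguments to Params once ...
auto params = GemmKernel::to_underlying_arguments(args, workspace);
// ... then launch as many times as needed without re-mapping.
for (int iter = 0; iter < num_iterations; ++iter) {
  Gemm::run(params, stream);
}
```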
|
||||
|
||||
## Compatible Kernel API
|
||||
|
||||
CUTLASS 3.x API shares the kernel layer API with CUTLASS 2.x
|
||||
through the single entry point type `cutlass::gemm::kernel::GemmUniversal`.
|
||||
All kernel layer GEMMs are viewed as a composition of a collective mainloop
|
||||
and a collective epilogue.
|
||||
|
||||
**`kernel::GemmUniversal` implements both 2.x and 3.x APIs**
|
||||
|
||||
The entry point for CUTLASS's kernel API is the class
|
||||
`cutlass::gemm::kernel::GemmUniversal`.
|
||||
This class' declaration lives in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](/include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
|
||||
```c++
|
||||
/*
|
||||
* Stateless universal device GEMM kernel type that treats GEMM as
|
||||
* a composition of a collective mainloop and a collective epilogue.
|
||||
* SFINAE shims both 2.x and 3.0 API kernels based on ProblemShapeOrThreadblockMma_.
|
||||
**/
|
||||
template <
|
||||
class ProblemShapeOrThreadblockMma_,
|
||||
class CollectiveMainloopOrEpilogue_,
|
||||
class CollectiveEpilogueOrThreadblockSwizzle_,
|
||||
class GridSwizzle_ = void,
|
||||
class Enable = void
|
||||
>
|
||||
class GemmUniversal;
|
||||
```
|
||||
|
||||
We call this class "universal" because it can be built
|
||||
using either the CUTLASS 3.0 or the 2.x mainloops and epilogues.
|
||||
If `GemmUniversal`'s first template argument
|
||||
(`ProblemShapeOrThreadblockMma_`) is a `cute::tuple`,
|
||||
then `GemmUniversal` assumes that
|
||||
the remaining three template arguments
|
||||
(the mainloop, epilogue, and grid swizzle)
|
||||
implement the 3.0 APIs.
|
||||
Otherwise, `GemmUniversal` assumes that
|
||||
the remaining three template arguments
|
||||
implement the 2.x APIs.
|
||||
All the template arguments must be either
|
||||
CUTLASS 3.0 or CUTLASS 2.x types. For example,
|
||||
`GemmUniversal` does not permit using
|
||||
a 2.x mainloop with a 3.0 collective epilogue.
|
||||
|
||||
CUTLASS 3.x implements various embodiments of `kernel::GemmUniversal`.
|
||||
Each kernel layer schedule is specialized
|
||||
for a GEMM scheduling algorithm and GPU architecture.
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `gemm_*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](../../include/cutlass/gemm/kernel/).
|
||||
The specialization to which to dispatch is decided through the dispatch policy's `Schedule` type.
|
||||
|
||||
Specializations for 2.x APIs live in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](../../include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
|
||||
### Kernel API design differences
|
||||
|
||||
The CUTLASS 2.x Kernel API was more closely tied
|
||||
to the Device API, as we mentioned above.
|
||||
In particular, the 2.x Device API specified the grid shape
|
||||
used to launch the Kernel API.
|
||||
In CUTLASS 3.0, the Kernel API controls its own grid shape,
|
||||
while the device adapter simply queries the kernel with which it needs to be launched.
|
||||
|
||||
This change is required to support various kernel schedules
|
||||
that may need their own schedule specific grid planning logic.
|
||||
For example, persistent kernel schedules generally only launch with
|
||||
as many threadblocks as the number of multiprocessors on the GPU.
|
||||
|
||||
All CUTLASS 3 `kernel::GemmUniversal` specializations expose the following (static) API:
|
||||
|
||||
```c++
|
||||
// Returns true if the kernel can execute the provided GEMM arguments.
|
||||
static bool
|
||||
can_implement(Arguments const& args);
|
||||
|
||||
// Returns a dim3 representing the threadblock shape.
|
||||
static constexpr dim3
|
||||
get_block_shape();
|
||||
|
||||
// Returns a dim3 representing the grid shape in terms of threadblocks.
|
||||
static constexpr dim3
|
||||
get_grid_shape(Params const& params);
|
||||
```
|
||||
|
||||
The device adapter simply queries the kernel for these three before launching it on the device.
|
||||
CUTLASS 3.0 provides a meta-function to detect whether a `cutlass::gemm::kernel::*` implements
|
||||
the 3.x API or 2.x API:
|
||||
|
||||
```c++
|
||||
// include/cutlass/gemm/gemm.h
|
||||
|
||||
namespace cutlass::gemm::detail {
|
||||
|
||||
// The following metafunction is used to detect whether a
|
||||
// `kernel::Gemm` or `kernel::GemmUniversal` implements the CUTLASS 3.x API,
|
||||
// by checking whether the problem shape type is aliased within.
|
||||
template <class GemmKernel, class = void>
|
||||
struct IsCutlass3GemmKernel;
|
||||
|
||||
} // namespace cutlass::gemm::detail
|
||||
```
|
||||
|
||||
Users can dispatch their generic code against 2.x and 3.x specializations with
|
||||
this as a type trait for the kernel API version.
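For example, a hedged sketch of such a dispatch (the comments describe typical differences between the two APIs, not an exhaustive list):

```c++
template <class GemmKernel>
void configure_launch() {
  if constexpr (cutlass::gemm::detail::IsCutlass3GemmKernel<GemmKernel>::value) {
    // 3.x kernel: the kernel itself reports its block and grid shapes.
  } else {
    // 2.x kernel: the grid shape comes from the device layer's threadblock swizzle.
  }
}
```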
|
||||
|
||||
## Threadblock API and Inner Loops
|
||||
|
||||
Much of the CUTLASS 3 GEMM hierarchy for mainloops and inner loops diverges
|
||||
from that of CUTLASS 2.x. With that also comes the introduction of the
|
||||
`cutlass::gemm::collective` layer as a direct replacement and a superset
|
||||
of the 2.x `cutlass::gemm::threadblock` layer. Going forward,
|
||||
CUTLASS 3.x will discontinue new developments in the following namespaces.
|
||||
|
||||
* `cutlass::*::threadblock::*`
|
||||
* `cutlass::*::warp::*`
|
||||
* `cutlass::gemm::thread::*`
|
||||
* `cutlass::arch::*` (except `barrier.h`)
|
||||
|
||||
`cutlass::gemm::collective`s are a superset of the threadblock layer where
|
||||
all new mainloops will be developed. Users should look to the `CollectiveMma` type
|
||||
if they wish to author custom mainloop code in the 3.x API.
|
||||
|
||||
Similarly, for the GEMM inner loops, `cute::MMA_Atom`s replace the
|
||||
`gemm::warp` and `gemm::thread` layer code. Going forward, all new PTX instructions
|
||||
and associated metadata development will occur directly inside [`cute/arch/*.hpp`](/include/cute/arch/) and [`cute/atom/*.hpp`](/include/cute/atom/).
|
||||
|
||||
The desired inner loop MMA iteration order and tiling can be achieved through careful
|
||||
selection of the atom layout, value layout, and permutations of the `cute::TiledMma`.
|
||||
|
||||
For epilogues, the `cutlass::epilogue::collective` layer replaces `cutlass::epilogue::threadblock`. However, the thread-level epilogue elementwise operations
|
||||
in `cutlass::epilogue::thread` will continue to be used in 3.x kernels as well, albeit with
|
||||
a more idiomatic epilogue vectorization strategy.
|
||||
[Example 50](/examples/50_hopper_gemm_with_epilogue_swizzle/50_hopper_gemm_with_epilogue_swizzle.cu)
|
||||
shows how to use 2.x epilogue thread operators with 3.0 API kernels.
|
||||
|
||||
## Porting from 2.x to 3.0 API
|
||||
|
||||
### CUTLASS 2.x layout tags and CUTLASS 3.0 major modes
|
||||
|
||||
CUTLASS 2.x and CUTLASS 3.0 use both
|
||||
different wording and different types
|
||||
to describe the permitted layouts
|
||||
of GEMM's input matrices A and B.
|
||||
|
||||
CUTLASS 3.0 does not use the terms "column major"
|
||||
or "row major" to describe matrix layouts.
|
||||
Starting with CUTLASS 3.0, adoption of CuTe allows us to decouple
|
||||
|
||||
* the coordinate mode order (logical shape) of layouts from
|
||||
|
||||
* the index space stride order of the backing storage.
|
||||
|
||||
In line with our switch to a conceptual GEMM hierarchy, we view the major modes not from a BLAS-3 perspective.
|
||||
Rather, we divide the modes into two categories.
|
||||
|
||||
* "Inner modes" or "K-modes" are contracted over during the GEMM.
|
||||
Therefore, they are not present in the output tensor.
|
||||
|
||||
* "Outer modes" or "MN-modes" are preserved in the output.
|
||||
|
||||
Now, instead of `RowMajor` or `ColumnMajor`, whose major stride depends on whether we are referring to the
|
||||
A or the B matrix, we uniformly employ the "K major" or "MN major" terminology and enforce the convention of all tensors having the shape `[M/N, K, L]` regardless of which mode is major. That is,
|
||||
|
||||
* the input matrix A has shape M x K,
|
||||
* the input matrix B has shape N x K, and
|
||||
* the input/output matrices C/D have shape M x N.
|
||||
|
||||
Note that this convention for B
|
||||
differs from the BLAS's GEMM interface,
|
||||
which specifies that B has shape K x N.
|
||||
|
||||
CUTLASS 3.0 uses these names of the modes
|
||||
to specify which mode of a matrix has stride 1.
|
||||
For the matrix A,
|
||||
|
||||
* "M major" means that the matrix is stride 1
|
||||
in the M mode, and
|
||||
* "K major" means that the matrix is stride 1
|
||||
in the K mode.
|
||||
|
||||
For the matrix B,
|
||||
|
||||
* "N major" means that the matrix is stride 1
|
||||
in the N mode (which for B is mode 0,
|
||||
because the convention is that B is N x K); and
|
||||
* "K major" means that the matrix is stride 1
|
||||
in the K mode (which for B is mode 1).
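To make these conventions concrete, here is a hedged CuTe sketch (not taken from the library; `M`, `N`, `K`, `L`, `ldA`, and `ldB` are assumed run-time values, and the batch strides assume packed batches):

```c++
// M-major A: shape (M, K, L), stride 1 in the M mode.
auto layout_A = make_layout(make_shape(M, K, L), make_stride(Int<1>{}, ldA, K * ldA));
// K-major B: shape (N, K, L), stride 1 in the K mode.
auto layout_B = make_layout(make_shape(N, K, L), make_stride(ldB, Int<1>{}, N * ldB));
```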
|
||||
|
||||
CUTLASS 2.x defines "layout tag" classes
|
||||
`cutlass::layout::ColumnMajor` and `cutlass::layout::RowMajor`,
|
||||
that live in the header file
|
||||
[`cutlass/layout/matrix.h`](/include/cutlass/layout/matrix.h).
|
||||
The interpretation of these layouts in GEMM
|
||||
depends on whether they are applied
|
||||
to the input matrix A or B. For the matrix A, "column major" means
|
||||
that the mode corresponding to the M extent has stride 1,
|
||||
and "row major" means that mode corresponding to K extent has stride 1.
|
||||
This is the usual computer science definition
|
||||
of column major and row major for a rank-2 array.
|
||||
For the matrix B, the opposite holds:
|
||||
"column major" means that mode corresponding to N extent has stride 1,
|
||||
and "row major" means that mode corresponding to K extent has stride 1.
|
||||
|
||||
Using the convention of `[outer, inner, batch]` mode order for tensor logical shapes
|
||||
avoids potential confusion with the meaning of column major and row major
|
||||
changing depending on whether they are applied to A or B.
|
||||
|
||||
The table below summarizes our mode order convention and
|
||||
mapping of 2.x layout tags to corresponding M-major, N-major, or K-major strides.
|
||||
|
||||
| Matrix | CUTLASS 2.x layout | 2.x Shape | Logical major mode| 3.x Shape/Stride | Major ordinal |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| A | `ColumnMajor` | M x K | M major | M x K x L | 0 (outer) |
|
||||
| A | `RowMajor` | M x K | K major | M x K x L | 1 (inner) |
|
||||
| B | `RowMajor` | K x N | N major | N x K x L | 0 (outer) |
|
||||
| B | `ColumnMajor` | K x N | K major | N x K x L | 1 (inner) |
|
||||
| C | `ColumnMajor` | M x N | M major | M x N x L | 0 (outer) |
|
||||
| C | `RowMajor` | M x N | N major | M x N x L | 1 (inner) |
|
||||
|
||||
Notice that in CUTLASS 3.0, interpretation of layouts no longer changes based on
|
||||
whether we are talking about the A or B matrix. M and N major inputs always have a
|
||||
static size-1 stride in their 0th (outer) mode. Similarly, K major inputs
|
||||
always contain the static size-1 stride in their 1st mode. This uniformity in stride order
|
||||
allows us to represent tensor layouts much more cleanly and treat both A and B equally in our interfaces.
|
||||
See for example the following snippet from our [`kernel/sm70_gemm.hpp`](/include/cutlass/gemm/kernel/sm70_gemm.hpp)
|
||||
for Ampere kernel schedules.
|
||||
|
||||
```c++
|
||||
// Represent the full tensors
|
||||
Tensor mA_mkl = make_tensor(make_gmem_ptr(params.mainloop.ptr_A), make_shape(M,K,L), params.mainloop.dA); // (m,k,l)
|
||||
Tensor mB_nkl = make_tensor(make_gmem_ptr(params.mainloop.ptr_B), make_shape(N,K,L), params.mainloop.dB); // (n,k,l)
|
||||
|
||||
// Get batch slice
|
||||
Tensor mA_mk = mA_mkl(_,_,get<3>(blk_coord_mnkl)); // (m,k)
|
||||
Tensor mB_nk = mB_nkl(_,_,get<3>(blk_coord_mnkl)); // (n,k)
|
||||
|
||||
// Slice to get the tiles for which this thread block is responsible
|
||||
Tensor gA = local_tile(mA_mk, blk_shape, take<0,3>(blk_coord_mnkl), Step<_1, X,_1>{}); // (BLK_M,BLK_K,k)
|
||||
Tensor gB = local_tile(mB_nk, blk_shape, take<0,3>(blk_coord_mnkl), Step< X,_1,_1>{}); // (BLK_N,BLK_K,k)
|
||||
```
|
||||
|
||||
As seen in this snippet, all input tensors have the logical shape `[outer, inner, batch]`,
|
||||
and the strides can describe outer-major, inner-major,
|
||||
or any other (possibly hierarchical) storage order.
|
||||
CuTe layouts always maintain the logical consistency of the coordinate spaces regardless of the strides.
|
||||
|
||||
By convention, in CUTLASS 3.0, we treat the M and N mode as the 0th mode,
|
||||
and K mode as the 1st mode of the stride.
|
||||
|
||||
### Conversions between 2.x tags and 3.0 types
|
||||
|
||||
Starting with CUTLASS 3.0, all layouts are described using
|
||||
`cute::Shape` and `cute::Stride` which compose into a `cute::Layout<Shape, Stride>`.
|
||||
In CUTLASS 2.x, various layout tags such as `cutlass::layout::RowMajor` are used to specialize
|
||||
template implementations. These tag types only encode information about the tensor strides,
|
||||
as 2.x layouts did not incorporate any concept of tensor shape in the layout tags themselves.
|
||||
Users may find a need to convert between CUTLASS 2.x layout tags, and 3.0
|
||||
CuTe stride types. CUTLASS 3.0 `gemm::collective::CollectiveBuilder` interfaces
|
||||
also accept these 2.x layout tags as input parameters in their template API as a convenience for users.
|
||||
At every entry point into CUTLASS 3.0, these tags get converted to their corresponding CuTe Stride type with
|
||||
metafunctions that best approximate their corresponding `cute::Stride`.
|
||||
|
||||
* `cutlass::gemm::detail::TagToStrideA_t<LayoutTag>`
|
||||
* `cutlass::gemm::detail::TagToStrideB_t<LayoutTag>`
|
||||
* `cutlass::gemm::detail::TagToStrideC_t<LayoutTag>`
|
||||
|
||||
By convention, and to match user expectations, the `cute::Stride` types that these
|
||||
map onto always contain one static mode corresponding to the layout tag, and two 64-bit
|
||||
dynamic stride modes corresponding to the minor mode and the batch mode. Batch
|
||||
mode is included by default as all CUTLASS 3.0 kernels support packed batch-mode GEMMs
|
||||
out of the box.
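For illustration, a hedged sketch of what such a mapping might produce (the exact integer type is an implementation detail; the guarantee stated above is one static mode plus two dynamic 64-bit modes):

```c++
using StrideA = cutlass::gemm::detail::TagToStrideA_t<cutlass::layout::ColumnMajor>;
// Expected to be something along the lines of
//   cute::Stride<cute::Int<1>, int64_t, int64_t>
// i.e. unit stride in M, a dynamic stride for K, and a dynamic batch stride.
```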
|
||||
|
||||
The [`cutlass/gemm/gemm.h#440`](../../include/cutlass/gemm/gemm.h#440)
|
||||
header file includes functions
|
||||
that can be useful for converting
|
||||
from CUTLASS 3.0 `cute::Stride`s back to CUTLASS 2.x layout tags.
|
||||
|
||||
* `cutlass::gemm::detail::StrideToLayoutTagA_t<CuteStride>`
|
||||
* `cutlass::gemm::detail::StrideToLayoutTagB_t<CuteStride>`
|
||||
* `cutlass::gemm::detail::StrideToLayoutTagC_t<CuteStride>`
|
||||
|
||||
These metafunctions take the CuTe Stride as a template parameter and
|
||||
attempt to find the size-1 stride in the idiomatic M, N, or K modes
|
||||
to best approximate a corresponding 2.x layout tag type.
|
||||
Note that this may not work in general for any `cute::Stride`
|
||||
as the mapping between the stride and tag type is not bijective.
|
||||
|
||||
These mapping utilities are kept in a `detail` namespace
|
||||
as we do not guarantee stability of their implementation.
|
||||
Their behavior may change in future releases as we add new features.
|
||||
However, we do expect these type names to remain stable. For users who want
|
||||
these 2.x reflective types from an assembled kernel with a more stable API,
|
||||
the specialization of `cutlass::gemm::device::GemmUniversalAdapter`
|
||||
for CUTLASS 3.0 kernels provides aliases for all the 2.x type names
|
||||
in addition to the layout tags. You can see how they are used in the header file
|
||||
[`cutlass/gemm/device/gemm_universal_adapter.h`](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
Here is an excerpt.
|
||||
|
||||
```c++
|
||||
// Map back to 2.x type as best as possible
|
||||
using LayoutA = gemm::detail::StrideToLayoutTagA_t<typename GemmKernel::StrideA>;
|
||||
using LayoutB = gemm::detail::StrideToLayoutTagB_t<typename GemmKernel::StrideB>;
|
||||
using LayoutC = gemm::detail::StrideToLayoutTagC_t<typename GemmKernel::StrideC>;
|
||||
using LayoutD = gemm::detail::StrideToLayoutTagC_t<typename GemmKernel::StrideD>;
|
||||
|
||||
// Legacy: Assume MultiplyAdd only since we do not use this tag type in 3.0
|
||||
using MathOperator = cutlass::arch::OpMultiplyAdd;
|
||||
|
||||
// If our TiledMMA's instruction thread layout size is larger than 1,
|
||||
// we know it's a tensorop
|
||||
using OperatorClass = std::conditional_t<
|
||||
(cute::size(typename GemmKernel::TiledMma::AtomThrID{}) > 1),
|
||||
cutlass::arch::OpClassTensorOp, cutlass::arch::OpClassSimt>;
|
||||
|
||||
// Assume TiledMma's ShapeMNK is the same as 2.x's ThreadblockShape
|
||||
using ThreadblockShape = cutlass::gemm::GemmShape<
|
||||
cute::size<0>(TileShape{}),
|
||||
cute::size<1>(TileShape{}),
|
||||
cute::size<2>(TileShape{})>;
|
||||
|
||||
using ClusterShape = cutlass::gemm::GemmShape<
|
||||
cute::size<0>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
|
||||
cute::size<1>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
|
||||
cute::size<2>(typename GemmKernel::DispatchPolicy::ClusterShape{})>;
|
||||
|
||||
// We get the instruction shape directly from our TiledMma's atom shape
|
||||
using InstructionShape = cutlass::gemm::GemmShape<
|
||||
cute::size<0>(typename CollectiveMainloop::TiledMma::AtomShape_MNK{}),
|
||||
cute::size<1>(typename CollectiveMainloop::TiledMma::AtomShape_MNK{}),
|
||||
cute::size<2>(typename CollectiveMainloop::TiledMma::AtomShape_MNK{})>;
|
||||
|
||||
static int constexpr kStages = CollectiveMainloop::DispatchPolicy::Stages;
|
||||
static int const kThreadCount = GemmKernel::MaxThreadsPerBlock;
|
||||
|
||||
// Warp shape is not a primary API type in 3.x,
|
||||
// but we can best approximate it by inspecting the TiledMma::TiledShape_MNK.
|
||||
// For this, we make the assumption that we always have 4 warps along M,
|
||||
// and the rest along N, with none along K. We also always round up
|
||||
// the warp count to 4 if the tiled mma is smaller than 128 threads.
|
||||
static constexpr int WarpsInMma = std::max(4, cute::size(typename GemmKernel::TiledMma{}) / 32);
|
||||
static constexpr int WarpsInMmaM = 4;
|
||||
static constexpr int WarpsInMmaN = cute::ceil_div(WarpsInMma, WarpsInMmaM);
|
||||
using WarpCount = cutlass::gemm::GemmShape<WarpsInMmaM, WarpsInMmaN, 1>;
|
||||
using WarpShape = cutlass::gemm::GemmShape<
|
||||
cute::size<0>(typename CollectiveMainloop::TiledMma::TiledShape_MNK{}) / WarpsInMmaM,
|
||||
cute::size<1>(typename CollectiveMainloop::TiledMma::TiledShape_MNK{}) / WarpsInMmaN,
|
||||
cute::size<2>(typename CollectiveMainloop::TiledMma::TiledShape_MNK{})>;
|
||||
|
||||
// Inspect TiledCopy for A and B to compute the alignment size
|
||||
static int constexpr kAlignmentA = gemm::detail::get_alignment_count_from_gmem_tiled_copy<
|
||||
typename CollectiveMainloop::GmemTiledCopyA, ElementA>();
|
||||
static int constexpr kAlignmentB = gemm::detail::get_alignment_count_from_gmem_tiled_copy<
|
||||
typename CollectiveMainloop::GmemTiledCopyB, ElementB>();
|
||||
```
|
||||
|
||||
CUTLASS's library and profiler use these reflective interfaces to
|
||||
obtain the kernel's configuration parameters. Users can use these to approximate the CUTLASS 2.x types
|
||||
for 3.0 API kernels. However, the reflective interfaces cannot always match the types exactly,
|
||||
as the mappings are not always bijective.
|
||||
|
||||
# Copyright
|
||||
|
||||
Copyright (c) 2023 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
SPDX-License-Identifier: BSD-3-Clause
|
||||
|
||||
```
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are met:
|
||||
|
||||
1. Redistributions of source code must retain the above copyright notice, this
|
||||
list of conditions and the following disclaimer.
|
||||
|
||||
2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
this list of conditions and the following disclaimer in the documentation
|
||||
and/or other materials provided with the distribution.
|
||||
|
||||
3. Neither the name of the copyright holder nor the names of its
|
||||
contributors may be used to endorse or promote products derived from
|
||||
this software without specific prior written permission.
|
||||
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
```
|
||||
media/docs/cutlass_3x_design.md
@ -0,0 +1,117 @@
|
||||
[README](/README.md#documentation) > **CUTLASS 3.0 Design and Hierarchy**
|
||||
|
||||
# CUTLASS 3.0 Design
|
||||
|
||||
CUTLASS 3.0 is a major enhancement over the abstractions of CUTLASS 2.x
|
||||
and aims to make usage of all layers of the GEMM hierarchy easier and more composable
|
||||
while still achieving peak performance on hardware.
|
||||
|
||||
## CUTLASS 3.0 design goals
|
||||
|
||||
CUTLASS 3.0 has the following design goals, in no particular order.
|
||||
|
||||
- Simplify expressing and manipulating data and thread layouts across
|
||||
the GEMM hierarchy with CuTe layouts and layout algebra.
|
||||
|
||||
- Improve code readability and learning curve by
|
||||
reducing the number of named types.
|
||||
|
||||
- Functional correctness by default,
|
||||
actionable static asserts otherwise.
|
||||
|
||||
- Single, clear points of performance tuning and custom kernel extensions.
|
||||
|
||||
- Support for NVIDIA Hopper GPUs with great performance using
|
||||
features such as Tensor Cores, tensor memory accelerator, and thread block clusters.
|
||||
|
||||
## A new Conceptual GEMM Hierarchy
|
||||
|
||||
CUTLASS 2.x decomposes the moving parts of a GEMM operation
|
||||
across a hierarchy that closely mirrors the organization of GPU
|
||||
architectures. This is discussed in detail in the
|
||||
[CUTLASS 2.x GEMM API documentation](/media/docs/gemm_api.md).
|
||||
This design, however, sometimes results in a coupling that is too tight
|
||||
to extend to newer GPU features that might not fit into the same architectural
|
||||
hierarchy. For instance, Hopper's warp-group wide instructions do not naturally
|
||||
fit into any warp or thread layer GEMM concept in CUTLASS 2.x. Even for Volta tensor cores,
|
||||
instructions that atomically exist at the quad-pair granularity are first tiled at
|
||||
the warp level before use. This hints at the brittleness of the abstraction.
|
||||
|
||||
CUTLASS 3.0 detaches its interface layers from the hardware,
|
||||
centering them instead around the natural structure of GEMM algorithms
|
||||
not tied to any particular GPU generation.
|
||||
This makes CUTLASS's code more robust to GPU architecture evolution,
|
||||
less prone to implementation detail leakage, and provides users
|
||||
with a consistent interface to hardware acceleration regardless of
|
||||
the architecture specific details.
|
||||
|
||||
The new conceptual GEMM hierarchy is discussed in detail in the dedicated
|
||||
[CUTLASS 3.0 GEMM API documentation readme](/media/docs/gemm_api_3x.md),
|
||||
along with code examples of the core concepts and types.
|
||||
|
||||
## Adoption of CuTe Layout and Tensors
|
||||
|
||||
CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tensors of threads and data.
|
||||
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user.
|
||||
|
||||
CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates, greatly simplifying the design
and improving code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](/media/docs/cute/00_quickstart.md).
|
||||
|
||||

|
||||
|
||||
Programming massively parallel systems with various layers of logical thread and data hierarchies is not a trivial task.
|
||||
|
||||
- `cute::Layout`s always maintain logical consistency of their coordinates,
|
||||
allowing us to check pre- and post-conditions at compile time for all static inner loops.
|
||||
- Explicit thread to data mapping allows users and kernel authors to inspect and reason about operations
|
||||
from a single point in the source code.
|
||||
- Layouts provide a single point of performance tuning, as most optimizations can be done by careful
|
||||
selection of thread and data layouts.
|
||||
- Formalized algebra makes manipulation of and reasoning about thread->data mapping explicit in source code.
|
||||
- Single vocabulary type (`cute::Layout`) subsumes every iterator and layout in CUTLASS 2.x. CUTLASS 2.x uses many bespoke thread maps, iterators, and data layouts; iterators are fundamentally 1-D, whereas most layouts we encounter in the GPU hierarchy are fundamentally n-D.
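
As a minimal, self-contained sketch of this single vocabulary type (not an excerpt from the CUTLASS sources), a fully static `cute::Layout` maps a logical n-D coordinate to a linear offset:

```c++
#include <cute/layout.hpp>

int main() {
  using namespace cute;
  // A fully static 4x8 column-major layout: shape (4,8), strides (1,4).
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<1>{}, Int<4>{}));
  // The layout is a function from logical coordinates to offsets:
  // (m,n) = (2,3) maps to 2*1 + 3*4 = 14.
  return (size(layout) == 32 && layout(2, 3) == 14) ? 0 : 1;
}
```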
|
||||
|
||||
## Reducing the number of named types and iterator concepts
|
||||
|
||||
The CUTLASS 2.x design preferred introducing bespoke named types for each
architecture-specific thread and data layout. For instance, the `gemm::threadblock` namespace
contains implementations of `MmaMultistage`, `MmaPlanarComplexMultistage`, `MmaPipelined`, etc.,
even though they all provide mainloops for GEMMs. To spell these types the same way in generic code,
CUTLASS 2.x provides aliases through its `default_x_configuration.h` files; however,
these aliases make the code much harder to read, as the user has to perform type substitution
mentally in order to understand the codebase.
|
||||
|
||||
CUTLASS 3.0 greatly reduces the number of named types used throughout by
|
||||
|
||||
- Replacing all iterator concepts for all memory domains with `cute::Tensor`s
|
||||
- Dispatching mainloop and epilogue implementations on tag-dispatch policies rather than naming new types
|
||||
- Dispatching kernel layer schedules on tag-dispatch policies rather than naming new types
|
||||
|
||||
Reducing the number of named types has many benefits:
|
||||
|
||||
- It *makes writing generic code easier*, as the primary types share the same lexical name
without aliasing through configuration providers.
|
||||
- It *flattens the learning curve of CUTLASS* by greatly reducing the mental context required
|
||||
as the library only exposes a handful of named types.
|
||||
- It *provides a clear, singular extension point* for users to plug in their customizations
|
||||
through the dispatch policies.
|
||||
|
||||
## Correctness by default, Performance through clear, individual points of tuning
|
||||
|
||||
CUTLASS 2.x maintained its thread layouts as implicit indexing math implemented
|
||||
as a part of 1D iterators. This meant that the thread to data layout mapping
|
||||
was implicit in the imperative structure of the C++ code itself and did not have
|
||||
a formal algebra we could use to manipulate these mappings. Each iterator
|
||||
had to re-implement its indexing and mapping logic. This made it hard to learn
|
||||
how this mapping was performed for existing iterators, and even harder to
|
||||
implement custom layout functions for the core inner loops of a GEMM.
|
||||
|
||||
CUTLASS 3.0 replaces all iterator concepts from CUTLASS 2.x
|
||||
with a single layout type for thread and data tensors.
|
||||
CuTe's formalized layout algebra is then used at every layer of
|
||||
the GEMM hierarchy to manipulate the mapping between the two.
|
||||
CuTe layouts always maintain logical consistency, and for fully static layouts
|
||||
(such as in the core unrolled inner loops), provide
|
||||
compile time checks that break builds if this consistency is violated.
|
||||
In this way, CuTe reifies the thread-to-data-layout mapping,
making it easier to write code that is "correct by construction".
If the code compiles, it's probably correct.
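
As an illustrative sketch of what "breaking the build" means here (the tile and fragment sizes are assumptions, not taken from the CUTLASS sources), a static consistency condition between two fully static layouts can be checked at compile time:

```c++
#include <cute/layout.hpp>

int main() {
  using namespace cute;
  // A 128x64 shared memory tile and an 8x4 per-thread fragment, both fully static.
  auto smem_layout = make_layout(make_shape(Int<128>{}, Int<64>{}));
  auto frag_layout = make_layout(make_shape(Int<8>{},   Int<4>{}));
  // If the fragment did not evenly divide the tile, this would fail at compile
  // time instead of producing an out-of-bounds access at run time.
  CUTE_STATIC_ASSERT_V(size(smem_layout) % size(frag_layout) == Int<0>{});
  return 0;
}
```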
|
||||
@ -1,14 +1,14 @@
|
||||
# CUTLASS 2.0
|
||||
# CUTLASS 3.0
|
||||
|
||||
_CUTLASS 2.0 - November 2019_
|
||||
_CUTLASS 3.0 - January 2023_
|
||||
|
||||
CUTLASS is a collection of CUDA C++ template abstractions for implementing
|
||||
high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
|
||||
It incorporates strategies for hierarchical decomposition and data movement similar
|
||||
to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
|
||||
reusable, modular software components abstracted by C++ template classes. These
|
||||
thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
|
||||
and tuned via custom tiling sizes, data types, and other algorithmic policy. The
|
||||
components can be specialized
|
||||
and tuned via custom tiling sizes, data types, and other algorithmic policies. The
|
||||
resulting flexibility simplifies their use as building blocks within custom kernels
|
||||
and applications.
|
||||
|
||||
@ -16,107 +16,25 @@ To support a wide variety of applications, CUTLASS provides extensive support fo
|
||||
mixed-precision computations, providing specialized data-movement and
|
||||
multiply-accumulate abstractions for 8-bit integer, half-precision floating
|
||||
point (FP16), single-precision floating point (FP32), and double-precision floating
|
||||
point (FP64) types. Furthermore, CUTLASS demonstrates warp-synchronous matrix multiply
|
||||
operations for targeting the programmable, high-throughput _Tensor Cores_ implemented
|
||||
by NVIDIA's Volta and Turing architectures.
|
||||
point (FP64) types. Furthermore, CUTLASS exploits the _Tensor Cores_ and asynchronous
|
||||
memory copy operations of the latest NVIDIA GPU architectures.
|
||||
|
||||
# What's New in CUTLASS 3.0
|
||||
|
||||
# What's New in CUTLASS 2.0
|
||||
For an overview of CUTLASS 3.0's GEMM interface levels,
|
||||
please refer to the
|
||||
[CUTLASS 3.0 GEMM API document](./gemm_api_3x.md).
|
||||
To learn how to migrate code using CUTLASS 2.x's interface
|
||||
to CUTLASS 3.0, please refer to the
|
||||
[backwards compatibility document](./cutlass_3x_backwards_compatibility.md).
|
||||
|
||||
CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
|
||||
# GEMM examples
|
||||
|
||||
- Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
|
||||
- Robust and durable templates that reliably span the design space
|
||||
- Encapsulated functionality that may be reusable in other contexts
|
||||
|
||||
|
||||
# Example CUTLASS GEMM
|
||||
|
||||
The following illustrates an example function that defines a CUTLASS GEMM kernel
|
||||
with single-precision inputs and outputs. This is an excerpt from the CUTLASS SDK
|
||||
[basic_gemm example](https://github.com/NVIDIA/cutlass/tree/master/examples/00_basic_gemm/basic_gemm.cu).
|
||||
|
||||
~~~~~~~~~~~~~~~~~~~~~{.cpp}
|
||||
//
|
||||
// CUTLASS includes needed for single-precision GEMM kernel
|
||||
//
|
||||
|
||||
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
|
||||
|
||||
#include <cutlass/gemm/device/gemm.h>
|
||||
|
||||
/// Define a CUTLASS GEMM template and launch a GEMM kernel.
|
||||
cudaError_t cutlass_sgemm_nn(
|
||||
int M,
|
||||
int N,
|
||||
int K,
|
||||
float alpha,
|
||||
float const *A,
|
||||
int lda,
|
||||
float const *B,
|
||||
int ldb,
|
||||
float beta,
|
||||
float *C,
|
||||
int ldc) {
|
||||
|
||||
// Define type definition for single-precision CUTLASS GEMM with column-major
|
||||
// input matrices and 128x128x8 threadblock tile size (chosen by default).
|
||||
//
|
||||
// To keep the interface manageable, several helpers are defined for plausible compositions
|
||||
// including the following example for single-precision GEMM. Typical values are used as
|
||||
// default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
|
||||
//
|
||||
// To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`
|
||||
|
||||
using ColumnMajor = cutlass::layout::ColumnMajor;
|
||||
|
||||
using CutlassGemm = cutlass::gemm::device::Gemm<float, // Data-type of A matrix
|
||||
ColumnMajor, // Layout of A matrix
|
||||
float, // Data-type of B matrix
|
||||
ColumnMajor, // Layout of B matrix
|
||||
float, // Data-type of C matrix
|
||||
ColumnMajor>; // Layout of C matrix
|
||||
|
||||
// Define a CUTLASS GEMM type
|
||||
|
||||
CutlassGemm gemm_operator;
|
||||
|
||||
// Construct the CUTLASS GEMM arguments object.
|
||||
//
|
||||
// One of CUTLASS's design patterns is to define gemm argument objects that are constructible
|
||||
// in host code and passed to kernels by value. These may include pointers, strides, scalars,
|
||||
// and other arguments needed by Gemm and its components.
|
||||
//
|
||||
// The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
|
||||
// arguments to kernels and (2.) minimized initialization overhead on kernel entry.
|
||||
//
|
||||
|
||||
CutlassGemm::Arguments args({M , N, K}, // Gemm Problem dimensions
|
||||
{A, lda}, // Tensor-ref for source matrix A
|
||||
{B, ldb}, // Tensor-ref for source matrix B
|
||||
{C, ldc}, // Tensor-ref for source matrix C
|
||||
{C, ldc}, // Tensor-ref for destination matrix D (may be different memory than source C matrix)
|
||||
{alpha, beta}); // Scalars used in the Epilogue
|
||||
|
||||
//
|
||||
// Launch the CUTLASS GEMM kernel.
|
||||
//
|
||||
|
||||
cutlass::Status status = gemm_operator(args);
|
||||
|
||||
//
|
||||
// Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
|
||||
//
|
||||
|
||||
if (status != cutlass::Status::kSuccess) {
|
||||
return cudaErrorUnknown;
|
||||
}
|
||||
|
||||
// Return success, if no errors were encountered.
|
||||
|
||||
return cudaSuccess;
|
||||
}
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
For a code example showing how to define
|
||||
a GEMM kernel using CUTLASS, please refer to
|
||||
[the quickstart guide](./quickstart.md).
|
||||
The [`examples` directory](../../examples)
|
||||
has a variety of examples.
|
||||
|
||||
# Copyright
|
||||
|
||||
|
||||
@ -219,6 +219,21 @@ which has to happen at the end among the participating warps.
|
||||
This is because each warp computes using only a "slice" of CtaTileK,
|
||||
so each warp only has a partial sum before the reduction.
|
||||
|
||||
### Warp Specialization
|
||||
|
||||
Starting with Hopper, CUTLASS 3.0 incorporates the concept of [Warp Specialization](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#spatial-partitioning-also-known-as-warp-specialization)
|
||||
as part of the kernel design. A thread block is partitioned into two sets of warps, [*producer* warp group](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [*consumer* warp group](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp). The *producer* warp group loads data from global memory into shared memory buffers using the new [Tensor Memory Accelerator (TMA)](https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/).
|
||||
|
||||
[*Producer* warp group (DMA)](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) waits for the shared memory buffers to be signaled as [empty](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) by the *consumer* warp group using the newly added **Async Pipeline class** ([refer](/media/docs/pipeline.md)). Once the data is written into shared memory, the TMA also updates the barrier associated with that stage to notify affected threads that the buffer has been [filled](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp). The [*Consumer* warp group (MMA)](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp), on the other hand, waits for the *producer* warp group to signal that the buffer is [filled](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) and then launches tensor core MMA operations. Finally, the *consumer* warp group [releases](/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) the buffers for the next set of TMA loads to happen.
|
||||
|
||||
**Warp-Specialized Persistent kernel design**
|
||||
|
||||
Another flavor of Warp-Specialized kernel design introduced starting with Hopper is the [*Warp-Specialized Persistent*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp) kernel. As in the Warp-Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the persistent design. The distinctive features of the Warp-Specialized Persistent kernel are the following:
|
||||
* Persistent thread blocks are launched to occupy as many SMs as specified in the [KernelHardwareInfo](include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads, which are typical of all kernels.
|
||||
* Presence of two *consumer* warp groups, which allows the *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization.
|
||||
|
||||
Each *consumer* warp group is assigned a different output tile. The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order. Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks is managed using the new [*Tile Scheduler*](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to the launched thread blocks.
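
As a hedged illustration of how the SM count consumed by the persistent scheduler can be supplied, the sketch below uses `cutlass::KernelHardwareInfo` from the header referenced above; the field and helper names are believed to match that header but should be treated as assumptions:

```c++
#include "cutlass/kernel_hardware_info.hpp"

// Query the number of SMs on device 0 so the persistent tile scheduler can
// size its launch accordingly; the resulting struct is passed through the
// kernel's arguments.
cutlass::KernelHardwareInfo make_hw_info() {
  cutlass::KernelHardwareInfo hw_info;
  hw_info.device_id = 0;
  hw_info.sm_count  =
      cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
  return hw_info;
}
```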
|
||||
|
||||
# Resources
|
||||
|
||||
The following additional resources describe design and implementation details of GEMMs
|
||||
|
||||
@ -4,12 +4,15 @@
|
||||
|
||||
# Functionality
|
||||
|
||||
Note: CUTLASS-3 requires users to use CUDA 11.4 or newer, and SM70 or newer, for the target toolkit and architecture, respectively.
|
||||
Please refer to the [Compatibility](/README.md#Compatibility) section for more details.
|
||||
|
||||
- N - Column Major Matrix
|
||||
- T - Row Major matrix
|
||||
- {N,T} x {N,T} - All combinations, i.e. NN, NT, TN, TT
|
||||
- {N,T} x {N,T} - All combinations, i.e., NN, NT, TN, TT
|
||||
- [NHWC](/include/cutlass/layout/tensor.h#L63-206) - 4 dimension tensor used for convolution
|
||||
- [NCxHWx](/include/cutlass/layout/tensor.h#L290-395) - Interleaved 4 dimension tensor used for convolution
|
||||
- f - float point
|
||||
- f - floating point
|
||||
- s - signed int
|
||||
- b - bit
|
||||
- cf - complex float
|
||||
@ -22,42 +25,55 @@
|
||||
|
||||
## Device-level GEMM
|
||||
|
||||
The following table summarizes device-level GEMM kernels in CUTLASS, organized by opcode class, data type, and layout.
|
||||
The following tables summarize device-level GEMM kernels in CUTLASS, organized by opcode class, data type, and layout.
|
||||
Hyperlinks to relevant unit tests demonstrate how specific template instances may be defined.
|
||||
|
||||
### CUTLASS 3.x Kernels
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------------|------------------|
|
||||
| **Simt** | 50,60,61,70,75 | 9.2+ | `f32 * f32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_sgemm_nt_sm50.cu) |
|
||||
| **Simt** | 50,60,61,70,75 | 9.2+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_dgemm_nt_sm50.cu) |
|
||||
| **Simt** | 60,61,70,75 | 9.2+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_hgemm_nt_sm50.cu) |
|
||||
| **Simt** | 61,70,75 | 9.2+ | `s8 * s8 + s32 => {s32,s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_igemm_nt_sm50.cu) |
|
||||
| **WmmaTensorOp** | 70 | 9.2+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f16_sm70.cu) |
|
||||
| **WmmaTensorOp** | 70 | 9.2+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f32_sm70.cu) |
|
||||
| **WmmaTensorOp** | 75 | 10.0+ | `s8 * s8 + s32 => {s32, s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s8t_wmma_tensor_op_s32_sm72.cu) |
|
||||
| **WmmaTensorOp** | 75 | 10.0+ | `s4 * s4 + s32 => {s32, s4}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s4t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **WmmaTensorOp** | 75 | 10.0+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_b1t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 70 | 10.1+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f16_sm70.cu) |
|
||||
| **TensorOp** | 70 | 10.1+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `bf16 * bf16 + f32 => {bf16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_bf16n_bf16t_bf16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf32n_cf32t_cf32t_tensor_op_tf32_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `cf64 * cf64 + cf64 => cf64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_sm80.cu), [Gaussian 3m](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_gaussian_sm80.cu) |
|
||||
| **SpTensorOp** | 80 | 11.1+ | `f16 * f16 + f32 => {f16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80 | 11.1+ | `bf16 * bf16 + f32 => {bf16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80 | 11.1+ | `tf32 * tf32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80 | 11.1+ | `s8 * s8 + s32 => {s8, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80 | 11.1+ | `s4 * s4 + s32 => {s4, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `f16 * f16 + { f16, f32 } => { f16, f32 }` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_cluster_warpspecialized.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `bf16 * bf16 + { f16, f32 } => { bf16, f32 }`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `{f32, tf32} * {f32, tf32} + f32 => f32`| { T } x { N } => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_f32_f32_f32_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_s8_s8_s8_tensor_op_s32.cu) |
|
||||
|
||||
|
||||
### CUTLASS 2.x Kernels
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------------|------------------|
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_sgemm_nt_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_dgemm_nt_sm50.cu) |
|
||||
| **Simt** | 60+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_hgemm_nt_sm50.cu) |
|
||||
| **Simt** | 61+ | 11.4+ | `s8 * s8 + s32 => {s32,s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_igemm_nt_sm50.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f16_sm70.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f32_sm70.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s8t_wmma_tensor_op_s32_sm72.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s4t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_b1t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f16_sm70.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_bf16n_bf16t_bf16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf32n_cf32t_cf32t_tensor_op_tf32_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf64 * cf64 + cf64 => cf64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_sm80.cu), [Gaussian 3m](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_gaussian_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s8, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s4, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **TensorOp** | 90+ | 11.8+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) |
|
||||
|
||||
|
||||
## Device-level Implicit GEMM convolution
|
||||
@ -68,19 +84,19 @@ One can find and/or create equivalent dgrad and wgrad convolutional operators.
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------|------------------|
|
||||
| **Simt** | 50,60,61,70,75 | 9.2+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm50.cu) |
|
||||
| **Simt** | 50,60,61,70,75 | 9.2+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
|
||||
| **TensorOp** | 70 | 10.1+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **Simt** | 80 | 11.0+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
|
||||
| **Simt** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
|
||||
|
||||
|
||||
|
||||
media/docs/gemm_api_3x.md (new file, 701 lines)
@ -0,0 +1,701 @@
|
||||

|
||||
|
||||
[README](/README.md#documentation) > **CUTLASS 3.0 GEMM API**
|
||||
|
||||
# CUTLASS 3.0 GEMM API
|
||||
|
||||
CUTLASS presents a uniform programming model
|
||||
for matrix multiply-accumulate (MMA) operations
|
||||
at different levels of the GPU system hierarchy.
|
||||
CUTLASS 3.0 has GEMM APIs corresponding to the following levels
|
||||
in order from highest to lowest.
|
||||
|
||||
1. Device
|
||||
2. Kernel
|
||||
3. Collective
|
||||
4. Tiled MMA and Copy
|
||||
5. Atom
|
||||
|
||||
This document will cover the first three levels in detail:
|
||||
Device, Kernel, and Collective.
|
||||
It also briefly discusses the Tiled MMA/Copy and Atom level,
|
||||
and then refers readers to CuTe's tutorial for more information.
|
||||
|
||||
# CUTLASS GEMM Model
|
||||
|
||||
CUTLASS implements algorithms that express
|
||||
the classical "triply nested loop" GEMM algorithm
|
||||
with a tiled structure mirroring the above hierarchy.
|
||||
|
||||
The following pseudocode describes the model for a GEMM kernel
|
||||
targeting a warp-synchronous matrix multiply instruction like `mma.sync.`
|
||||
The entire operation is referred to as "Gemm,"
|
||||
as it is assumed that an epilogue operation
|
||||
performs the general matrix update similar to BLAS.
|
||||
This is pseudocode and is only meant to illustrate which parts of the layers
|
||||
correspond to the inner or outer loops of the GEMM.
|
||||
|
||||
```c++
|
||||
// cutlass::gemm::kernel::GemmUniversal: ClusterTileM and ClusterTileN loops
|
||||
// are either rasterized by the hardware or scheduled by the kernel in persistent kernels.
|
||||
// Parallelism over thread block clusters
|
||||
for (int cluster_m = 0; cluster_m < GemmM; cluster_m += ClusterTileM) {
|
||||
for (int cluster_n = 0; cluster_n < GemmN; cluster_n += ClusterTileN) {
|
||||
|
||||
// cutlass::gemm::collective::CollectiveMma: mainloop that iterates over all k-tiles
|
||||
// No loop unrolling is performed at this stage
|
||||
for (int k_tile = 0; k_tile < size<2>(gmem_tensor_A); k_tile++) {
|
||||
|
||||
// loops inside cute::gemm(tiled_mma, a, b, c); Dispatch 5: (V,M,K) x (V,N,K) => (V,M,N)
|
||||
// TiledMma uses the hardware instruction provided through its Mma_Atom
|
||||
// TiledMma's atom layout, value layout, and permutations define the iteration order
|
||||
for (int tiled_mma_k = 0; tiled_mma_k < size<2>(A); tiled_mma_k++) {
|
||||
for (int tiled_mma_m = 0; tiled_mma_m < size<1>(A); tiled_mma_m++) {
|
||||
for (int tiled_mma_n = 0; tiled_mma_n < size<1>(B); tiled_mma_n++) {
|
||||
|
||||
// TiledMma's vector mode dispatches to the underlying instruction.
|
||||
mma.call(d, a, b, c);
|
||||
} // tiled_mma_n
|
||||
} // tiled_mma_m
|
||||
} // tiled_mma_k
|
||||
} // k_tile mainloop
|
||||
} // cluster_n
} // cluster_m
|
||||
```
|
||||
|
||||
The first three nested `for` loops
|
||||
correspond to parallelism over thread block clusters.
|
||||
The code does not actually express them as explicit `for` loops.
|
||||
Instead, the parallelization scheme over tiles
|
||||
is implied by CUDA grid launch semantics.
|
||||
However, for persistent kernels,
|
||||
these three loops are expressed in the source code
|
||||
as a single `while` loop that queries the
|
||||
[work tile scheduler](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp)
|
||||
for problem tiles on which to compute.
|
||||
|
||||
Inside the three nested `for` loops,
|
||||
one finds code that pulls matrix tiles
|
||||
from global memory into more "local" memory
|
||||
(like shared memory or registers)
|
||||
and computes MMAs.
|
||||
These tiled copy and tiled mma iterations are generally
|
||||
fully static and get fully unrolled.
|
||||
|
||||
# CUTLASS GEMM Components
|
||||
|
||||
CUTLASS expresses the above loop nest
|
||||
with the following components which are specialized for
|
||||
data type, layout, and math instruction.
|
||||
|
||||
| API level | API Class and/or function names |
|
||||
| --- | --- |
|
||||
| Device | `cutlass::gemm::device::GemmUniversalAdapter` |
|
||||
| Kernel | `cutlass::gemm::kernel::GemmUniversal` |
|
||||
| Collective | `cutlass::gemm::collective::CollectiveMma` <br /> `cutlass::epilogue::collective::DefaultEpilogue` <br /> `cutlass::epilogue::collective::Epilogue` <br /> |
|
||||
| Tiled (MMA and Copy) | `cute::TiledMma` and `cute::TiledCopy` <br /> `cute::gemm()` and `cute::copy()` |
|
||||
| Atom | `cute::Mma_Atom` and `cute::Copy_Atom` |
|
||||
|
||||
In CUTLASS 3.0, we assemble kernels
|
||||
by first composing a collective mainloop and collective epilogue
|
||||
together at the kernel layer,
|
||||
and then wrapping them with a host-side adapter
|
||||
to form a GEMM handle to that kernel.
|
||||
|
||||
The following sections describe these components
in the order a user should instantiate them
to assemble a kernel. This order is
|
||||
|
||||
1. assemble the required collective mainloop and epilogues,
|
||||
|
||||
2. compose them together to build a kernel type, and
|
||||
|
||||
3. wrap up the kernel with a device layer adapter.
|
||||
|
||||
This order is also reflected in the [CUTLASS 3.0 Hopper kernel examples](/examples/48_hopper_warp_specialized_gemm) as seen in the excerpt below.
|
||||
|
||||
```c++
|
||||
// Step 1: Generate the required collective layer mainloop specialization
|
||||
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
|
||||
ArchTag, OperatorClass,
|
||||
ElementA, LayoutA, AlignmentA,
|
||||
ElementB, LayoutB, AlignmentB,
|
||||
ElementAccumulator,
|
||||
TilesShape, ClusterShape,
|
||||
cutlass::gemm::collective::StageCountAuto,
|
||||
cutlass::gemm::collective::KernelScheduleAuto
|
||||
>::CollectiveOp;
|
||||
|
||||
// Step 2: Specify the collective layer epilogue type
|
||||
using CollectiveEpilogue = cutlass::epilogue::collective::DefaultEpilogue<
|
||||
cutlass::gemm::TagToStrideC_t<LayoutC>,
|
||||
cutlass::gemm::TagToStrideC_t<LayoutC>,
|
||||
cutlass::epilogue::thread::LinearCombination<ElementC, 1, ElementAccumulator, ElementAccumulator>>;
|
||||
|
||||
// Step 3: Compose the mainloop and epilogue together at the kernel layer
|
||||
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
|
||||
cute::Shape<int,int,int,int>, // ProblemShape [M,N,K,L]
|
||||
CollectiveMainloop,
|
||||
CollectiveEpilogue
|
||||
>;
|
||||
|
||||
// Step 4: Wrap up the kernel::GemmUniversal kernel class
|
||||
// with the device adapter to obtain a host-side handle to the kernel
|
||||
using GemmHandle = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
|
||||
```
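
Once assembled, the handle behaves like other CUTLASS device-level operators. The following is a hedged usage sketch only; the exact `Arguments` fields depend on the collectives chosen above, so the braced initializer is left as a placeholder rather than a drop-in snippet:

```c++
// Hedged usage sketch for the GemmHandle assembled above. The Arguments
// contents are collective-specific and intentionally left as a placeholder.
typename GemmHandle::Arguments args = { /* mode, problem shape {M,N,K,L}, mainloop args, epilogue args */ };

GemmHandle gemm_op;
size_t workspace_size = GemmHandle::get_workspace_size(args);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size); // from cutlass/util/device_memory.h

cutlass::Status status = gemm_op.can_implement(args);
if (status == cutlass::Status::kSuccess) {
  status = gemm_op.initialize(args, workspace.get());
}
if (status == cutlass::Status::kSuccess) {
  status = gemm_op.run();
}
```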
|
||||
|
||||
Towards the end, we also briefly cover CuTe's tiled mma and copy as well as the atom layer APIs,
|
||||
before redirecting users to CuTe-specific documentation for further details.
|
||||
|
||||
## Collective API
|
||||
|
||||
A Collective is "the largest collection of threads
|
||||
onto which mma atoms and copy atoms are tiled."
|
||||
That is, it is the largest number of threads in a grid
|
||||
that can cooperate by leveraging hardware features
|
||||
for accelerated communication and synchronization.
|
||||
These hardware features include
|
||||
|
||||
* asynchronous array copy
|
||||
(e.g., from global memory to shared memory);
|
||||
|
||||
* MMA instructions
|
||||
for small tiles that live in shared memory;
|
||||
|
||||
* synchronization operations for clusters,
|
||||
thread blocks, and/or warps; and/or
|
||||
|
||||
* hardware acceleration (such as barriers)
|
||||
for ensuring that data dependencies
|
||||
between asynchronous operations are met.
|
||||
|
||||
A Collective uses the `TiledMma` and `TiledCopy` API (see below)
|
||||
to access operations that copy and perform MMA on tiles.
|
||||
|
||||
Different units of parallelism
|
||||
(e.g., threads, warps, or thread blocks)
|
||||
in a Collective might have different roles.
|
||||
For example, in "warp-specialized" algorithms,
|
||||
some warps may be responsible for copying data,
|
||||
while others may be responsible for computation.
|
||||
Nevertheless, the different units of parallelism
|
||||
still need to share data and coordinate access
|
||||
to the shared data. For example,
|
||||
the producer warps in a warp-specialized algorithm
|
||||
that copy input matrix tiles into shared memory
|
||||
need to let the consumer MMA warp(s) know
|
||||
that their MMA inputs are ready.
|
||||
We contrast this with the `kernel::` layer API,
|
||||
which schedules the collectives over *independent* tiles in the grid.
|
||||
|
||||
The Collective API includes both the "mainloop"
|
||||
of matrix multiply-accumulate, and the epilogue.
|
||||
This API is the composition point for optimizations
|
||||
such as mainloop fusions and epilogue fusions.
|
||||
It is responsible for implementing
|
||||
the `k_tile` loop in the above triply nested loop pseudocode.
|
||||
|
||||
### Collective Mainloops
|
||||
|
||||
The `cutlass::gemm::collective::CollectiveMma` class
|
||||
is the primary interface to the collective
|
||||
matrix multiply-accumulate (MMA) mainloops.
|
||||
"Mainloop" refers to the "main loop" over tiles --
|
||||
the "cluster tile k" loop in the pseudocode
|
||||
near the top of this document.
|
||||
Any looping over multiple tiles that
|
||||
the algorithm might need to do would happen here.
|
||||
|
||||
The `CollectiveMma` class is declared in the header
|
||||
[cutlass/gemm/collective/collective_mma.hpp](/include/cutlass/gemm/collective/collective_mma.hpp).
|
||||
|
||||
```c++
|
||||
namespace cutlass::gemm::collective {
|
||||
|
||||
template <
|
||||
class DispatchPolicy,
|
||||
class TileShape,
|
||||
class ElementA,
|
||||
class StrideA,
|
||||
class ElementB,
|
||||
class StrideB,
|
||||
class TiledMma,
|
||||
class GmemTiledCopyA,
|
||||
class SmemLayoutAtomA,
|
||||
class SmemCopyAtomA,
|
||||
class TransformA,
|
||||
class GmemTiledCopyB,
|
||||
class SmemLayoutAtomB,
|
||||
class SmemCopyAtomB,
|
||||
class TransformB
|
||||
>
|
||||
struct CollectiveMma {
|
||||
static_assert(sizeof(ElementA) == 0, "Could not find a mainloop specialization.");
|
||||
};
|
||||
|
||||
} // namespace cutlass::gemm::collective
|
||||
```
|
||||
|
||||
- `DispatchPolicy` is the most important type for a collective, and is
|
||||
[covered in more detail below](#collective-dispatch-policies).
|
||||
|
||||
- `StrideA` and `StrideB` are instances of type `cute::Stride` that represent the global memory layout of A and B tensors. These strides are required to be rank-3, representing the modes `[outer, inner, batch]`. Each of the 3 ranks can be a multi-modal hierarchical stride; this would apply if implementing a tensor contraction.
|
||||
|
||||
- `TiledMma` is an instance of `cute::TiledMma`.
|
||||
|
||||
- `GmemTiledCopyA` and `GmemTiledCopyB` are instances of `cute::TiledCopy` types. Both tiled operation types are [covered in more detail below](#tiled-mma-and-copy).
|
||||
|
||||
- `SmemLayoutAtomA` and `SmemLayoutAtomB` are instances of type `cute::Layout` and represent the smallest
|
||||
layout that will get tiled over the entire collective's shared memory. This layout does _not_ include the
|
||||
pipeline mode, and therefore, both are expected to be rank 2 layouts of shape [`outer`, `inner`].
|
||||
|
||||
- `SmemCopyAtomA` and `SmemCopyAtomB` are `Copy_Atom`s to be used for moving data from shared memory
|
||||
into register memory.
|
||||
|
||||
Notice that CUTLASS 3.0 mainloops do not accept a dedicated accumulator element type.
|
||||
We obtain the accumulator type from the `typename TiledMma::ValTypeC`. Note also that
|
||||
the top-level API's `ElementA` and `ElementB` can differ from those of the MMA-facing
|
||||
`typename TiledMma::ValTypeA` and `typename TiledMma::ValTypeB`, allowing TMA or user
|
||||
supplied transform operations to perform type conversions.
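
For example, a sketch restating the aliases named above (no additional API surface is implied):

```c++
// Inside a mainloop implementation, the element types seen by the MMA and the
// accumulator type are recovered from the TiledMma rather than passed separately.
using MmaElementA        = typename TiledMma::ValTypeA;  // may differ from ElementA (e.g., after conversion)
using MmaElementB        = typename TiledMma::ValTypeB;  // may differ from ElementB
using ElementAccumulator = typename TiledMma::ValTypeC;  // accumulator element type
```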
|
||||
|
||||
### Collective Dispatch Policies
|
||||
|
||||
`CollectiveMma` implementations are not generic.
|
||||
Instead, they must be specialized for each algorithm and GPU architecture.
|
||||
Users can dispatch to a `CollectiveMma` specialization
|
||||
by picking template arguments matching that specialization.
|
||||
CUTLASS 3.0 adopts a tag-based dispatch policy type to specialize
|
||||
mainloop implementations and add tuning knobs to them.
|
||||
|
||||
Below is an example of one of the dispatch policies that is used to dispatch to a Hopper TMA
|
||||
warp-specialized mainloop implementation:
|
||||
|
||||
```c++
|
||||
// n-buffer in smem (Hopper TMA),
|
||||
// pipelined with Hopper GMMA and TMA,
|
||||
// warp-specialized dynamic schedule
|
||||
template<
|
||||
int Stages_,
|
||||
class ClusterShape_ = Shape<_1,_1,_1>,
|
||||
class KernelSchedule = KernelTmaWarpSpecialized
|
||||
>
|
||||
struct MainloopSm90TmaGmmaWarpSpecialized {
|
||||
constexpr static int Stages = Stages_;
|
||||
using ClusterShape = ClusterShape_;
|
||||
using ArchTag = arch::Sm90;
|
||||
using Schedule = KernelSchedule;
|
||||
};
|
||||
```
|
||||
|
||||
The `Stages_` template parameter lets the user freely vary the number of pipeline stages,
|
||||
while the `ClusterShape_` type allows for parameterization over the shape of the threadblock
|
||||
cluster over which TMA multicast will take place.
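
As a concrete illustration (the parameter choices below are assumptions, not values prescribed by this document), a four-stage pipeline with a 2x1x1 cluster and the warp-specialized schedule could be spelled as:

```c++
// Hedged example of instantiating the dispatch policy shown above:
// 4 shared memory stages, a 2x1x1 thread block cluster, warp-specialized schedule.
using DispatchPolicy = cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<
    4,                                          // Stages_
    cute::Shape<cute::_2, cute::_1, cute::_1>,  // ClusterShape_
    cutlass::gemm::KernelTmaWarpSpecialized     // KernelSchedule
>;
```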
|
||||
|
||||
The collective dispatch policy is also the primary point of composing various kernel schedules
|
||||
freely with any mainloop. Each mainloop policy either prescribes a `Schedule` with which
|
||||
it needs to be run, or exposes a template API that lets the user pick a subset of the following schedules:
|
||||
|
||||
```c++
|
||||
struct KernelMultistage { };
|
||||
struct KernelTma { };
|
||||
struct KernelTmaWarpSpecialized { };
|
||||
struct KernelTmaWarpSpecializedPersistent { };
|
||||
```
|
||||
|
||||
- A single kernel schedule can support multiple mainloop implementations. For example,
|
||||
`KernelMultistage` can be composed with many different mainloop implementations across GPU
|
||||
architectures such as `MainloopSm70TwoStage`, `MainloopSm80CpAsyncUnpredicated`, `MainloopSm90CpAsyncGmma`, and many more.
|
||||
|
||||
- A single mainloop can be composed with multiple
|
||||
possible kernel schedules. For example, the `MainloopSm90TmaGmmaWarpSpecialized` can be
|
||||
composed with either the `KernelTmaWarpSpecialized` or `KernelTmaWarpSpecializedPersistent`
|
||||
kernel schedules.
|
||||
|
||||
As [discussed in the CUTLASS 3.0 design documentation](cutlass_3x_design.md), adopting tag
|
||||
dispatch policies for our core vocabulary types allows us to maintain a single type name for
|
||||
all operations that conceptually belong to the same class. This design has the following benefits.
|
||||
|
||||
- It *avoids code duplication* in cases where mainloops can be composed with multiple kernels or vice versa.
|
||||
- It *makes writing generic code easier*, as the primary type name `CollectiveMma` does not change across any implementation.
|
||||
- It *provides a clear, singular extension point* for users to plug in new, custom mainloops implementations specialized on their own dispatch policies.
|
||||
|
||||
### Collective Builder for `CollectiveMma`s
|
||||
|
||||
The primary `CollectiveMma` is intended to be an expert user interface that allows full control over
|
||||
all the properties of the collective's GPU micro-kernel. However, often a user just wants an
|
||||
off-the-shelf GEMM mainloop implementation parameterized on simple configuration parameters. CUTLASS 3.0
|
||||
provides [`cutlass::gemm::collective::CollectiveBuilder`](include/cutlass/gemm/collective/collective_builder.hpp) for such scenarios.
|
||||
|
||||
```c++
|
||||
namespace cutlass::gemm::collective {
|
||||
template <
|
||||
class ArchTag,
|
||||
class OpClass,
|
||||
class ElementA,
|
||||
class GmemLayoutA,
|
||||
int AlignmentA,
|
||||
class ElementB,
|
||||
class GmemLayoutB,
|
||||
int AlignmentB,
|
||||
class ElementAccumulator,
|
||||
class TileShape_MNK,
|
||||
class ClusterShape_MNK,
|
||||
class StageCountType,
|
||||
class KernelScheduleType,
|
||||
class Enable = void
|
||||
>
|
||||
struct CollectiveBuilder {
|
||||
static_assert(sizeof(ElementA) == 0, "Could not build a collective for given parameters.");
|
||||
};
|
||||
} // namespace cutlass::gemm::collective
|
||||
```
|
||||
|
||||
`CollectiveBuilder` accepts CUTLASS 2.x equivalent input template arguments, and attempts to build
|
||||
the best performing `CollectiveMma` from the given parameters.
|
||||
|
||||
- `ArchTag` is one of the SM architecture tags from `cutlass::arch::Sm*`.
- `OpClass` is one of the operator class tags from `cutlass::arch::OpClass*` (for example, `cutlass::arch::OpClassTensorOp`).
- `ElementA` and `ElementB` are the logical value types of the A and B tensors, respectively.
|
||||
- `ElementAccumulator` is the accumulator type to be used in the instruction.
|
||||
- `GmemLayoutA` and `GmemLayoutB` are CUTLASS 2.x layout tags, `layout::RowMajor` or `layout::ColumnMajor`.
|
||||
- `AlignmentA` and `AlignmentB` are global memory alignments of A and B tensors in terms of element count.
|
||||
- `TileShape_MNK` is an instance of `cute::Shape` that is rank-3, representing the MxNxK collective tile shape.
|
||||
- `ClusterShape_MNK` is an instance of `cute::Shape` that is rank-3, representing the MxNxK threadblock cluster tile shape.
|
||||
- `StageCountType` is either `collective::StageCountAuto` or an instance of `collective::StageCount<N>`.
|
||||
- `KernelScheduleType` is either `collective::KernelScheduleAuto` or one of the specific kernel schedule tags discussed in the [dispatch policy section](#collective-dispatch-policies) above.
|
||||
|
||||
`StageCountAuto` allows the collective builder to compute the size of a single stage in shared memory
and maximize the shared memory usage assuming 1 threadblock / multiprocessor occupancy.
|
||||
|
||||
`KernelScheduleAuto` allows the collective builder to pick the best kernel schedule available for the
|
||||
given set of parameters, or lets the user override this with a specific kernel schedule type.
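
For instance, here is a hedged sketch of overriding both automatic choices with an explicit stage count and kernel schedule; the element types, alignments, and tile shapes below are illustrative assumptions:

```c++
// Same builder pattern as the earlier excerpt, but with explicit StageCount and
// kernel schedule tags instead of the Auto tags. All concrete choices here are
// illustrative assumptions.
using CollectiveOp = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,  // A: element, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,  // B: element, layout, alignment
    float,                                             // accumulator
    cute::Shape<cute::_128, cute::_128, cute::_64>,    // TileShape_MNK
    cute::Shape<cute::_1,   cute::_2,   cute::_1>,     // ClusterShape_MNK
    cutlass::gemm::collective::StageCount<3>,          // explicit stage count
    cutlass::gemm::KernelTmaWarpSpecialized            // explicit kernel schedule
  >::CollectiveOp;
```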
|
||||
|
||||
Note that collective builders are still in beta, and their functionality
|
||||
does not map onto the full design space that the primary expert `CollectiveMma` API
|
||||
allows for. We expect their supported mainloop types to expand in future releases, but
|
||||
with 3.0, only SM90 tensorop kernels are supported through the builder API. The builder API
|
||||
may also change in the future as we adopt user feedback.
|
||||
|
||||
If the builder is able to provide a collective mainloop type for the given set of parameters,
|
||||
it will be aliased within the builder as `CollectiveOp`. For more information on how to
|
||||
parameterize kernels conveniently with the collective builder, please see example [49_hopper_gemm_schedules_with_collective_builder](49_hopper_gemm_schedules_with_collective_builder).
|
||||
|
||||
### Epilogue
|
||||
|
||||
The collective epilogue implements element-wise operations
|
||||
involving the output matrix. Users can provide a custom
|
||||
epilogue, or use one of the standard epilogues.
|
||||
These live in the directory
|
||||
[include/cutlass/epilogue/collective/](../../include/cutlass/epilogue/collective/),
|
||||
and include classes like
|
||||
`cutlass::epilogue::collective::DefaultEpilogue`
|
||||
and
|
||||
`cutlass::epilogue::collective::Epilogue`.
|
||||
CUTLASS's provided collective epilogues
|
||||
do not live under `include/cutlass/gemm`
|
||||
or in the `cutlass::gemm` namespace,
|
||||
because they can be used for computations
|
||||
other than GEMM.
|
||||
|
||||
## Kernel API
|
||||
|
||||
The kernel is "a collection of all clusters in the grid."
|
||||
The kernel layer schedules have four main responsibilities.
|
||||
|
||||
- Ordering the execution of collectives within the kernel, performing any synchronization between them that may be necessary
- Marshalling the threads of warp-specialized schedules into their respective roles
|
||||
- Performing any necessary grid swizzling logic
|
||||
- Tiling the input tensors with the threadblock cluster value tile before invoking the collectives on them
|
||||
|
||||
The Kernel API is the entry point for a grid of thread blocks
|
||||
that may or may not be organized in a cluster.
|
||||
It is the composition point for fusing back-to-back GEMMs,
|
||||
epilogues, and/or other operations.
|
||||
|
||||
The entry point API for CUTLASS 3.0 kernel is the class
|
||||
`cutlass::gemm::kernel::GemmUniversal`, found in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](../../include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
`GemmUniversal` is a stateless universal device kernel
|
||||
that implements GEMM as the composition of two parts:
|
||||
|
||||
* a collective mainloop, and
|
||||
* a collective epilogue
|
||||
|
||||
```cpp
|
||||
namespace cutlass::gemm::kernel {
|
||||
/*
|
||||
* Stateless universal device GEMM kernel type that treats GEMM as
|
||||
* a composition of a collective mainloop and a collective epilogue.
|
||||
*
|
||||
* Supports both the 2.x and 3.x APIs based on whether the first type is
|
||||
* a cute::tuple<> or not.
|
||||
* 2.x API implementation: cutlass/gemm/kernel/gemm_universal.h
|
||||
* 3.x API implementation: cutlass/gemm/kernel/gemm_*.hpp
|
||||
*
|
||||
* In the following declaration, the name preceding the 'Or' refers to
|
||||
* 3.x API type argument order, and the name succeeding the 'Or' refers to
|
||||
* 2.x API type argument order. Template arguments without two names
|
||||
* belong to the 3.x API only.
|
||||
**/
|
||||
template <
|
||||
class ProblemShapeOrThreadblockMma_, // (m, n, k) or (m, n, k, l)
|
||||
class CollectiveMainloopOrEpilogue_,
|
||||
class CollectiveEpilogueOrThreadblockSwizzle_,
|
||||
class GridSwizzle_ = void,
|
||||
class Enable = void
|
||||
>
|
||||
class GemmUniversal;
|
||||
} // namespace cutlass::gemm::kernel
|
||||
```
|
||||
|
||||
*Stateless* means that the caller --
|
||||
for example, the Device API described above --
|
||||
manages the kernel's state.
|
||||
The kernel just takes input and output parameters (`Params`).
|
||||
|
||||
*Universal* means that `GemmUniversal` works
|
||||
for both CUTLASS 3.0 and 2.x interfaces
|
||||
and across a broad range of kernel schedules.
|
||||
If `GemmUniversal`'s first template argument is a `cute::Shape`,
|
||||
then `GemmUniversal` assumes that the remaining template arguments
|
||||
implement the 3.0 APIs. Otherwise, `GemmUniversal` assumes that
|
||||
the remaining template arguments implement the 2.x APIs.
|
||||
Starting with CUTLASS 3.0, the problem shape has been promoted
|
||||
to a top-level template API for the GEMM kernel.
|
||||
This supports fully static GEMM instantiations
|
||||
where the user expects to know some or all
|
||||
of the problem shapes at compile time
|
||||
in order to extract even more performance.
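
For illustration, a hedged sketch of composing a 3.x kernel from a problem shape, a collective
mainloop, and a collective epilogue might look as follows. `CollectiveMainloop` and
`CollectiveEpilogue` are assumed to have been defined elsewhere (for example, through the
collective builder sketched earlier).

```c++
// A sketch under the assumptions stated above, not a definitive recipe.
using ProblemShape = cute::Shape<int, int, int, int>;    // (M, N, K, L), resolved at runtime

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    ProblemShape,
    CollectiveMainloop,       // e.g., the builder-produced mainloop sketched earlier
    CollectiveEpilogue>;      // a collective epilogue, assumed to be defined elsewhere

// Wrap the kernel in the device-layer handle described in the next section.
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```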
|
||||
|
||||
The *collective mainloop* implements MMA on local tiles.
|
||||
The *collective epilogue* addresses any operations after the MMA,
|
||||
such as applying the `beta * C` part of `C := beta * C + alpha * A * B`.
|
||||
We will explain *collective* in more detail below.
|
||||
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `gemm_*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](../../include/cutlass/gemm/kernel/).
|
||||
Specializations for 2.x APIs can be found in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](../../include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
|
||||
CUTLASS 3.x implements various embodiments of `kernel::GemmUniversal`.
|
||||
Each kernel layer schedule is specialized
|
||||
for a GEMM scheduling algorithm and GPU architecture.
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `include/cutlass/gemm/kernel/{arch_tag}*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](../../include/cutlass/gemm/kernel/).
|
||||
Which specialization to dispatch to is decided through the dispatch policy's `Schedule` type.
|
||||
|
||||
For example, the header file
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp)
|
||||
has a specialization of `kernel::GemmUniversal` for Hopper
|
||||
that uses a warp-specialized mainloop with a persistent scheduling algorithm,
|
||||
while the header file
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp)
|
||||
has a specialization of `GemmUniversal` for Hopper
|
||||
that uses a warp-specialized but non-persistent algorithm.
|
||||
|
||||
To support composition between supported kernel schedules and mainloop dispatch policies without having to
|
||||
duplicate collective mainloop implementations, GEMM kernel layer schedules can be composed with
|
||||
any mainloop that specifies the corresponding kernel schedule as its `Schedule` type in the policy.
|
||||
This is discussed in detail in the [collective dispatch policy section](#collective-dispatch-policies) above.
|
||||
|
||||
```c++
|
||||
// An example of the SM90 KernelMultistage kernel's
|
||||
// specialization logic that allows it to be composed
|
||||
// with many mainloops such as `MainloopSm80CpAsync`
|
||||
// and `MainloopSm70TwoStage`.
|
||||
template <
|
||||
class ProblemShape_,
|
||||
class CollectiveMainloop_,
|
||||
class CollectiveEpilogue_,
|
||||
class GridSwizzle_
|
||||
>
|
||||
class GemmUniversal<
|
||||
ProblemShape_,
|
||||
CollectiveMainloop_,
|
||||
CollectiveEpilogue_,
|
||||
GridSwizzle_,
|
||||
std::enable_if_t<std::is_base_of_v<KernelMultistage, typename CollectiveMainloop_::DispatchPolicy::Schedule>>>
|
||||
```
|
||||
|
||||
## Device API
|
||||
|
||||
The Device API is a universal, kernel-agnostic host interface
|
||||
for kernel launch and managing the lifetime of
|
||||
reusable host-side parameters.
|
||||
|
||||
This API is how users' host-side .cu code
|
||||
invokes CUTLASS's single-GPU GEMM kernels.
|
||||
It serves the same purpose as cuBLAS and behaves similarly.
|
||||
|
||||
The entry point for the Device GEMM API is the class
|
||||
`cutlass::gemm::device::GemmUniversalAdapter`.
|
||||
This class lives in the header file
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
`GemmUniversalAdapter` is a stateful, reusable handle,
|
||||
which is parameterized on the `cutlass::gemm::kernel` type.
|
||||
|
||||
```c++
|
||||
/*!
|
||||
GemmUniversalAdapter is a stateful, reusable GEMM handle built around a kernel
|
||||
of type cutlass::gemm::kernel::*
|
||||
|
||||
It manages the lifetime of the underlying `kernel::Params` struct, and exposes APIs
|
||||
to create it from the host facing arguments. For power users, new static methods
|
||||
are exposed in 3.x APIs that bypass the stateful methods or args->params lowering.
|
||||
|
||||
It supports kernel types that implement both the 2.x and 3.0 APIs,
|
||||
however, this is done by specializing the implementation of GemmUniversalAdapter
|
||||
on the two kernel API types, and thus, GemmUniversalAdapter's behavior might
|
||||
differ between the two specializations.
|
||||
*/
|
||||
template <class GemmKernel_, class Enable = void>
|
||||
class GemmUniversalAdapter;
|
||||
```
|
||||
|
||||
*Stateful* means that the handle instance contains state
|
||||
that the kernel needs to run.
|
||||
This means that the user must initialize the handle first,
|
||||
then use the initialized handle instance to run the kernel.
|
||||
Statefulness also means that the handle can manage the lifetime
|
||||
of the kernel's `Params` -- the parameters of the kernel itself.
|
||||
An important duty of `GemmUniversalAdapter`
|
||||
is to map from the user's `Arguments` --
|
||||
what the user sees as the kernel's parameters --
|
||||
to the `Params` that the kernel actually sees.
|
||||
For power users, the class exposes new static methods
|
||||
in 3.0 APIs that can bypass stateful methods
|
||||
or go directly to `Params` without intermediate `Arguments`.
|
||||
|
||||
*Reusable* means that the handle instance can be used
|
||||
to call the kernel multiple times with different arguments
|
||||
(e.g., different matrices).
|
||||
Reusing the handle may be more efficient than just
|
||||
creating a new handle for each kernel invocation.
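
For illustration, a hedged sketch of typical host-side usage follows. It assumes the `Gemm`
adapter alias from the Kernel API sketch above, and the CUTLASS utilities header
`cutlass/util/device_memory.h` for workspace allocation; the `Arguments` fields are elided
because they depend on the kernel.

```c++
// A sketch under the assumptions stated above.
Gemm gemm_op;
Gemm::Arguments args{/* problem shape, pointers, strides, epilogue arguments */};

size_t workspace_size = Gemm::get_workspace_size(args);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

if (gemm_op.can_implement(args) == cutlass::Status::kSuccess) {
  gemm_op.initialize(args, workspace.get());   // lower Arguments to the kernel's Params
  gemm_op.run();                               // the handle may later be reused with new arguments
}
```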
|
||||
|
||||
*Parameterized on the kernel type* means that
|
||||
the `GemmUniversalAdapter` class' behavior
|
||||
depends on the GEMM kernel type (see the next section).
|
||||
Specifically, `GemmUniversalAdapter` has a template parameter
|
||||
`GemmKernel`, which is the GEMM kernel type.
|
||||
Valid template arguments for `GemmKernel` are
|
||||
|
||||
* `cutlass::gemm::kernel::GemmUniversal`,
|
||||
implementing CUTLASS 3.x API kernels;
|
||||
* `cutlass::gemm::kernel::GemmUniversal`,
|
||||
implementing CUTLASS 2.x API kernels; or
|
||||
* Any valid CUTLASS 2.x `kernel` layer GEMM that
|
||||
was previously composable with the `device::GemmUniversalAdapter`.
|
||||
|
||||
`GemmUniversalAdapter` presents a single
|
||||
host-side interface to both 3.0 and 2.x kernels.
|
||||
CUTLASS accomplishes this by
|
||||
specializing `GemmUniversalAdapter`'s implementation
|
||||
on either the 2.x API implementing kernel layer GEMMs, or on the 3.x API
|
||||
implementing kernel layer GEMMs. The metafunction [`cutlass::gemm::detail::IsCutlass3GemmKernel`](cutlass_3x_backwards_compatibility.md#kernel-api-design-differences)
|
||||
is what `GemmUniversalAdapter` uses to distinguish between 2.x and 3.x kernels.
|
||||
|
||||
`GemmUniversalAdapter` sets up and launches the kernel, using the
|
||||
CUDA extended launch API for threadblock cluster support if required.
|
||||
Note, `GemmUniversalAdapter` does *not* specify the grid shape.
|
||||
The kernel controls the grid shape
|
||||
and other kernel-specific launch parameters.
|
||||
This makes it possible for all 3.0 kernels
|
||||
to use the same kernel launch code,
|
||||
thus factoring out kernel launch from the actual kernel.
|
||||
|
||||
## Tiled MMA and Copy
|
||||
|
||||
A Tiled MMA or Tiled Copy is a tiling of MMA Atoms or Copy Atoms, respectively,
across threads and data, with possible permutations applied to the
resulting tiling. This layer is most analogous to the warp-level
|
||||
tiling of MMA instructions in CUTLASS 2.x. However, it views the tiling
|
||||
from the perspective of all threads participating in the operation
|
||||
and generalizes the concept to copy operations as well. The purpose
|
||||
of this layer is to build composable GPU micro-kernels out of a plethora
|
||||
of hardware accelerated math and data movement operations, each with their
|
||||
unit layouts in threads and data. The tiled MMA and Copy types present
|
||||
all these various hardware accelerated CuTe Atoms with a single, consistent
|
||||
API.
|
||||
|
||||
The resulting tiled operation acts as a single MMA or copy operation
|
||||
that users can invoke in the "inner" loop
|
||||
of the three-nested-loops pseudocode
|
||||
at the top of this document using `cute::gemm()` or `cute::copy()`.
|
||||
|
||||
We call this API "tiled" because it constructs
|
||||
larger operations out of the Atoms provided by CuTe,
|
||||
as if fitting together individual tiles
|
||||
to build a reusable component of a mosaic.
|
||||
For example, CuTe might provide an MMA Atom
|
||||
that users can call on a single warp,
|
||||
for fixed M, N, and K dimensions.
|
||||
CUTLASS can then use CuTe operations like `make_tiled_mma`
|
||||
to turn this Atom into an operation
|
||||
that works on an entire thread block,
|
||||
for larger M, N, and K dimensions.
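
As a hedged sketch, building a thread block-wide tiled MMA from a single Atom with CuTe might
look like this; the SM80 Atom and the 2x2x1 atom layout are illustrative choices, not the only
valid ones.

```c++
using namespace cute;

// One Tensor Core instruction's worth of work (an MMA Atom), tiled across
// a 2x2 arrangement of warps so it acts as a single thread block-wide MMA.
auto tiled_mma = make_tiled_mma(
    SM80_16x8x16_F32F16F16F32_TN{},   // warp-wide MMA Atom
    Layout<Shape<_2, _2, _1>>{});      // atom layout in (M, N, K)

// The tiled operation is then invoked through cute::gemm() on partitioned tensors.
```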
|
||||
|
||||
## Atom API
|
||||
|
||||
An "Atom" is the smallest collection of threads and data
|
||||
that must participate in the execution of a hardware-accelerated
|
||||
math or copy operation.
|
||||
|
||||
An Atom is "atomic" (indivisible) not in the sense of
|
||||
concurrent memory operations like `atomicAdd`
|
||||
(which are "indivisible in time (causality)"),
|
||||
but in the sense of indivisibility in "space" --
|
||||
the number of values and the groups of parallel workers
|
||||
that must participate in the operation together.
|
||||
|
||||
An Atom uses CuTe Layouts to express the required
|
||||
dimensions and strides of its input and output arrays.
|
||||
Generally these are fixed at compile time.
|
||||
|
||||
The Atom API wraps calls to actual hardware instructions
|
||||
that accelerate MMA or copy operations.
|
||||
Users can ask for GPU architecture-specific implementations,
|
||||
or just pick generic implementations and rely on
|
||||
whatever GPU architectures were enabled.
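
As a hedged sketch, wrapping a hardware copy instruction in a `Copy_Atom` and tiling it across
threads might look like this; the `cp.async` operation and the thread/value layouts are
illustrative assumptions for an SM80-or-newer target.

```c++
using namespace cute;

// Wrap a 16-byte cp.async instruction as a Copy Atom operating on half_t values,
// then tile it over a 32x8 arrangement of threads, 8 values per thread.
auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, half_t>{},
    Layout<Shape<_32, _8>>{},    // thread layout
    Layout<Shape< _1, _8>>{});   // value layout per thread

// The tiled operation is then invoked through cute::copy() on partitioned tensors.
```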
|
||||
|
||||
For more information about Atoms,
|
||||
please refer to CuTe's tutorial, e.g., the sections on
|
||||
|
||||
* [algorithms](./cute/04_algorithms.md) like `gemm` and `copy`,
|
||||
|
||||
* [MMA Atoms](./cute/0t_mma_atom.md#cute-mma-atoms), and
|
||||
|
||||
* [a GEMM example](./cute/0x_gemm_tutorial.md).
|
||||
|
||||
# Copyright
|
||||
|
||||
Copyright (c) 2023 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
SPDX-License-Identifier: BSD-3-Clause
|
||||
|
||||
```
|
||||
Redistribution and use in source and binary forms, with or without
|
||||
modification, are permitted provided that the following conditions are met:
|
||||
|
||||
1. Redistributions of source code must retain the above copyright notice, this
|
||||
list of conditions and the following disclaimer.
|
||||
|
||||
2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
this list of conditions and the following disclaimer in the documentation
|
||||
and/or other materials provided with the distribution.
|
||||
|
||||
3. Neither the name of the copyright holder nor the names of its
|
||||
contributors may be used to endorse or promote products derived from
|
||||
this software without specific prior written permission.
|
||||
|
||||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
```
|
||||
|
||||
|
||||
[README](/README.md#documentation) > **Layouts and Tensors**
|
||||
|
||||
Note: This document talks about CUTLASS 2.x layout tag types.
|
||||
CUTLASS 3.0 deprecates all legacy 2.x layout tags in favour of a single `cute::Layout<Shape, Stride>`
|
||||
vocabulary type for all thread and data tensors. Please refer to the
|
||||
[documentation for cute layouts](media/docs/cute/01_layout.md) for more details about CUTLASS 3.0's definition of "layout".
|
||||
|
||||
# Layouts and Tensors
|
||||
|
||||
_Tensors_ are mathematical objects represented by a multidimensional array of numeric elements in memory.
|
||||
|
||||
|
||||
# Synchronization primitives
|
||||
|
||||
## Overview of CUDA's synchronization methods
|
||||
|
||||
The CUDA programming model provides 3 abstractions:
|
||||
|
||||
* hierarchical parallelism -- that is, parallel threads
|
||||
grouped into hierarchical units such as blocks and clusters;
|
||||
|
||||
* shared memory, through which parallel threads that are
|
||||
in the same hierarchical unit can communicate; and
|
||||
|
||||
* synchronization methods for threads.
|
||||
|
||||
These abstractions help developers extract
|
||||
both fine-grained and coarse-grained parallelism,
|
||||
by making it possible for them to subdivide problems
|
||||
into independent components,
|
||||
and to insert synchronization at appropriate points.
|
||||
|
||||
Over the years CUDA has introduced several synchronization primitives
|
||||
that operate at different levels of the hierarchy.
|
||||
These include
|
||||
|
||||
* [thread block - level](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions) synchronization (e.g., `__syncthreads()`);
|
||||
|
||||
* [warp-level](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) synchronization (e.g., `__syncwarp()`); and
|
||||
|
||||
* [thread-level](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#memory-fence-functions) fence operations.
|
||||
|
||||
As an extension to this, starting with the Hopper architecture, CUDA added the following improvements:
|
||||
|
||||
* [thread block clusters](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters) --
|
||||
a new level in the thread hierarchy representing
|
||||
a group of thread blocks that can coordinate and share data;
|
||||
|
||||
* synchronization instructions for a thread block cluster and threads within a cluster scope.
|
||||
|
||||
## CUTLASS's abstractions for Hopper features
|
||||
|
||||
CUTLASS now includes abstractions
|
||||
for the following features introduced in Hopper.
|
||||
|
||||
1. Thread block cluster-level synchronization and query
[APIs](/include/cute/arch/cluster_sm90.hpp) (a brief usage sketch follows this list)
|
||||
|
||||
2. Abstractions for new
|
||||
[barrier instructions](/include/cutlass/arch/barrier.h)
|
||||
which help with efficient synchronization
|
||||
of threads within a thread block cluster.
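
The following is a hedged sketch of the cluster query and synchronization helpers from
`include/cute/arch/cluster_sm90.hpp`; the kernel name is illustrative, and exact helper names
may differ between releases.

```c++
#include <cute/arch/cluster_sm90.hpp>

__global__ void cluster_example_kernel() {
  // Query the position of this thread block within its cluster.
  dim3 block_in_cluster = cute::block_id_in_cluster();
  (void) block_in_cluster;

  // cluster_arrive() / cluster_wait() together act as a cluster-wide barrier.
  cute::cluster_arrive();
  // ... work that other blocks in the cluster do not depend on ...
  cute::cluster_wait();
}
```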
|
||||
|
||||
### Asynchronous pipelines
|
||||
|
||||
In order to write a performant GEMM Kernel,
|
||||
software pipelining is critical to hide the latency of global memory loads.
|
||||
(Please refer to the
|
||||
[Efficient GEMM](/media/docs/efficient_gemm.md#pipelining) document.)
|
||||
Different threads or groups of threads
|
||||
may have different roles in the pipeline.
|
||||
Some are "producers" that load data or perform computations
|
||||
to satisfy other threads' input data dependencies.
|
||||
The same or different threads may be "consumers"
|
||||
that do other work with those input data dependencies,
|
||||
once they are satisfied.
|
||||
Starting with the Hopper architecture,
|
||||
the presence of hardware-accelerated synchronization instructions
|
||||
makes it possible for "producer" and "consumer" threads
|
||||
to communicate with each other efficiently
|
||||
about their data dependencies.
|
||||
|
||||
Implementing a persistent GEMM algorithm calls for managing
|
||||
dozens of different kinds of asynchronously executing operations
|
||||
that synchronize using multiple barriers organized as a circular list.
|
||||
This complexity is too much for human programmers to manage by hand.
|
||||
As a result, we have developed
|
||||
[asynchronous Pipeline classes](/include/cutlass/pipeline.hpp).
|
||||
These classes help developers orchestrate a pipeline
|
||||
of asynchronous producer and consumer threads,
|
||||
without needing to worry about lower-level hardware details.
|
||||
These classes serve a similar function as the various
|
||||
[pipeline abstractions](https://nvidia.github.io/libcudacxx/extended_api/synchronization_primitives/pipeline.html)
|
||||
in libcu++.
|
||||
|
||||
#### Pipeline methods
|
||||
|
||||
##### Producer acquire
|
||||
|
||||
The `producer_acquire` method is to be used by asynchronous producer threads
|
||||
before issuing other instructions associated with a particular pipeline stage
|
||||
(e.g., copy or write).
|
||||
|
||||
This is a blocking instruction,
which blocks further execution of the producer threads
until the particular stage waiting to be acquired
is released by a consumer.
|
||||
|
||||
We say that a pipeline at its start is "empty" if producer threads are free to produce and do not need to wait for a consumer release -- that is, if an acquire operation is expected to succeed. If the pipeline at its start is empty, then we can either skip performing producer acquire operations during the first pass through the pipeline stages, or use the `make_producer_start_state` method. The latter ensures that the acquire operation will succeed at the start of a pipeline.
|
||||
|
||||
##### Producer commit
|
||||
|
||||
The `producer_commit` method is to be issued by asynchronous producer threads
|
||||
after the instructions associated with a particular stage
|
||||
(e.g., shared memory writes) have completed,
|
||||
in order to notify the waiting asynchronous consumer threads.
|
||||
This is a nonblocking instruction.
|
||||
|
||||
This API may result in a no-op in some cases,
if the producer instructions also update the associated barrier stage automatically
(e.g., TMA-based producer threads using the `PipelineTmaAsync` class).
|
||||
|
||||
##### Consumer wait
|
||||
|
||||
The `consumer_wait` method is to be used by consumer threads
|
||||
before consuming data from a particular pipeline stage
|
||||
which is expected to be produced by producer threads.
|
||||
|
||||
This is a blocking instruction. That is,
|
||||
until the producer threads have committed to a particular stage,
|
||||
this instruction is expected to block further execution of consumer threads.
|
||||
|
||||
##### Consumer release
|
||||
|
||||
The `consumer_release` method is to be used by consumer threads
|
||||
to signal waiting producer threads that they have finished consuming data
|
||||
associated with a particular stage of the pipeline.
|
||||
This is a nonblocking instruction.
|
||||
|
||||
#### Pipeline example
|
||||
|
||||
```c++
|
||||
// 4-stage Pipeline
|
||||
static constexpr int NumStages = 4;
|
||||
using MainloopPipeline = typename cutlass::PipelineAsync<NumStages>;
|
||||
using PipelineState = typename cutlass::PipelineState<NumStages>;
|
||||
|
||||
// 2 producer threads and 1 consumer thread
|
||||
typename MainloopPipeline::Params params;
|
||||
params.producer_arv_count = 2;
|
||||
params.consumer_arv_count = 1;
|
||||
MainloopPipeline pipeline(shared_storage.storage, params);
|
||||
|
||||
// Producer threads
|
||||
if (thread_idx == 0 or thread_idx == 1) {
|
||||
PipelineState smem_pipe_write = cutlass::make_producer_start_state<MainloopPipeline>();
|
||||
for ( ; iter > 0; --iter) {
|
||||
pipeline.producer_acquire(smem_pipe_write);
|
||||
|
||||
// Producer ops
|
||||
// If any memory operations are involved, then we also need
|
||||
// to guarantee that writes are completed and visible to consumer(s).
|
||||
|
||||
pipeline.producer_commit(smem_pipe_write.index());
|
||||
++smem_pipe_write;
|
||||
}
|
||||
}
|
||||
else if (thread_idx == 2) {
|
||||
PipelineState smem_pipe_read;
|
||||
for (; iter > 0; --iter) {
|
||||
pipeline.consumer_wait(smem_pipe_read);
|
||||
|
||||
// Consumer ops
|
||||
|
||||
pipeline.consumer_release(smem_pipe_read);
|
||||
++smem_pipe_read;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
In this example, we create an instance of the asynchronous pipeline class `PipelineAsync`,
|
||||
and then synchronize among 3 asynchronously executing threads:
|
||||
2 producer threads and 1 consumer thread.
|
||||
|
||||
Please note that this is a basic example.
|
||||
There are different versions possible,
|
||||
depending on what the producer and consumer threads are doing.
|
||||
Please refer to our [unit tests](/test/unit/pipeline)
|
||||
and the other [pipeline classes](/include/cutlass/pipeline.hpp)
|
||||
for more details.
|
||||
|
||||
|
||||
The CUTLASS Profiler may be compiled with:
```bash
$ make cutlass_profiler -j
```
|
||||
|
||||
To limit compilation time, only one tile size (typically 128x128) and threadblock cluster size (typically 2x1x1) are instantiated for each data type,
|
||||
math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an
|
||||
empty `build/` directory.
|
||||
|
||||
The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.
|
||||
|
||||
The CUTLASS Profiler can be built with cuBLAS enabled to use as a reference implementation. If CMake detects
|
||||
|
||||
the cuBLAS library available in the system, it is included as a dependency. This may be explicitly overridden
|
||||
with CMake flag `CUTLASS_ENABLE_CUBLAS`.
|
||||
|
||||
## GEMM Arguments
|
||||
|
||||
|
||||
[int] --cta_m,--threadblock-shape::m Threadblock shape in the M dimension.
|
||||
[int] --cta_n,--threadblock-shape::n Threadblock shape in the N dimension.
|
||||
[int] --cta_k,--threadblock-shape::k Threadblock shape in the K dimension.
|
||||
[int] --cluster_m,--cluster-shape::m Cluster shape in the M dimension.
[int] --cluster_n,--cluster-shape::n Cluster shape in the N dimension.
[int] --cluster_k,--cluster-shape::k Cluster shape in the K dimension.
|
||||
[int] --stages,--threadblock-stages Number of stages of threadblock-scoped matrix multiply.
|
||||
[int] --warps_m,--warp-count::m Number of warps within threadblock along the M dimension.
|
||||
[int] --warps_n,--warp-count::n Number of warps within threadblock along the N dimension.
|
||||
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||
--m=3456 --n=4096 --k=8:4096:8 --output=report.csv \
|
||||
--tags=cutlass:2.2,date:2020-06-08
|
||||
```
|
||||
|
||||
|
||||
## CUTLASS 3.0 GEMM procedural names
|
||||
|
||||
CUTLASS 3.0 introduces a new naming convention for GEMMs used by the profiler targeting the NVIDIA
|
||||
Hopper architecture and beyond so as to indicate new features of the kernel within the name
|
||||
(e.g., the cluster shape).
|
||||
|
||||
To best illustrate this naming convention, we will walk through the meaning of each of the components
|
||||
in a GEMM kernel used by the profiler:
|
||||
```
|
||||
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_128x128x64_2x1x1_0_ntn_align8
|
||||
```
|
||||
|
||||
The components within this name are as follows:
|
||||
|
||||
* `cutlass3x`: indicates that the kernel was generated through the CUTLASS 3.0 API
|
||||
* `sm90`: indicates that the kernel targets NVIDIA GPUs with compute capability 90
|
||||
* `tensorop`: indicates that the kernel makes use of NVIDIA Tensor Cores
|
||||
(as opposed to `simt`, which indicates the use of "CUDA cores")
|
||||
* `s`: indicates that the Tensor Core instruction being used accumulates in single precision
|
||||
(as opposed to `h`, which indicates half precision)
|
||||
* `64x128x16gemm`: indicates that the shape of the Tensor Core instruction being used (MxNxK) is 64x128x16
|
||||
* `f16_f16_f32_f32`: indicates that operands A and B use data type `f16`
(half precision) and that accumulation and operand C use `f32` (single precision)
|
||||
* `128x128x64`: indicates that the thread block shape used in the GEMM (MxNxK) is 128x128x64
|
||||
* `2x1x1`: indicates that the cluster shape being used is 2x1x1
|
||||
* `0`: indicates that the kernel uses the CollectiveBuilder's automatic stage calculation to determine the
|
||||
number of pipeline stages in the kernel. Note that `0` does not mean that no stages are used. A nonzero value indicates that automatic stage calculation is not performed and indicates the number of pipeline stages to be used.
|
||||
This 0 is only added to the kernel's procedural name; the profiler will still report the actual stage count
|
||||
when printing the kernel argument details (`--stages=N`) and kernel discovery will still support filtering through the `--stages` argument.
|
||||
* `ntn`: indicates that the layouts for operands A, B, and C are column major ("n"; non-transposed),
|
||||
row major ("t"; transposed), and column major, respectively.
|
||||
* `align8`: indicates that the maximum alignment between operands A and B is 8.
|
||||
|
||||
Note that in some special cases where the input A/B types do not match those of the MMA
instruction, the MMA-facing input type is added to the instruction string as well.
|
||||
|
||||
```
|
||||
cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
|
||||
```
|
||||
|
||||
* `s64x128x8tf32gemm`: indicates that the MMA consumes inputs in `tf32` format, and therefore
|
||||
the kernel performs rounding of the `f32` values in global memory while loading them into shared memory.
|
||||
|
||||
# Convolution
|
||||
|
||||
|
||||
|
||||
|
||||
## Hierarchical Organization
|
||||
|
||||
|
||||
The [CUTLASS 3.0 GEMM API](./gemm_api_3x.md) document
|
||||
explains CUTLASS 3.0's hierarchical organization,
|
||||
based conceptually on parallelization strategy.
|
||||
This differs from CUTLASS 2.x's approach,
|
||||
which more closely mirrors the GPU hardware hierarchy
|
||||
of thread blocks, warps, and threads.
|
||||
|
||||
## Design Patterns
|
||||
|
||||
|
||||
CUTLASS aims for the highest performance possible on NVIDIA GPUs.
|
||||
It also offers flexible components that can be assembled and customized
|
||||
to solve new problems related to deep learning and linear algebra.
|
||||
Given a tradeoff between simplicity and performance,
|
||||
CUTLASS chooses performance.
|
||||
Consequently, several design patterns are necessary
|
||||
to yield a composable structure
|
||||
while also satisfying these performance objectives.
|
||||
|
||||
### Templates
|
||||
|
||||
|
||||
|
||||
To be consistent, this pattern defines a convention in which classes define internal shared memory storage requirements.
|
||||
Classes should consider all SharedStorage structures to be opaque other than their own child class. When the lifetimes
|
||||
|
||||
of child objects are known to be non-overlapping, `union`s may be used to alias multiple SharedStorage objects to the same
|
||||
shared memory region and reduce overall shared memory capacity. Developers should carefully note that C++ `union` rules
|
||||
require that they only access the most recently written ("active") member of the `union`; this differs from C rules.
|
||||
|
||||
### Loop Unrolling
|
||||
|
||||
|
||||
|
||||
## Style
|
||||
|
||||
|
||||
### No automatic code formatting
|
||||
|
||||
|
||||
Do not use any kind of automatic code formatting,
|
||||
like `clang-format`, on CUTLASS code.
|
||||
|
||||
|
||||
### C++ style
|
||||
|
||||
|
||||
#### CUTLASS is a C++ project
|
||||
|
||||
|
||||
CUTLASS is a C++ project. CUDA C++ is a C++ dialect.
|
||||
Therefore, we write using standard C++ idioms as much as possible.
|
||||
We aim for portability to as many compilers as possible,
|
||||
by writing host code in Standard C++
|
||||
and device code in CUDA C++
|
||||
that resembles Standard C++ as much as possible.
|
||||
This improves usability
|
||||
for the general community of C++ developers,
|
||||
and makes it easier for new staff to join the project.
|
||||
|
||||
|
||||
#### Follow Standard C++ idioms where possible
|
||||
|
||||
|
||||
Regarding "standard C++ idioms,"
|
||||
CUTLASS source code follows the following guidelines,
|
||||
with deviations only because of compiler limitations
|
||||
or where performance absolutely requires it.
|
||||
"Performance requires it" implies measurement.
|
||||
Deviations should be limited in scope
|
||||
and we should always strive to eliminate them.
|
||||
|
||||
|
||||
* [C++ Core Guidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md)
|
||||
|
||||
|
||||
* [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html)
|
||||
|
||||
|
||||
#### Spacing and line length
|
||||
|
||||
|
||||
* Use spaces, not tabs.
|
||||
|
||||
|
||||
* Use 2 spaces to indent.
|
||||
|
||||
|
||||
* Max 100 characters per line.
|
||||
|
||||
|
||||
Lines longer than 100 characters typically wrap unfavorably
|
||||
when viewed in Github's pretty printer.
|
||||
|
||||
|
||||
#### Function indentation
|
||||
|
||||
|
||||
When calling a function or function object with a long name,
|
||||
break the line right after the invoking open parenthesis.
|
||||
Here is an example.
|
||||
|
||||
```c++
|
||||
detail::very_long_function_object_name<TemplateArgument>{}(
|
||||
params.long_parameter_name, some_operator.another_long_function_name());
|
||||
```
|
||||
|
||||
When declaring functions, indent function parameters like this.
|
||||
|
||||
```c++
|
||||
void possibly_an_unusually_long_function_name(
|
||||
std::uint32_t foo,
|
||||
std::uint32_t const* bar,
|
||||
TypeA a,
|
||||
TypeB b,
|
||||
TypeC c)
|
||||
{
|
||||
// ... the function's body ...
|
||||
}
|
||||
```
|
||||
|
||||
For function definitions only,
|
||||
break the line between the parenthesis
|
||||
that closes the function's parameters,
|
||||
and the curly bracket
|
||||
that opens the function's body.
|
||||
|
||||
#### If-else brackets and spacing
|
||||
|
||||
* Always use braces with conditionals such as `if`.
|
||||
|
||||
* Use a space after control flow keywords
|
||||
such as `if`, `for`, and `while`.
|
||||
|
||||
* Use a space after the parenthesis closing a conditional
|
||||
such as `if`, and the curly bracket opening a scope.
|
||||
|
||||
* Use a new line between the closing brace
|
||||
of an `if` branch, and the `else` keyword.
|
||||
|
||||
```c++
|
||||
if (condition) {
|
||||
// ... code ...
|
||||
}
|
||||
else {
|
||||
// ... other code ...
|
||||
}
|
||||
|
||||
for (int k = 0; k < num_iters; ++k) {
|
||||
// ... still more code ...
|
||||
}
|
||||
```
|
||||
|
||||
#### East const
|
||||
|
||||
CUTLASS uses the
|
||||
["East const"](http://slashslash.info/2018/02/a-foolish-consistency/)
|
||||
convention.
|
||||
That is, the `const` or `constexpr` keyword
|
||||
goes after the type, not before.
|
||||
The general rule is that `const` or `constexpr`
|
||||
modifies the type to the left of it.
|
||||
Here are some examples.
|
||||
|
||||
```c++
|
||||
float constexpr compile_time_constant = 42.3f;
|
||||
|
||||
float const const_float = /* whatever */;
|
||||
float const& reference_to_const_float = const_float;
|
||||
float const* pointer_to_const_float = &const_float;
|
||||
float const* const const_pointer_to_const_float = &const_float;
|
||||
|
||||
float nonconst_float;
|
||||
float& reference_to_nonconst_float = nonconst_float;
|
||||
float* pointer_to_nonconst_float = &nonconst_float;
|
||||
float* const const_pointer_to_nonconst_float = &nonconst_float;
|
||||
```
|
||||
|
||||
Contrast this with "West const" style, e.g.,
|
||||
|
||||
```c++
|
||||
const float const_float = /* whatever */;
|
||||
const float* pointer_to_const_float = &const_float;
|
||||
```
|
||||
|
||||
#### Alignment of reference and pointer types
|
||||
|
||||
For reference and pointer types,
|
||||
align the `&` resp. `*` flush against the type
|
||||
that it modifies. This is called "left alignment."
|
||||
|
||||
For example, do this:
|
||||
|
||||
```c++
|
||||
int const& var;
|
||||
int const* var;
|
||||
```
|
||||
|
||||
and not this.
|
||||
|
||||
```c++
|
||||
int const &var;
|
||||
int const *var;
|
||||
```
|
||||
|
||||
#### Avoid calling functions "fast" or "optimized"
|
||||
|
||||
Putting words like "fast" or "optimized"
|
||||
in the name of a function
|
||||
assumes that the "fast" path is actually faster.
|
||||
That might be true now, but later changes
|
||||
(in the code, compilers, or GPU hardware)
|
||||
might make it false. In that case,
|
||||
your name could be unintentionally misleading.
|
||||
Consider instead a name that briefly describes
|
||||
the algorithm or feature that is relevant for optimization.
|
||||
For example, `compute_on_host` is more meaningful
|
||||
than `compute_slowly`, and computing on host
|
||||
might be faster in some cases
|
||||
(e.g., if the data are already on host
|
||||
and the algorithm is not GPU-friendly).
|
||||
|
||||
CUTLASS code has not always followed this rule in the past.
|
||||
Some functions and classes might have words like "fast" in their name.
|
||||
New code should follow this rule, however.
|
||||
|
||||
#### Avoid creating unconstrained templated functions with common names
|
||||
|
||||
See [C++ Core Guidelines T.47](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#t47-avoid-highly-visible-unconstrained-templates-with-common-names):
|
||||
"Avoid highly visible unconstrained templates
|
||||
with common names."
|
||||
Argument-dependent lookup (ADL) means that
|
||||
if users call a function name without specifying the namespace,
|
||||
the compiler can find overloads
|
||||
of that function in any namespace.
|
||||
This can lead to ambiguous overloads in users' code,
|
||||
just because they happened to include one of your header files
|
||||
that exposes an unconstrained function template.
|
||||
The following illustrates this
|
||||
with an unconstrained swap overload in the `cutlass` namespace.
|
||||
|
||||
```c++
|
||||
#include <cassert>
|
||||
#include <memory>
|
||||
#include <utility>
|
||||
|
||||
// Uncomment the line below to observe unwarranted build errors.
|
||||
//#define BAD_CUTLASS_SWAP 1
|
||||
|
||||
namespace cutlass {
|
||||
struct Bar {
|
||||
float f;
|
||||
};
|
||||
} // namespace cutlass
|
||||
|
||||
#ifdef BAD_CUTLASS_SWAP
|
||||
namespace cutlass {
|
||||
|
||||
template<class T>
|
||||
void swap(T& a, T& b) // don't do this
|
||||
{
|
||||
T tmp = a;
|
||||
a = b;
|
||||
b = tmp;
|
||||
}
|
||||
|
||||
} // namespace cutlass
|
||||
#endif // BAD_CUTLASS_SWAP
|
||||
|
||||
namespace other {
|
||||
|
||||
#ifdef BAD_CUTLASS_SWAP
|
||||
using cutlass::swap;
|
||||
#endif // BAD_CUTLASS_SWAP
|
||||
|
||||
// Imagine for the sake of this example
|
||||
// that "foo" is a less common name,
|
||||
// and that T is constrained via
|
||||
// std::enable_if or a requires clause.
|
||||
template<class T>
|
||||
void foo(T& a, T& b)
|
||||
{
|
||||
// The usual idiom for using std::swap is the "swap two-step":
|
||||
//
|
||||
// 1. import std::swap into the current scope, then
|
||||
// 2. call swap without namespace qualification.
|
||||
//
|
||||
// That won't build if we have another swap
|
||||
// overload available in the scope already.
|
||||
|
||||
using std::swap;
|
||||
swap(a, b); // OBSERVE UNWARRANTED BUILD ERROR HERE
|
||||
}
|
||||
|
||||
} // namespace other
|
||||
|
||||
int main()
|
||||
{
|
||||
int x = 42;
|
||||
int y = 43;
|
||||
other::foo(x, y);
|
||||
assert(x == 43);
|
||||
assert(y == 42);
|
||||
|
||||
cutlass::Bar a{42.0};
|
||||
cutlass::Bar b{43.0};
|
||||
other::foo(a, b);
|
||||
assert(a.f == 43.0);
|
||||
assert(b.f == 42.0);
|
||||
|
||||
// GCC 7.5 std::unique_ptr::reset calls swap,
|
||||
// leading to the same issue as above.
|
||||
// GCC 12.2's implementation of std::unique_ptr
|
||||
// does not have this issue. Nevertheless,
|
||||
// breaking the swap two-step will break users' code,
|
||||
// just by them happening to include your headers.
|
||||
auto ptr = std::make_unique<cutlass::Bar>(cutlass::Bar{666.0f});
|
||||
ptr.reset(new cutlass::Bar{777.0f}); // OBSERVE UNWARRANTED BUILD ERROR HERE
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
#### Function return values and in-out parameters
|
||||
|
||||
##### Prefer return values to output parameters
|
||||
|
||||
In general, avoid in-out mutable references to return a value.
|
||||
If you need to return multiple values,
|
||||
you can return them by `struct` or `tuple`,
|
||||
rather than by output references.
|
||||
This includes the special case of error reporting
|
||||
by returning either a value or an error code.
|
||||
Please see the next section for details.
|
||||
|
||||
```c++
|
||||
// Instead of passing in-out mutable references ...
|
||||
void not_preferred(float& input_and_output); // not preferred
|
||||
|
||||
// keep functions pure and return value types instead
|
||||
float preferred(float input); // preferred
|
||||
```
|
||||
|
||||
##### Return multiple values by struct or tuple
|
||||
|
||||
Sometimes a function needs to return multiple values. In that case, consider the following, in decreasing order of preference.
|
||||
|
||||
1. Return a `struct`. This lets you name the fields
|
||||
(for more self-documenting code),
|
||||
yet still permits use of structured binding.
|
||||
|
||||
2. Return a `tuple`. If you need a tuple type
|
||||
that works on device, use `cute::tuple`.
|
||||
(Please note that `cute::tuple` does not work
|
||||
for all the types that work in `std::tuple`.
|
||||
CuTe's documentation explains.)
|
||||
|
||||
Here is an example of the struct approach for named values.
|
||||
For a comparable example in the C++ Standard,
|
||||
please see [`std::allocate_at_least`](https://en.cppreference.com/w/cpp/memory/allocate_at_least),
|
||||
which returns `std::allocation_result`.
|
||||
|
||||
```c++
|
||||
struct my_computation_result {
|
||||
float value = 0.0f;
|
||||
float relative_error = 0.0f;
|
||||
bool success = false;
|
||||
};
|
||||
|
||||
my_computation_result my_computation(float tolerance);
|
||||
|
||||
void foo(float tolerance)
|
||||
{
|
||||
// Approach 1: Use structured binding. The names
|
||||
// you choose on the left-hand side have nothing
|
||||
// to do with the struct, so it's up to you
|
||||
// to get the order right. On the other hand,
|
||||
// this code works whether my_computation returns
|
||||
// a struct or a tuple.
|
||||
auto [val, rel_err, ok] = my_computation(tolerance);
|
||||
|
||||
// Approach 2: Keep the struct and use its named fields.
|
||||
// This approach prevents errors like mixing the order of return types.
|
||||
// However, it only works for structs, not for tuples.
|
||||
|
||||
auto result = my_computation(tolerance);
|
||||
if (not result.success) {
|
||||
// computation did not succeed
|
||||
}
|
||||
else if (result.relative_error > tolerance) {
|
||||
// successful but relative error too large
|
||||
}
|
||||
else {
|
||||
// successful and relative error is in bounds
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
##### Reporting errors from a function that returns one or more values
|
||||
|
||||
We may want to return one or more values
|
||||
from a function that could fail
|
||||
or otherwise report errors.
|
||||
That is, the function either
|
||||
|
||||
* returns one or more valid values, or
|
||||
|
||||
* does not return any values and reports an error,
|
||||
|
||||
but NOT BOTH. We contrast this with cases
|
||||
when it's meaningful to report both a result
|
||||
and whether the result is satisfactory.
|
||||
For example, when solving
|
||||
a system of nonlinear equations iteratively,
|
||||
users may want the approximate computed solution,
|
||||
even if the iteration did not succeed
|
||||
by converging to the desired tolerance
|
||||
in the desired number of steps.
|
||||
(Users may want to invest more steps,
|
||||
or use the current approximation
|
||||
to jump-start a different algorithm.)
|
||||
|
||||
We're talking here about the "either valid value(s),
|
||||
or error, but not both" case.
|
||||
For this case, C++ offers a few options.
|
||||
|
||||
1. Return the value(s), or throw an exception on error
|
||||
|
||||
2. `std::expected` (requiring C++23) or something like it
|
||||
|
||||
3. `std::optional` (for a Boolean error state)
|
||||
or something like it
|
||||
|
||||
4. `std::variant` (a C++17 fall-back for `std::expected`)
|
||||
or something like it
|
||||
|
||||
5. C-style interface: return an error code,
|
||||
and "return" the values as output parameters
|
||||
|
||||
We usually cannot or do not want to
|
||||
throw exceptions on device.
|
||||
Some code projects forbid exceptions entirely
|
||||
(on host or device)
|
||||
and tell the compiler to disable them.
|
||||
If we exclude a C-style interface (the last option)
|
||||
as not idiomatic C++, then for host-only code,
|
||||
`std::expected`, `std::optional`, and `std::variant`
|
||||
all work.
|
||||
For code that needs to build and run on device,
|
||||
we can fall back to libcu++ equivalents
|
||||
in the `cuda::std::` namespace, when they exist.
|
||||
Otherwise, we must resort to returning a struct or tuple
|
||||
with the value and the error information,
|
||||
and ask users not to use the value on error.
|
||||
This is acceptable if the value can be constructed
|
||||
cheaply with a reasonable default.
|
||||
|
||||
##### Performance of different value-or-error reporting methods
|
||||
|
||||
[P1886R0](https://wg21.link/P1886R0)
|
||||
(Ben Craig, "Error speed benchmarking")
|
||||
surveys different ways in Standard C++
|
||||
to report errors from a function
|
||||
that returns one or more values,
|
||||
and compares their (host-only) performance
|
||||
with different compilers.
|
||||
|
||||
##### Use aggregate initialization when returning a struct or tuple
|
||||
|
||||
Use aggregate initialization when returning a struct or tuple.
|
||||
This avoids duplication of the return type name.
|
||||
|
||||
```c++
|
||||
struct foo_result {
|
||||
float value = 0.0f;
|
||||
float error = 0.0f;
|
||||
bool success = false;
|
||||
};
|
||||
|
||||
foo_result foo(std::span<const float> input)
|
||||
{
|
||||
// ... code ...
|
||||
|
||||
// Prefer this. We know what type the function returns.
|
||||
return {val, err, ok}; // prefer this
|
||||
|
||||
// Naming foo_result again here is unnecessary.
|
||||
// return foo_result{val, err, ok};
|
||||
}
|
||||
```
|
||||
|
||||
However, note that this won't work if the function returns `auto`.
|
||||
The general rule is to avoid code duplication.
|
||||
|
||||
```c++
|
||||
auto foo(std::span<const float> input)
|
||||
{
|
||||
// ... code ...
|
||||
|
||||
if constexpr (some_condition) {
|
||||
return foo_result{val, err, ok};
|
||||
}
|
||||
else {
|
||||
return bar_result{val, err, ok};
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
##### Prefer using the actual return type to auto, if you know the type
|
||||
|
||||
C++ lets you use `auto` to deduce the type returned from a function.
|
||||
|
||||
* If you know the actual type, prefer using the type instead of `auto`.
|
||||
|
||||
* Use [Class Template Argument Deduction](https://en.cppreference.com/w/cpp/language/class_template_argument_deduction)
|
||||
(CTAD) if you know that a function returns some type
|
||||
(e.g., `Tensor`), but don't know the type's template arguments.
|
||||
|
||||
* Use `auto` in structured bindings (where you have to use it anyway). This also makes your code agnostic of whether the return type is a `struct`, `tuple`, `pair`, or other tuple-like type.
|
||||
|
||||
* Be careful using `auto` with types that provide expression templates.
|
||||
|
||||
Contrast this with "Almost Always Auto" (AAA) style.
|
||||
We deliberately choose not to follow AAA style,
|
||||
for the following reasons.
|
||||
|
||||
* Using the actual type when we know it can help prevent common loss-of-precision errors in mixed-precision computations, an important use case for CUTLASS.
|
||||
|
||||
* CTAD gives us much of the brevity of AAA, with more clarity.
|
||||
|
||||
* Using the actual type instead of `auto` can prevent common dangling errors with expression templates.
|
||||
|
||||
#### Classes and structs
|
||||
|
||||
Type names use `CamelCase`.
|
||||
That is, words start with capital letters.
|
||||
The remaining letters in the word are lower case,
|
||||
and words are joined with no intervening underscores.
|
||||
The only exception is when implementations are
|
||||
a drop-in replacement for C++ Standard Library components.
|
||||
|
||||
Follow the
|
||||
[C++ Core Guidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-struct)
|
||||
to decide whether to use `class` or `struct`.
|
||||
|
||||
* Use `class` when the object must maintain an invariant.
|
||||
Data members related to the invariant should be `private`.
|
||||
|
||||
* Use `struct` when the class has no invariant to maintain,
|
||||
and data members may vary arbitrarily with respect to each other.
|
||||
|
||||
Prefer nonmember functions and statelessness where possible.
|
||||
Member functions imply invariants.
|
||||
More invariants make code maintenance and testing harder.
|
||||
|
||||
#### Class members
|
||||
|
||||
Methods and members are written using `snake_case`.
|
||||
|
||||
Private data and function members have suffix `_`.
|
||||
|
||||
### Constant names
|
||||
|
||||
CUTLASS makes extensive use of constants and compile-time evaluation. Constant variable names should have
|
||||
prefix `k` and use mixed case. True compile-time constants should be defined as `constexpr` to enable
|
||||
dependent `constexpr` functions.
|
||||
|
||||
CUTLASS uses ["East const"](http://slashslash.info/2018/02/a-foolish-consistency/) style, placing `constexpr` keyword
|
||||
after the type name.
|
||||
|
||||
```c++
|
||||
float constexpr kPi = 3.14159f;
|
||||
```
|
||||
|
||||
|
||||
#### Class Member Order
|
||||
|
||||
Members within classes and structures should be organized as follows:
|
||||
|
||||
1. Type and constant definitions
|
||||
|
||||
2. Data members
|
||||
|
||||
3. Constructors
|
||||
|
||||
4. Other methods
|
||||
|
||||
|
||||
This convention follows the
|
||||
[CUB library](https://nvlabs.github.io/cub/)
|
||||
and is also described by
|
||||
[Howard Hinnant](https://howardhinnant.github.io/classdecl.html).
|
||||
It also approximates the usual ordering of chapters
|
||||
in a typical Systems and Controls textbook.
|
||||
That is, it
|
||||
|
||||
1. identifies relevant constants,
|
||||
|
||||
2. defines a state-space representation
|
||||
of the dynamical system under study
|
||||
(the class's data members), and then
|
||||
|
||||
3. devotes the remaining "chapters" to defining
|
||||
the system's dynamical behavior
|
||||
(the class's methods).
|
||||
|
||||
Here is an example class.

_Example_:
```c++
class A {
public:
  // type definitions
protected:
  // protected type definitions
private:
  // private type definitions

public:
  // data members
protected:
  // protected data members
  // STRONGLY TO BE AVOIDED;
  // please see C++ Core Guidelines
private:
  // private data members

public:
  // methods
protected:
  // protected methods
private:
  // private methods
};
```
#### Use scoped enums

Use scoped enums (a C++11 feature) for enumerated types. Use capital letters for the enumerated type name and prefix `k` for enumerators like other constants.
```c++
enum class MatrixOperation {
  // ... (enumerators elided in this excerpt)
};
```
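Since the enumerators of the example above are elided in this excerpt, here is a hypothetical scoped enum (names invented for illustration) that follows the same convention:

```c++
enum class ExampleTransform {
  kNone,        // enumerators use the k prefix, like other constants
  kTranspose
};
```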
#### Namespaces

Namespaces are all lower case. The top-level namespace is `cutlass::`. The second nested namespace refers to the general category of operation performed by its members: e.g., `gemm::`. The third nested namespace refers to the operations' position in the conceptual hierarchy: e.g., `device::`, `kernel::`, or `collective::`.

The bodies of namespace definitions should not be indented. Comments on the closing brace to indicate the namespace being closed are welcome.
```c++
namespace cutlass {
namespace gemm {
namespace kernel {

struct AnotherGemmKernel {
  // ... contents ...
};

} // namespace kernel
} // namespace gemm
} // namespace cutlass
```
#### File Names

New files should be named using `snake_case` with extension `.hpp` for header files, `.cu` for CUDA sources, and `.cpp` for C++ host-only source files.

Header files with extension `.h` are CUTLASS 2.x legacy headers.
#### Macros

Only use macros when the preprocessor is the only way to accomplish the task. Do not use macros for literal constants. Instead, if inside the body of a function, use `constexpr` values, and if at namespace scope, use [`inline constexpr` variables](https://en.cppreference.com/w/cpp/language/inline) (a C++17 feature).
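A small sketch of this guidance (the constant name is illustrative, not a CUTLASS symbol):

```c++
// Avoid:
//   #define WARP_SIZE 32

// Prefer, at namespace scope (C++17):
inline constexpr int kWarpSize = 32;

// Or, inside a function body:
//   int constexpr kWarpSize = 32;
```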
"Namespace" macros by starting them with the module name, e.g., `CUTLASS_`. Macros and ONLY MACROS use all capital letters with underscores between words. For example:

```c++
#define CUTLASS_MACROS_USE_ALL_CAPS inline __host__ __device__
```

Header files such as [cutlass/cutlass.h](../../include/cutlass/cutlass.h) and [cute/config.hpp](../../include/cute/config.hpp) offer macros for expressing compiler-dependent behavior. These include
* replacements for `__device__` and/or `__host__` annotations:

  * `CUTLASS_HOST_DEVICE` or `CUTE_HOST_DEVICE` for functions that run on the host and the device,

  * `CUTLASS_DEVICE` or `CUTE_DEVICE` for functions that run on the device only, and

  * `CUTE_HOST` for functions that run on the host only; and

* annotations for loop unrolling:

  * `CUTLASS_PRAGMA_UNROLL` or `CUTE_UNROLL` for full unrolling of loops with constant trip counts, and

  * `CUTLASS_PRAGMA_NO_UNROLL` or `CUTE_NO_UNROLL` to prevent unrolling.
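A short sketch of how these annotations are typically combined (the function itself is illustrative, not a CUTLASS component):

```c++
#include "cutlass/cutlass.h"

CUTLASS_HOST_DEVICE
float sum_of_squares(float const (&values)[4]) {
  float sum = 0.0f;
  CUTLASS_PRAGMA_UNROLL
  for (int i = 0; i < 4; ++i) {   // constant trip count, so the loop fully unrolls
    sum += values[i] * values[i];
  }
  return sum;
}
```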
#### Guard all headers with `#pragma once`

Use `#pragma once` to guard all headers.

```c++
/*!

*/

#pragma once

...
```

### CUDA C++ style

#### CUDA Built-in Variables

Avoid direct access to CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within CUTLASS components except in special circumstances.

Using built-in global variables directly within reusable components necessitates that all components use them consistently, which may not be possible if CUTLASS components are used in other contexts.

Instead, components should accept a linear ID identifying threads, warps, and threadblocks from calling code. The top-level kernel may then decide how to map threads, warps, and blocks to the problem it is solving.
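An illustrative sketch of this pattern (the component and kernel below are hypothetical):

```c++
#include "cutlass/cutlass.h"

// A reusable component receives a linear thread ID rather than reading threadIdx itself.
CUTLASS_DEVICE
void partition_work(int thread_idx) {
  // ... partition data using thread_idx ...
}

// The top-level kernel decides how the built-in variables map to that linear ID.
__global__ void example_kernel() {
  int thread_idx = threadIdx.x + blockDim.x * threadIdx.y;
  partition_work(thread_idx);
}
```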
#### Source Line Length

Avoid lines longer than 100 characters. These typically wrap unfavorably when viewed in GitHub's pretty printer.
#### Use CUTLASS's and CuTe's fundamental types and operations

Use the [fundamental types and operations](fundamental_types.md) defined in CUTLASS consistently. This contributes to a framework of interoperable, consistent components. It reduces code duplication, which reduces build and test times. It also saves developer effort.

CUTLASS's fundamental types and operations include

* [Numeric types](fundamental_types.md#numeric-types) to represent numeric data in host and device code, and

* [functional.h](fundamental_types.md#functional) to perform numeric operations in generic code (see the sketch below).
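A brief sketch of using these in generic code (the function below is illustrative, not part of CUTLASS):

```c++
#include "cutlass/cutlass.h"
#include "cutlass/functional.h"

// Generic code can stay agnostic of the numeric type by using functional.h operators.
template <typename T>
CUTLASS_HOST_DEVICE
T scaled_sum(T a, T b, T scale) {
  cutlass::multiplies<T> mul;
  cutlass::plus<T> add;
  return add(mul(scale, a), mul(scale, b));
}
```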
CUTLASS 3.0 uses CuTe components to represent data layouts and multidimensional arrays. Please refer to the [CuTe Tutorial](./cute/00_quickstart.md) for details. CuTe has replaced CUTLASS 2.x components such as [Containers](fundamental_types.md#containers), [Layouts](layout.md), and [`TensorRef` and `TensorView`](layout.md#tensorref).

# Copyright
@ -7,9 +7,9 @@

## Prerequisites

CUTLASS requires:
- NVIDIA CUDA Toolkit (11.4 or later required, [12.0](https://developer.nvidia.com/cuda-toolkit) recommended)
- CMake 3.18+
- host compiler supporting C++17 or greater (minimum g++ 7.5.0)
- Python 3.6+

CUTLASS may be optionally compiled and linked with
@ -24,13 +24,13 @@ $ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

```bash
$ mkdir build && cd build

$ cmake .. -DCUTLASS_NVCC_ARCHS=90a # compiles for NVIDIA Hopper GPU architecture
```

If your goal is strictly to build only the CUTLASS Profiler and to minimize compilation time, we suggest executing the following CMake command in an empty `build/` directory.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_ENABLE_TESTS=OFF -DCUTLASS_UNITY_BUILD_ENABLED=ON
```

This reduces overall compilation time by excluding unit tests and enabling the unity build.
@ -39,13 +39,13 @@ You may reduce build times by compiling only certain operations by setting the `
executed from an empty `build/` directory. This only compiles 2-D convolution kernels.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_OPERATIONS=conv2d
```

You may also filter kernels by name by supplying a filter string with flag `CUTLASS_LIBRARY_KERNELS`. For example, the command below selects only CUTLASS 3.x kernels.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=cutlass3x*
```
See more examples on selectively compiling CUTLASS GEMM and convolution kernels [here](quickstart.md#example-cmake-commands).
@ -180,6 +180,10 @@ To minimize compilation time, specific GPU architectures can be enabled via the
selected by [CUDA Compute Capability.](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities)

**NVIDIA Hopper Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a # compiles for NVIDIA Hopper GPU architecture
```

**NVIDIA Ampere Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 # compiles for NVIDIA Ampere GPU architecture
```

@ -204,32 +208,10 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS="60;61" # compiles for NVIDIA Pascal GPU architecture
$ cmake .. -DCUTLASS_NVCC_ARCHS="50;53" # compiles for NVIDIA Maxwell GPU architecture
```
## Clang

For experimental purposes, CUTLASS has been verified to compile with the following versions of Clang and CUDA.

* [clang 8.0](https://github.com/llvm/llvm-project/releases/download/llvmorg-8.0.1/clang+llvm-8.0.1-amd64-unknown-freebsd11.tar.xz) using the [CUDA 10.0 Toolkit](https://developer.nvidia.com/cuda-10.0-download-archive).
* [clang release/13.x](https://github.com/llvm/llvm-project/tree/release/13.x) using [CUDA 11.4](https://developer.nvidia.com/cuda-toolkit-archive)

At this time, compiling with clang enables the CUTLASS SIMT GEMM kernels (sgemm, dgemm, hgemm, igemm) but does not enable TensorCores.

```bash
$ mkdir build && cd build

$ cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
# Add -DCMAKE_CXX_FLAGS=-D__NV_NO_HOST_COMPILER_CHECK=1 -DCMAKE_CUDA_FLAGS=-D__NV_NO_HOST_COMPILER_CHECK=1 if compiler
# checks fail during CMake configuration.

$ make test_unit -j
```
## Using CUTLASS within other applications

Applications should list [`/include`](/include) within their include paths. They must be compiled as C++17 or greater.

**Example:** print the contents of a variable storing half-precision data.
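The example body is elided in this excerpt; a minimal sketch along those lines (assuming only that `cutlass::half_t` converts to `float`) might look like:

```c++
#include <iostream>
#include "cutlass/numeric_types.h"

int main() {
  cutlass::half_t x = cutlass::half_t(2.25f);
  std::cout << "x = " << float(x) << std::endl;   // convert to float for printing
  return 0;
}
```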
@ -345,6 +327,136 @@ Note, the above could be simplified as follows using helper methods defined in `
## Launching a GEMM kernel using CUTLASS 3.0 or newer

**Example:** launch a mixed-precision GEMM targeting Hopper Tensor Cores.

```c++
#include "cutlass/cutlass.h"
|
||||
#include "cutlass/epilogue/collective/default_epilogue.hpp"
|
||||
#include "cutlass/epilogue/thread/linear_combination.h"
|
||||
#include "cutlass/gemm/collective/collective_builder.hpp"
|
||||
#include "cutlass/gemm/device/gemm_universal_adapter.h"
|
||||
#include "cutlass/gemm/kernel/gemm_universal.hpp"
|
||||
|
||||
#include "cutlass/util/host_tensor.h"
|
||||
#include "cutlass/util/packed_stride.hpp"
|
||||
|
||||
using namespace cute;
|
||||
|
||||
int main(int argc, char const **args) {
|
||||
|
||||
// A matrix configuration
|
||||
using ElementA = cutlass::half_t; // Element type for A matrix operand
|
||||
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
|
||||
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
|
||||
|
||||
// B matrix configuration
|
||||
using ElementB = cutlass::half_t; // Element type for B matrix operand
|
||||
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
|
||||
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
|
||||
|
||||
// C/D matrix configuration
|
||||
using ElementC = cutlass::half_t; // Element type for C and D matrix operands
|
||||
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
|
||||
|
||||
// Core kernel configurations
|
||||
using ElementAccumulator = float; // Element type for internal accumulation
|
||||
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature
|
||||
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
|
||||
using TilesShape = Shape<_128,_128,_64>; // Threadblock-level tile size
|
||||
using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster
|
||||
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
|
||||
using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto; // Kernel to launch based on the default setting in the Collective Builder
|
||||
|
||||
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
|
||||
ArchTag, OperatorClass,
|
||||
ElementA, LayoutA, AlignmentA,
|
||||
ElementB, LayoutB, AlignmentB,
|
||||
ElementAccumulator,
|
||||
TilesShape, ClusterShape,
|
||||
cutlass::gemm::collective::StageCountAuto,
|
||||
cutlass::gemm::collective::KernelScheduleAuto
|
||||
>::CollectiveOp;
|
||||
|
||||
using CollectiveEpilogue = cutlass::epilogue::collective::DefaultEpilogue<
|
||||
cutlass::gemm::TagToStrideC_t<LayoutC>,
|
||||
cutlass::gemm::TagToStrideC_t<LayoutC>,
|
||||
cutlass::epilogue::thread::LinearCombination<ElementC, 1, ElementAccumulator, ElementAccumulator>>;
|
||||
|
||||
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
|
||||
Shape<int,int,int>, // Indicates ProblemShape
|
||||
CollectiveMainloop,
|
||||
CollectiveEpilogue
|
||||
>;
|
||||
|
||||
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
|
||||
|
||||
Gemm gemm_op;
|
||||
cutlass::Status status;
|
||||
|
||||
//
|
||||
// Define the problem size
|
||||
//
|
||||
|
||||
int M = 512;
|
||||
int N = 256;
|
||||
int K = 128;
|
||||
|
||||
float alpha = 1.25f;
|
||||
float beta = -1.25f;
|
||||
|
||||
//
|
||||
// Allocate device memory
|
||||
//
|
||||
|
||||
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
|
||||
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
|
||||
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
|
||||
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_D;
|
||||
|
||||
using StrideA = typename Gemm::GemmKernel::StrideA;
|
||||
using StrideB = typename Gemm::GemmKernel::StrideB;
|
||||
using StrideC = typename Gemm::GemmKernel::StrideC;
|
||||
using StrideD = typename Gemm::GemmKernel::StrideD;
|
||||
|
||||
StrideA stride_A;
|
||||
StrideB stride_B;
|
||||
StrideC stride_C;
|
||||
StrideD stride_D;
|
||||
|
||||
stride_A = make_cute_packed_stride(StrideA{}, cute::make_shape(M, K, Int<1>{}));
|
||||
stride_B = make_cute_packed_stride(StrideB{}, cute::make_shape(N, K, Int<1>{}));
|
||||
stride_C = make_cute_packed_stride(StrideC{}, cute::make_shape(M, N, Int<1>{}));
|
||||
stride_D = make_cute_packed_stride(StrideD{}, cute::make_shape(M, N, Int<1>{}));
|
||||
|
||||
block_A.reset(M * K);
|
||||
block_B.reset(K * N);
|
||||
block_C.reset(M * N);
|
||||
block_D.reset(M * N);
|
||||
|
||||
//
|
||||
// Launch GEMM on the device
|
||||
//
|
||||
|
||||
status = gemm_op({
|
||||
cutlass::gemm::GemmUniversalMode::kGemm,
|
||||
{M, N, K},
|
||||
block_A.get(),
|
||||
stride_A,
|
||||
block_B.get(),
|
||||
stride_B,
|
||||
{block_C.get(), stride_C, block_D.get(), stride_D, {alpha, beta}}
|
||||
});
|
||||
|
||||
if (status != cutlass::Status::kSuccess) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
# CUTLASS Library

The [CUTLASS Library](/tools/library) defines an API for managing and executing collections of compiled
@ -4,10 +4,10 @@

# CUTLASS Terminology

**cute::Layout**: A `cute::Layout` vocabulary type composed of the hierarchical `cute::Shape` and `cute::Stride` tuples that is used throughout CUTLASS 3.0 to represent and manipulate thread and data layouts. More details are included in the [CuTe specific tensor type documentation](/media/docs/cute/03_tensor.md).

**cute::Tensor**: A pointer backed by a `cute::Layout` used to represent a tensor. More details are included in the [CuTe specific tensor type documentation](/media/docs/cute/03_tensor.md).

**Capacity**: (scalar) physical number of elements in memory required to store a multidimensional object; expressed as the type's LongIndex type
- example: the capacity of a column-major matrix is `lda * N`

@ -28,8 +28,6 @@

**Numeric Type**: a CUTLASS data type used to represent real-valued quantities; is trivially copyable.

**Pitch Linear**: linear memory allocation obtained from a user-defined 2-D size, which specifies the contiguous and strided dimensions of a tile.

@ -61,17 +59,27 @@ contiguous and strided dimensions of a tile.

**Tile**: partitions of a tensor that have constant extents and layout known at compile time

**Trait**: characteristics of a fully-specialized type, typically used in metaprogramming reflection

**View**: an object containing references to a data structure that it does not own; typically, construction of views is lightweight

**Warp**: a collection of hardware threads executing in lock-step; warp-level operations typically rely on cooperation among the threads within the warp

`AlignedBuffer<T, N>`: statically sized array type; union-safe, no construction guarantee for elements

`Array<T, N>`: container for holding numeric types - handles bit packing for small numeric types (e.g. int4_t, uint4_t, bin1_t)
`sizeof(Array<T, N>)` - gives expected value in units of bytes with minimum storage of `1 B`: (sizeof_bits<T>::value * N) / 8

**Operator**: an object performing a computation on matrix or tensor objects. May be further refined by scope within the execution model hierarchy. Deprecated starting CUTLASS 3.0, replaced by [MMA and Copy atoms from CuTe](/media/docs/cute/0t_mma_atom.md).

**Tile Iterator**: abstraction for accessing and traversing a sequence of tiles in a tensor; CUTLASS specifies [formal concepts for tile iterators](tile_iterator_concept.md). Deprecated starting CUTLASS 3.0. Replaced by `cute::Layout` in equivalent usage scenarios to represent data tensors.

**Thread Map**: abstraction for defining how threads are mapped to a given tile. Deprecated starting CUTLASS 3.0. Replaced by `cute::Layout` in equivalent usage scenarios to represent thread tensors.
# Copyright

Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
@ -4,9 +4,15 @@

# Tile Iterator Concepts

Note: CUTLASS 3.0 deprecates all tile access iterators in favour of CuTe's single vocabulary type `cute::Tensor`, which is parameterized on `cute::Layout`. `cute::Tensor`s can therefore be manipulated with the same layout algebra as all CuTe layouts. This removes the need for bespoke types that encapsulate iterator properties. The following text thus only applies to the legacy CUTLASS 2.x API and related types.

CUTLASS 2.x implements generic algorithms on tiles of matrix or tensors of constant size. These may be considered as partitions of tensors of infinite size, with a range of partitions accessible by _tile iterators_.

Various data structures may make operations such as random access to tiles inexpensive, while other data structures may not offer random access at all. For example, iterating over a linked

@ -14,7 +20,9 @@ list of matrices requires sequential traversal. Algorithms implemented in terms
should require only the minimum set of operators be defined for tile iterators.

This document describes a set of C++ concepts which may be used to define tile iterators used by CUTLASS algorithms. ("Concept" here does not refer to a C++20 concept that uses the `concept` keyword. Rather, it refers to a set of requirements on a type.) Each concept specifies members and type definitions that a tile iterator must implement. Frequently, a tile iterator implements several concepts, and its members are the union of the members from each individual concept. These definitions were inspired by [Boost "New style" iterator concepts](https://www.boost.org/doc/libs/1_40_0/libs/iterator/doc/new-iter-concepts.html).

@ -23,7 +31,6 @@ The set of all possible combinations of these concepts is quite large, however m
templates can be described by one of several combinations. The section Frequently Used Tile Iterator Concepts describes several common interfaces used throughout CUTLASS.

## Definitions

**_Base Tile Iterator Concept_.** All tile iterators must describe an _Element_ type as well as a _Shape_.
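As an illustrative sketch of that requirement (not an actual CUTLASS iterator):

```c++
#include "cutlass/matrix_shape.h"

// A type modeling the base tile iterator concept exposes an Element type and a Shape.
struct ExampleTileIterator {
  using Element = float;                        // element type accessed through the iterator
  using Shape   = cutlass::MatrixShape<8, 4>;   // compile-time tile extent
};
```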
@ -2,6 +2,13 @@

[README](/README.md#documentation) > **CUTLASS Utilities**

Note: This document discusses utilities commonly used with code that targets CUTLASS 2.x. Although CUTLASS 3.0's primary entry point APIs do not transact in these `cutlass::*` tensor types anymore, users can still find them convenient for managing allocations with trivial affine layouts. For more advanced host-side tensor management, [`cute::Tensor`](/media/docs/cute/03_tensor.md)s can be used on either host or device for any memory space and with the full expressive power of [`cute::Layout`](/media/docs/cute/01_layout.md)s.

# CUTLASS Utilities

CUTLASS utilities are additional template classes that facilitate recurring tasks. These are
New binary files added under `media/images/`:

* `cute/HMMA.8x8x4.NT.png` (535 KiB)
* `cute/HMMA.8x8x4.quadpair.AB.png` (595 KiB)
* `cute/HMMA.8x8x4.quadpair.C.png` (510 KiB)
* `cute/gmma_coremat_cd_fp16.png` (134 KiB)
* `cute/gmma_wg_n_slice.png` (1.9 MiB)
* `cute/logical_divide-and-zipped_divide-2.png` (248 KiB)
* `cute/logical_divide-and-zipped_divide.png` (245 KiB)
* `cutlass-3.0-gemm-peak-performance.png` (279 KiB)
* `cutlass-reduction-in-named-iterators.png` (326 KiB)