Updates for 3.4 release. (#1305)

2024-01-16 10:42:51 -08:00
parent acba5beee5
commit 2f589ffa76
166 changed files with 5996 additions and 4702 deletions
--- a/media/docs/cute/00_quickstart.md
+++ b/media/docs/cute/00_quickstart.md
@@ -6,12 +6,12 @@ The core abstraction of CuTe are the hierarchically multidimensional layouts whi

 ## System Requirements

-CuTe shares CUTLASS 3.0's software requirements,
+CuTe shares CUTLASS 3.x's software requirements,
 including NVCC with a C++17 host compiler.

 ## Knowledge prerequisites

-CuTe is a CUDA C++ library.  It requires C++17
+CuTe is a CUDA C++ header-only library.  It requires C++17
 (the revision of the C++ Standard that was released in 2017).

 Throughout this tutorial, we assume intermediate C++ experience.
@@ -29,8 +29,10 @@ and how to launch kernels.
 ## Building Tests and Examples

 CuTe's tests and examples build and run as part of CUTLASS's normal build process.
+
 CuTe's unit tests live in the [`test/unit/cute`](../../../test/unit/cute) subdirectory.
-Its examples live in the [`examples/cute`](../../../examples/cute) subdirectory.
+
+CuTe's examples live in the [`examples/cute`](../../../examples/cute) subdirectory.

 ## Library Organization

@@ -38,9 +40,9 @@ CuTe is a header-only C++ library, so there is no source code that needs buildin

 |        Directory       |        Contents        |
 |------------------------|------------------------|
-| [`include/cute`](../../../include/cute) | Each header in the top level corresponds to one of the fundamental building blocks of CuTe, such as [`Layout`](../../../include/cute/layout.hpp) or [`Tensor`](../../../include/cute/tensor.hpp). |
-| [`include/cute/container`](../../../include/cute/container) | Implementations of STL-like container objects, such as tuple, array, aligned array, and array views.  |
-| [`include/cute/numeric`](../../../include/cute/numeric) | Templates that handle nonstandard floating-point types, unsigned integers, complex numbers, and integer sequence - like fundamental numeric data types.  |
+| [`include/cute`](../../../include/cute) | Each header in the top level corresponds to one of the fundamental building blocks of CuTe, such as [`Layout`](../../../include/cute/layout.hpp) and [`Tensor`](../../../include/cute/tensor.hpp). |
+| [`include/cute/container`](../../../include/cute/container) | Implementations of STL-like objects, such as tuple, array, and aligned array.  |
+| [`include/cute/numeric`](../../../include/cute/numeric) | Fundamental numeric data types that include nonstandard floating-point types, nonstandard integer types, complex numbers, and integer sequence.  |
 | [`include/cute/algorithm`](../../../include/cute/algorithm) | Implementations of utility algorithms such as copy, fill, and clear that automatically leverage architecture-specific features if available. |
 | [`include/cute/arch`](../../../include/cute/arch) | Wrappers for architecture-specific matrix-matrix multiply and copy instructions. |
 | [`include/cute/atom`](../../../include/cute/atom) | Meta-information for instructions in `arch` and utilities like partitioning and tiling.
@@ -57,7 +59,7 @@ Other files in this directory discuss specific parts of CuTe.

 * [`01_layout.md`](./01_layout.md) describes `Layout`, CuTe's core abstraction.

-* [`02_layout_operations.md`](./02_layout_operations.md) describes more advanced `Layout` operations and the CuTe layout algebra.
+* [`02_layout_algebra.md`](./02_layout_algebra.md) describes more advanced `Layout` operations and the CuTe layout algebra.

 * [`03_tensor.md`](./03_tensor.md) describes `Tensor`,
  a multidimensional array abstraction which composes `Layout`
@@ -74,5 +76,44 @@ Other files in this directory discuss specific parts of CuTe.
 * [`0y_predication.md`](./0y_predication.md) explains what to do
  if a tiling doesn't fit evenly into a matrix.

-* [`0z_tma_tensors.md`](./0z_tma_tensors.md) summarizes
-  how CuTe supports TMA loads and stores.
+* [`0z_tma_tensors.md`](./0z_tma_tensors.md) explains an advanced `Tensor` type that CuTe uses to support TMA loads and stores.
+
+## Quick Tips
+
+### How do I print CuTe objects on host or device?
+
+The `cute::print` function has overloads for almost all CuTe types, including Pointers, Integers, Strides, Shapes, Layouts, and Tensors.  When in doubt, try calling `print` on the object.
+
+CuTe's print functions work on either host or device.
+Note that on device, printing is expensive.
+Even print code that is never executed
+(e.g., printing in an `if` branch that is not taken at run time)
+may cause the compiler to generate slower code.
+Thus, be sure to remove code that prints on device after debugging.
+
+You might also only want to print on thread 0 of each threadblock, or threadblock 0 of the grid.  The `thread0()` function returns true only for global thread 0 of the kernel, that is, for thread 0 of threadblock 0.  A common idiom is to print CuTe objects only on global thread 0.
+
+```c++
+if (thread0()) {
+  print(some_cute_object);
+}
+```
+
+Some algorithms behave differently depending on the thread or threadblock,
+so you may need to print on threads or threadblocks other than zero.
+The header file
+[`cute/util/debug.hpp`](../../../include/cute/util/debug.hpp),
+among other utilities,
+includes the function `bool thread(int tid, int bid)`
+that returns `true` if running on thread `tid` and threadblock `bid`.
+
+#### Other output formats
+
+Some CuTe types have special printing functions that use a different output format.
+
+The `cute::print_layout` function will display any rank-2 layout in a plain text table. This is excellent for visualizing the map from coordinates to indices.
+
+The `cute::print_tensor` function will display any rank-1, rank-2, rank-3, or rank-4 tensor in a plain text multidimensional table. The values of the tensor are printed so you can verify the tile of data is what you expect after a copy, for example.
+
+The `cute::print_latex` function will print LaTeX commands that you can use to build nicely formatted and colored tables via `pdflatex`. This works for `Layout`, `TiledCopy`, and `TiledMMA`, which can be very useful for getting a sense of layout patterns and partitioning patterns within CuTe.
--- a/media/docs/cute/01_layout.md
+++ b/media/docs/cute/01_layout.md
@@ -1,188 +1,187 @@
 # CuTe Layouts

-## Layout
-
 This document describes `Layout`, CuTe's core abstraction.
-A `Layout` maps from a logical coordinate space
+Fundamentally, a `Layout` maps from coordinate space(s)
 to an index space.

 `Layout`s present a common interface to multidimensional array access
 that abstracts away the details of how the array's elements are organized in memory.
 This lets users write algorithms that access multidimensional arrays generically,
-so that layouts can change, without users' code needing to change.
+so that layouts can change, without users' code needing to change. For example, a row-major MxN layout and a column-major MxN layout can be treated identically in software. 

 CuTe also provides an "algebra of `Layout`s."
 `Layout`s can be combined and manipulated
 to construct more complicated layouts
-and to partition them across other layouts.
+and to tile layouts across other layouts.
 This can help users do things like partition layouts of data over layouts of threads.

-## Layouts and Tensors
+## Fundamental Types and Concepts

-Any of the `Layout`s discussed in this section can be composed with data -- e.g., a pointer or an array -- to create a `Tensor`.
-The `Layout`'s logical coordinate space represents the logical "shape" of the data,
-e.g., the modes of the `Tensor` and their extents.
-The `Layout` maps a logical coordinate into an index,
-which is an offset to be used to index into the array of data.
+### Integers

-For details on `Tensor`, please refer to the
-[`Tensor` section of the tutorial](./03_tensor.md).
+CuTe makes extensive use of both dynamic (known only at run time) and static (known at compile time) integers.

-## Shapes and Strides
+* Dynamic integers (or "run-time integers") are just ordinary integral types like `int` or `size_t` or `uint16_t`. Anything that is accepted by `std::is_integral<T>` is considered a dynamic integer in CuTe.

-A `Layout` is a pair of `Shape` and `Stride`.
-Both `Shape` and `Stride` are `IntTuple` types.
+* Static integers (or "compile-time integers") are instantiations of types like `std::integral_constant<T, Value>`. These types encode the value as a `static constexpr` member. They also support casting to their underlying dynamic types, so they can be used in expressions with dynamic integers. CuTe defines its own CUDA-compatible static integer types `cute::C<Value>` along with overloaded math operators so that math on static integers results in static integers. CuTe defines shortcut aliases `Int<1>`, `Int<2>`, `Int<3>` and `_1`, `_2`, `_3` as conveniences, which you should see often within examples.
+
+CuTe attempts to handle static and dynamic integers identically. In the examples that follow, all dynamic integers could be replaced with static integers and vice versa. When we say "integer" in CuTe, we almost always mean a static OR dynamic integer.
+
+CuTe provides a number of traits to work with integers.
+* `cute::is_integral<T>`: Checks whether `T` is a static or dynamic integer type.
+* `cute::is_std_integral<T>`: Checks whether `T` is a dynamic integer type. Equivalent to `std::is_integral<T>`.
+* `cute::is_static<T>`: Checks whether `T` is an empty type (so instantiations cannot depend on any dynamic information). Equivalent to `std::is_empty`.
+* `cute::is_constant<N,T>`: Checks that `T` is a static integer AND its value is equivalent to `N`.
+
+See the [`integral_constant` implementations](../../../include/cute/numeric/integral_constant.hpp) for more information.
+
+### Tuple
+
+A tuple is a finite ordered list of zero or more elements.
+The [`cute::tuple` class](../../../include/cute/container/tuple.hpp) behaves like `std::tuple`, but works on device and host. It imposes restrictions on its template arguments and strips down the implementation for performance and simplicity.

 ### IntTuple

-An `IntTuple` is defined recursively as either a single integer, or a tuple of `IntTuple`s.
-This means that `IntTuple`s can be arbitrarily nested.
+CuTe defines the IntTuple concept as either an integer, or a tuple of IntTuples. Note the recursive definition.
+In C++, we define [operations on `IntTuple`](../../../include/cute/int_tuple.hpp).
+
+Examples of `IntTuple`s include:
+* `int{2}`, the dynamic integer 2.
+* `Int<3>{}`, the static integer 3.
+* `make_tuple(int{2}, Int<3>{})`, the tuple of dynamic-2, and static-3.
+* `make_tuple(uint16_t{42}, make_tuple(Int<1>{}, int32_t{3}), Int<17>{})`, the tuple of dynamic-42, tuple of static-1 and dynamic-3, and static-17.
+
+CuTe reuses the `IntTuple` concept for many different things,
+including Shape, Stride, Step, and Coord
+(see [`include/cute/layout.hpp`](../../../include/cute/layout.hpp)).
+
 Operations defined on `IntTuple`s include the following.

-* `get<I>(IntTuple)`: The `I`th element of the `IntTuple`.  For an `IntTuple` consisting of a single integer, `get<0>` is just that integer.
-
 * `rank(IntTuple)`: The number of elements in an `IntTuple`. A single integer has rank 1, and a tuple has rank `tuple_size`.

+* `get<I>(IntTuple)`: The `I`th element of the `IntTuple`, with `I < rank`. For single integers, `get<0>` is just that integer.
+
 * `depth(IntTuple)`: The number of hierarchical `IntTuple`s. A single integer has depth 0, a tuple of integers has depth 1, a tuple that contains a tuple of integers has depth 2, etc.

 * `size(IntTuple)`: The product of all elements of the `IntTuple`.

-We write `IntTuple`s with parenthesis to denote the hierarchy. For example, `6`, `(2)`, `(4,3)`, `(3,(6,2),8)` are all `IntTuple`s.
+We write `IntTuple`s with parentheses to denote the hierarchy. For example, `6`, `(2)`, `(4,3)`, and `(3,(6,2),8)` are all `IntTuple`s.

-## Layout
+### Shapes and Strides

-A `Layout` is then a pair of `IntTuple`s. The first element defines the abstract *shape* of the `Layout`, and the second element defines the *strides*, which map from coordinates within the shape to the index space.
+Both `Shape` and `Stride` are `IntTuple` concepts.

-Since a `Layout` is just a pair of `IntTuple`s, we can define operations on `Layout`s analogous to those defined on `IntTuple`.
+### Layout

-* `get<I>(Layout)`: The `I`th sub-layout of the `Layout`.
+A `Layout` is a tuple of (`Shape`, `Stride`).
+Semantically, it implements a mapping from
+any coordinate within the Shape to an index via the Stride.

-* `rank(Layout)`: The number of modes in a `Layout`.
+### Tensor

-* `depth(Layout)`: The number of hierarchical `Layout`s. A single integer has depth 0, a tuple of integers has depth 1, a tuple that contains a tuple of integers has depth 2, etc.
+A `Layout` can be composed with data -- e.g., a pointer or an array -- to create a `Tensor`. The index generated by the `Layout` is used to subscript an iterator to retrieve the appropriate data. For details on `Tensor`, please refer to the
+[`Tensor` section of the tutorial](./03_tensor.md).
+
+## Layout Creation and Use
+
+A `Layout` is a pair of `IntTuple`s: the `Shape` and the `Stride`. The first element defines the abstract *shape* of the `Layout`, and the second element defines the *strides*, which map from coordinates within the shape to the index space.
+
+We define many operations on `Layout`s analogous to those defined on `IntTuple`.
+
+* `rank(Layout)`: The number of modes in a `Layout`. Equivalent to the tuple size of the `Layout`'s shape.
+
+* `get<I>(Layout)`: The `I`th sub-layout of the `Layout`, with `I < rank`.
+
+* `depth(Layout)`: The depth of the `Layout`'s shape. A single integer has depth 0, a tuple of integers has depth 1, a tuple of tuples of integers has depth 2, etc.

 * `shape(Layout)`: The shape of the `Layout`.

 * `stride(Layout)`: The stride of the `Layout`.

-* `size(Layout)`: The logical extent of the `Layout`. Equivalent to `size(shape(Layout))`.
+* `size(Layout)`: The size of the `Layout` function's domain.  Equivalent to `size(shape(Layout))`.
+
+* `cosize(Layout)`: The size of the `Layout` function's codomain (not necessarily the range). Equivalent to `A(size(A) - 1) + 1`.
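+
+To make the difference between `size` and `cosize` concrete, here is a minimal plain C++ sketch. It is not CuTe code: `flat_size` and `flat_cosize` are hypothetical helpers that assume a flattened shape with non-negative strides, and they simply evaluate `A(size(A) - 1) + 1` by hand.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical helper, not part of CuTe: size of a flattened shape,
+// i.e. the product of its extents (the size of the domain).
+int flat_size(std::vector<int> const& shape) {
+  int n = 1;
+  for (int extent : shape) n *= extent;
+  return n;
+}
+
+// Hypothetical helper, not part of CuTe: cosize of a flattened layout
+// with non-negative strides, i.e. (index of the last coordinate) + 1.
+int flat_cosize(std::vector<int> const& shape, std::vector<int> const& stride) {
+  assert(shape.size() == stride.size());
+  int last_index = 0;
+  for (std::size_t i = 0; i < shape.size(); ++i) {
+    last_index += (shape[i] - 1) * stride[i];
+  }
+  return last_index + 1;
+}
+```
+
+Under these assumptions, the layout `8:2` has size 8 but cosize 15, because its last element lands at index 14.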

 ### Hierarchical access functions

-`IntTuple`s and thus `Layout`s can be arbitrarily nested.
+`IntTuple`s and `Layout`s can be arbitrarily nested.
 For convenience, we define versions of some of the above functions
 that take a sequence of integers, instead of just one integer.
 This makes it possible to access elements
-inside of nested `IntTuple` or `Layout`.
-For example, we permit `get<I...>(x)`, where `I...` here
-and throughout this section is a "C++ parameter pack"
-that denotes zero or more (integer) template arguments.
-That is, `get<I0,I1,...,IN>(x)` is equivalent to
-`get<IN>(` $\dots$ `(get<I1>(get<I0>(x)))` $\dots$ `))`,
-where the ellipses are pseudocode and not actual C++ syntax.
-These hierarchical access functions include the following.
+inside of nested `IntTuple` or `Layout` more easily.
+For example, we permit `get<I...>(x)`, where `I...` is a "C++ parameter pack" that denotes zero or more (integer) template arguments. These hierarchical access functions include the following.
+
+* `get<I0,I1,...,IN>(x) := get<IN>(...(get<I1>(get<I0>(x)))...)`. Extract the `IN`th of the ... of the `I1`th of the `I0`th element of `x`.

 * `rank<I...>(x)  := rank(get<I...>(x))`. The rank of the `I...`th element of `x`.

 * `depth<I...>(x) := depth(get<I...>(x))`. The depth of the `I...`th element of `x`.

+* `shape<I...>(x)  := shape(get<I...>(x))`. The shape of the `I...`th element of `x`.
+
 * `size<I...>(x)  := size(get<I...>(x))`. The size of the `I...`th element of `x`.

-### Vector examples
+In the following examples, you'll see use of `size<0>` and `size<1>` to determine loop bounds for the 0th and 1st mode of a layout or tensor.

-We define a vector as any `Shape` and `Stride` pair with `rank == 1`.
-For example, the `Layout`
-
-```
-Shape:  (8)
-Stride: (1)
-```
-
-defines a contiguous 8-element vector.
-For a vector with the same Shape but a Stride of `(2)`,
-the interpretation is that the eight elements
-are stored at positions 0, 2, 4, $\dots$, 14.
-
-By the above definition, we *also* interpret
-
-```
-Shape:  ((4,2))
-Stride: ((1,4))
-```
-
-as a vector, since its shape is rank 1. The inner shape describes a 4x2 layout of data in column-major order, but the extra pair of parenthesis suggest we can interpret those two modes as a single 1-D 8-element vector instead. Due to the strides, the elements are also contiguous.
-
-### Matrix examples
-
-Generalizing, we define a matrix as any `Shape` and `Stride` pair with rank 2. For example,
-
-```
-Shape:  (4,2)
-Stride: (1,4)
-  0   4
-  1   5
-  2   6
-  3   7
-```
-
-is a 4x2 column-major matrix, and
-
-```
-Shape:  (4,2)
-Stride: (2,1)
-  0   1
-  2   3
-  4   5
-  6   7
-```
-
-is a 4x2 row-major matrix.
-
-Each of the modes of the matrix can also be split into *multi-indices* like the vector example.
-This lets us express more layouts beyond just row major and column major. For example,
-
-```
-Shape:  ((2,2),2)
-Stride: ((4,1),2)
-  0   2
-  4   6
-  1   3
-  5   7
-```
-
-is also logically 4x2, with a stride of 2 across the rows but a multi-stride down the columns.
-Since this layout is logically 4x2,
-like the column-major and row-major examples above,
-we can _still_ use 2-D coordinates to index into it.
-
-## Constructing a `Layout`
+### Constructing a Layout

 A `Layout` can be constructed in many different ways.
 It can include any combination of compile-time (static) integers
 or run-time (dynamic) integers.

 ```c++
-auto layout_8s = make_layout(Int<8>{});
-auto layout_8d = make_layout(8);
+Layout s8 = make_layout(Int<8>{});
+Layout d8 = make_layout(8);

-auto layout_2sx4s = make_layout(make_shape(Int<2>{},Int<4>{}));
-auto layout_2sx4d = make_layout(make_shape(Int<2>{},4));
+Layout s2xs4 = make_layout(make_shape(Int<2>{},Int<4>{}));
+Layout s2xd4 = make_layout(make_shape(Int<2>{},4));

-auto layout_2x4 = make_layout(make_shape (2, make_shape (2,2)),
-                              make_stride(4, make_stride(2,1)));
+Layout s2xd4_a = make_layout(make_shape (Int< 2>{},4),
+                             make_stride(Int<12>{},Int<1>{}));
+Layout s2xd4_col = make_layout(make_shape(Int<2>{},4),
+                               LayoutLeft{});
+Layout s2xd4_row = make_layout(make_shape(Int<2>{},4),
+                               LayoutRight{});
+
+Layout s2xh4 = make_layout(make_shape (2,make_shape (2,2)),
+                           make_stride(4,make_stride(2,1)));
+Layout s2xh4_col = make_layout(shape(s2xh4),
+                               LayoutLeft{});
 ```

 The `make_layout` function returns a `Layout`.
-It deduces the returned `Layout`'s template arguments from the function's arguments.
+It deduces the types of the function's arguments and returns a `Layout` with the appropriate template arguments.
 Similarly, the `make_shape` and `make_stride` functions
 return a `Shape` resp. `Stride`.
-CuTe often uses these `make_*` functions,
-because constructor template argument deduction (CTAD)
-does not work for `cute::tuple` as it works for `std::tuple`.
+CuTe often uses these `make_*` functions
+due to restrictions around constructor template argument deduction (CTAD) and to avoid having to repeat static or dynamic integer types.

-## Using a `Layout`
+When the `Stride` argument is omitted, it is generated from the provided `Shape` with `LayoutLeft` as default. The `LayoutLeft` tag constructs strides as an exclusive prefix product of the `Shape` from left to right, without regard to the `Shape`'s hierarchy. This can be considered a "generalized column-major stride generation". The `LayoutRight` tag constructs strides as an exclusive prefix product of the `Shape` from right to left, without regard to the `Shape`'s hierarchy. For shapes of depth one, this can be considered a "row-major stride generation", but for hierarchical shapes the resulting strides may be surprising. For example, the strides of `s2xh4` above could be generated with `LayoutRight`.
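+
+The two stride-generation rules can be sketched in plain C++ as follows. This is an illustration only, not CuTe's implementation: `left_strides` and `right_strides` are hypothetical helpers that operate on an already flattened shape of dynamic integers, which is what "without regard to the `Shape`'s hierarchy" amounts to.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical helper, not CuTe API: LayoutLeft-style strides, i.e. an
+// exclusive prefix product of the flattened shape from left to right.
+std::vector<int> left_strides(std::vector<int> const& shape) {
+  std::vector<int> stride(shape.size());
+  int running = 1;
+  for (std::size_t i = 0; i < shape.size(); ++i) {
+    stride[i] = running;   // product of all extents to the left of mode i
+    running *= shape[i];
+  }
+  return stride;
+}
+
+// Hypothetical helper, not CuTe API: LayoutRight-style strides, i.e. an
+// exclusive prefix product of the flattened shape from right to left.
+std::vector<int> right_strides(std::vector<int> const& shape) {
+  std::vector<int> stride(shape.size());
+  int running = 1;
+  for (std::size_t i = shape.size(); i-- > 0;) {
+    stride[i] = running;   // product of all extents to the right of mode i
+    running *= shape[i];
+  }
+  return stride;
+}
+```
+
+For the flattened shape `(2,2,2)`, the right-to-left rule yields `(4,2,1)`, which regrouped as `(4,(2,1))` is the stride of `s2xh4`.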

-The fundamental use of a `Layout` is to map between logical coordinate space(s) and an index space. For example, to print an arbitrary rank-2 layout, we can write the function
+Calling `print` on each layout above results in the following:
+
+```
+s8        :  _8:_1
+d8        :  8:_1
+s2xs4     :  (_2,_4):(_1,_2)
+s2xd4     :  (_2,4):(_1,_2)
+s2xd4_a   :  (_2,4):(_12,_1)
+s2xd4_col :  (_2,4):(_1,_2)
+s2xd4_row :  (_2,4):(4,_1)
+s2xh4     :  (2,(2,2)):(4,(2,1))
+s2xh4_col :  (2,(2,2)):(_1,(2,4))
+```
+
+The `Shape:Stride` notation is used quite often for `Layout`. The `_N` notation is shorthand for a static integer while other integers are dynamic integers. Observe that both `Shape` and `Stride` may be composed of both static and dynamic integers.
+
+Also note that the `Shape` and `Stride` are assumed to be *congruent*. That is, `Shape` and `Stride` have the same tuple profiles. For every integer in `Shape`, there is a corresponding integer in `Stride`. This can be asserted with
+```cpp
+static_assert(congruent(my_shape, my_stride));
+```
+
+### Using a Layout
+
+The fundamental use of a `Layout` is to map between coordinate space(s) defined by the `Shape` and an index space defined by the `Stride`. For example, to print an arbitrary rank-2 layout in a 2-D table, we can write the function

 ```c++
 template <class Shape, class Stride>
@@ -200,20 +199,24 @@ void print2D(Layout<Shape,Stride> const& layout)
 which produces the following output for the above examples.

 ```
-> print2D(layout_2sx4s)
-  0   2   4   6
-  1   3   5   7
-> print2D(layout_2sx4d)
-  0   2   4   6
-  1   3   5   7
-> print2D(layout_2x4)
-  0   2   1   3
-  4   6   5   7
+> print2D(s2xs4)
+  0    2    4    6  
+  1    3    5    7  
+> print2D(s2xd4_a)
+  0    1    2    3  
+ 12   13   14   15
+> print2D(s2xh4_col)
+  0    2    4    6  
+  1    3    5    7  
+> print2D(s2xh4)
+  0    2    1    3  
+  4    6    5    7 
 ```

-The multi-indices within the `layout_2x4` example are handled as expected and interpreted as a rank-2 layout.
+We can see static, dynamic, row-major, column-major, and hierarchical layouts printed here. The statement `layout(m,n)` provides the mapping of
+the logical 2-D coordinate (m,n) to the 1-D index.

-Note that for `layout_2x4`, we're using a 1-D coordinate for a 2-D multi-index in the second mode. In fact, we can generalize this and treat all of the above layouts as 1-D layouts.  For instance, the following `print1D` function
+Interestingly, the `s2xh4` example isn't row-major or column-major. Furthermore, it has three modes but is still interpreted as rank-2 and we're using a 2-D coordinate. Specifically, `s2xh4` has a 2-D multi-mode in the second mode, but we're still able to use a 1-D coordinate for that mode. More on this in the next section, but first we can generalize this another step. Let's use a 1-D coordinate and treat all of the modes of each layout as a single multi-mode.  For instance, the following `print1D` function

 ```c++
 template <class Shape, class Stride>
@@ -228,39 +231,305 @@ void print1D(Layout<Shape,Stride> const& layout)
 produces the following output for the above examples.

 ```
-> print1D(layout_8s)
-  0   1   2   3   4   5   6   7
-> print1D(layout_8d)
-  0   1   2   3   4   5   6   7
-> print1D(layout_2sx4s)
-  0   1   2   3   4   5   6   7
-> print1D(layout_2sx4d)
-  0   1   2   3   4   5   6   7
-> print1D(layout_2x4)
-  0   4   2   6   1   5   3   7
+> print1D(s2xs4)
+  0    1    2    3    4    5    6    7  
+> print1D(s2xd4_a)
+  0   12    1   13    2   14    3   15  
+> print1D(s2xh4_col)
+  0    1    2    3    4    5    6    7  
+> print1D(s2xh4)
+  0    4    2    6    1    5    3    7  
 ```

-This shows explicitly that all of the layouts are simply folded views of an 8-element array.
+Any multi-mode of a layout, including the entire layout itself, can accept a 1-D coordinate. More on this in the following sections.
+
+CuTe provides more printing utilities for visualizing Layouts. The `print_layout` function produces a formatted 2-D table of the Layout's mapping.
+
+```text
+> print_layout(s2xh4)
+(2,(2,2)):(4,(2,1))
+      0   1   2   3 
+    +---+---+---+---+
+ 0  | 0 | 2 | 1 | 3 |
+    +---+---+---+---+
+ 1  | 4 | 6 | 5 | 7 |
+    +---+---+---+---+
+```
+
+The `print_latex` function generates LaTeX that can be compiled with `pdflatex` into a color-coded vector graphics image of the same 2-D table.
+
+### Vector Layouts
+
+We define a vector as any `Layout` with `rank == 1`.
+For example, the layout `8:1` can be interpreted as an 8-element vector whose indices are contiguous. 
+
+```
+Layout:  8:1
+Coord :  0  1  2  3  4  5  6  7
+Index :  0  1  2  3  4  5  6  7 
+```
+
+Similarly, 
+the layout `8:2` can be interpreted as an 8-element vector where the indices of the elements are strided by `2`.
+
+```
+Layout:  8:2
+Coord :  0  1  2  3  4  5  6  7
+Index :  0  2  4  6  8 10 12 14
+```
+
+By the above rank-1 definition, we *also* interpret layout `((4,2)):((2,1))` as a vector, since its shape is rank-1. The inner shape looks like a 4x2 row-major matrix, but the extra pair of parentheses suggests we can interpret those two modes as a 1-D 8-element vector. The strides tell us that the first `4` elements are strided by `2` and then there are `2` of those first elements strided by `1`.
+
+```
+Layout:  ((4,2)):((2,1))
+Coord :  0  1  2  3  4  5  6  7
+Index :  0  2  4  6  1  3  5  7 
+```
+
+We can see the second set of `4` elements are duplicates of the first `4` with an extra stride of `1`.
+
+Consider the layout `((4,2)):((1,4))`. Again, it's `4` elements strided by `1` and then `2` of those first elements strided by `4`.
+
+```
+Layout:  ((4,2)):((1,4))
+Coord :  0  1  2  3  4  5  6  7
+Index :  0  1  2  3  4  5  6  7 
+```
+
+As a function from integers to integers, it's identical to `8:1`. It's the identity function.
+
+### Matrix examples
+
+Generalizing, we define a matrix as any `Layout` that is rank-2. For example,
+
+```
+Shape :  (4,2)
+Stride:  (1,4)
+  0   4
+  1   5
+  2   6
+  3   7
+```
+
+is a 4x2 column-major layout with stride-1 down the columns and stride-4 across the rows, and
+
+```
+Shape :  (4,2)
+Stride:  (2,1)
+  0   1
+  2   3
+  4   5
+  6   7
+```
+
+is a 4x2 row-major layout with stride-2 down the columns and stride-1 across the rows. Majorness is simply which mode has stride-1.
+
+Just like the vector layouts, each of the modes of the matrix can also be split into *multi-modes*.
+This lets us express more layouts beyond just row-major and column-major. For example,
+
+```
+Shape:  ((2,2),2)
+Stride: ((4,1),2)
+  0   2
+  4   6
+  1   3
+  5   7
+```
+
+is also logically 4x2, with stride-2 across the rows but a multi-stride down the columns. The first `2` elements down the column have a stride of `4` and then there is a copy of those with stride-1. Since this layout is logically 4x2,
+like the column-major and row-major examples above,
+we can _still_ use 2-D coordinates to index into it.
+
+## Layout Concepts
+
+In this section, we'll introduce the coordinate sets that `Layout`s accept and how the coordinate mappings and index mappings are computed.
+
+### Layout compatibility
+
+We say that layout A is *compatible* with layout B if the shape of A is compatible with the shape of B.
+Shape A is compatible with shape B if
+
+* the size of A is equal to the size of B and
+* all coordinates within A are valid coordinates within B.
+
+For example:
+* Shape 24 is NOT compatible with Shape 32.
+* Shape 24 is compatible with Shape (4,6).
+* Shape (4,6) is compatible with Shape ((2,2),6).
+* Shape ((2,2),6) is compatible with Shape ((2,2),(3,2)).
+* Shape 24 is compatible with Shape ((2,2),(3,2)).
+* Shape 24 is compatible with Shape ((2,3),4).
+* Shape ((2,3),4) is NOT compatible with Shape ((2,2),(3,2)).
+* Shape ((2,2),(3,2)) is NOT compatible with Shape ((2,3),4).
+* Shape 24 is compatible with Shape (24).
+* Shape (24) is NOT compatible with Shape 24.
+* Shape (24) is NOT compatible with Shape (4,6).
+
+That is, *compatible* is a weak partial order on Shapes as it is reflexive, antisymmetric, and transitive.
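+
+The examples above can be reproduced with a small structural check, sketched here in plain C++. This is a hedged illustration rather than CuTe's implementation: `ShapeNode`, `leaf`, `tuple_of`, `shape_size`, and `compatible_shapes` are hypothetical names, and the rule used (an integer matches anything of equal size, tuples must match mode by mode) is an assumption chosen because it agrees with every example in the list.
+
+```cpp
+#include <cstddef>
+#include <utility>
+#include <vector>
+
+// Hypothetical model of a Shape, not CuTe's type: either a single
+// integer (a leaf) or a tuple of nested shapes.
+struct ShapeNode {
+  bool is_int = true;
+  int value = 1;                 // used when is_int is true
+  std::vector<ShapeNode> modes;  // used when is_int is false
+};
+
+ShapeNode leaf(int v) {
+  ShapeNode s;
+  s.is_int = true;
+  s.value = v;
+  return s;
+}
+
+ShapeNode tuple_of(std::vector<ShapeNode> modes) {
+  ShapeNode s;
+  s.is_int = false;
+  s.modes = std::move(modes);
+  return s;
+}
+
+// Product of all integers in the (possibly nested) shape.
+int shape_size(ShapeNode const& s) {
+  if (s.is_int) return s.value;
+  int n = 1;
+  for (ShapeNode const& m : s.modes) n *= shape_size(m);
+  return n;
+}
+
+// Assumed rule: is shape a compatible with shape b?
+bool compatible_shapes(ShapeNode const& a, ShapeNode const& b) {
+  if (a.is_int) return a.value == shape_size(b);  // integer vs. anything
+  if (b.is_int) return false;                     // tuple vs. integer
+  if (a.modes.size() != b.modes.size()) return false;
+  for (std::size_t i = 0; i < a.modes.size(); ++i) {
+    if (!compatible_shapes(a.modes[i], b.modes[i])) return false;
+  }
+  return true;
+}
+```
+
+For instance, `compatible_shapes(leaf(24), tuple_of({leaf(4), leaf(6)}))` holds, while swapping in `tuple_of({leaf(24)})` as the first argument does not, mirroring the asymmetry between Shape 24 and Shape (24).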
+
+### Layout Coordinates
+
+With the notion of compatibility above, we emphasize that every `Layout` accepts multiple kinds of coordinates. Every `Layout` accepts coordinates for any `Shape` that is compatible with it. CuTe provides mappings between these sets of coordinates via a colexicographical order.
+
+Thus, all Layouts provide two fundamental mappings:
+
+* the map from an input coordinate to the corresponding natural coordinate via the `Shape`,
+* and the map from a natural coordinate to the index via the `Stride`.
+
+#### Coordinate Mapping
+
+The map from an input coordinate to a natural coordinate is the application of a colexicographical order (reading right to left, instead of "lexicographical," which reads left to right) within the `Shape`.
+
+Take the shape `(3,(2,3))`, for example. This shape has three coordinate sets: the 1-D coordinates, the 2-D coordinates, and the natural (h-D) coordinates.
+
+|  1-D  |   2-D   |   Natural   | |  1-D  |   2-D   |   Natural   |
+| ----- | ------- | ----------- |-| ----- | ------- | ----------- |
+|  `0`  | `(0,0)` | `(0,(0,0))` | |  `9`  | `(0,3)` | `(0,(1,1))` |
+|  `1`  | `(1,0)` | `(1,(0,0))` | | `10`  | `(1,3)` | `(1,(1,1))` |
+|  `2`  | `(2,0)` | `(2,(0,0))` | | `11`  | `(2,3)` | `(2,(1,1))` |
+|  `3`  | `(0,1)` | `(0,(1,0))` | | `12`  | `(0,4)` | `(0,(0,2))` |
+|  `4`  | `(1,1)` | `(1,(1,0))` | | `13`  | `(1,4)` | `(1,(0,2))` |
+|  `5`  | `(2,1)` | `(2,(1,0))` | | `14`  | `(2,4)` | `(2,(0,2))` |
+|  `6`  | `(0,2)` | `(0,(0,1))` | | `15`  | `(0,5)` | `(0,(1,2))` |
+|  `7`  | `(1,2)` | `(1,(0,1))` | | `16`  | `(1,5)` | `(1,(1,2))` |
+|  `8`  | `(2,2)` | `(2,(0,1))` | | `17`  | `(2,5)` | `(2,(1,2))` |
+
+Each coordinate into the shape `(3,(2,3))` has two *equivalent* coordinates and all equivalent coordinates map to the same natural coordinate. To emphasize again, because all of the above coordinates are valid inputs, a Layout with Shape `(3,(2,3))` can be used as if it is a 1-D array of 18 elements by using the 1-D coordinates, a 2-D matrix of 3x6 elements by using the 2-D coordinates, or an h-D tensor of 3x(2x3) elements by using the h-D (natural) coordinates.
+
+The previous 1-D print demonstrates how CuTe identifies 1-D coordinates with a colexicographical ordering of 2-D coordinates.  Iterating from `i = 0` up to (but not including) `size(layout)` and indexing into our layout with the single integer coordinate `i` traverses the 2-D coordinates in this "generalized-column-major" order, even if the layout maps coordinates to indices in a row-major or more complex fashion.
+
+The function `cute::idx2crd(idx, shape)` is responsible for the coordinate mapping. It will take any coordinate within the shape and compute the equivalent natural coordinate for that shape.
+```cpp
+auto shape = Shape<_3,Shape<_2,_3>>{};
+print(idx2crd(   16, shape));                                // (1,(1,2))
+print(idx2crd(_16{}, shape));                                // (_1,(_1,_2))
+print(idx2crd(make_coord(   1,5), shape));                   // (1,(1,2))
+print(idx2crd(make_coord(_1{},5), shape));                   // (_1,(1,2))
+print(idx2crd(make_coord(   1,make_coord(1,   2)), shape));  // (1,(1,2))
+print(idx2crd(make_coord(_1{},make_coord(1,_2{})), shape));  // (_1,(1,_2))
+```
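+
+For intuition, the 1-D case of this mapping can be written out by hand. The sketch below is plain C++, not CuTe's `idx2crd`: `idx2crd_flat` is a hypothetical helper that assumes a flattened shape of dynamic integers and returns the flattened natural coordinate, with the leftmost mode varying fastest.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical helper, not CuTe's idx2crd: converts a 1-D coordinate
+// into the coordinate of a flattened shape in colexicographical order.
+std::vector<int> idx2crd_flat(int idx, std::vector<int> const& shape) {
+  std::vector<int> crd(shape.size());
+  for (std::size_t i = 0; i < shape.size(); ++i) {
+    crd[i] = idx % shape[i];  // position within mode i
+    idx /= shape[i];          // carry the remainder to the next mode
+  }
+  return crd;
+}
+```
+
+With the flattened shape `(3,2,3)`, the 1-D coordinate `16` becomes `(1,1,2)`, the flattened form of the natural coordinate `(1,(1,2))` in the table above.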
+
+#### Index Mapping
+
+The map from a natural coordinate to an index is performed by taking the inner product of the natural coordinate with the `Layout`'s `Stride`.
+
+Take the layout `(3,(2,3)):(3,(12,1))`, for example. Then a natural coordinate `(i,(j,k))` will result in the index `i*3 + j*12 + k*1`. The indices this layout computes are shown in the 2-D table below where `i` is used as the row coordinate and `(j,k)` is used as the column coordinate.
+
+```
+       0     1     2     3     4     5     <== 1-D col coord
+     (0,0) (1,0) (0,1) (1,1) (0,2) (1,2)   <== 2-D col coord (j,k)
+    +-----+-----+-----+-----+-----+-----+
+ 0  |  0  |  12 |  1  |  13 |  2  |  14 |
+    +-----+-----+-----+-----+-----+-----+
+ 1  |  3  |  15 |  4  |  16 |  5  |  17 |
+    +-----+-----+-----+-----+-----+-----+
+ 2  |  6  |  18 |  7  |  19 |  8  |  20 |
+    +-----+-----+-----+-----+-----+-----+
+```
+
+The function `cute::crd2idx(c, shape, stride)` is responsible for the index mapping. It will take any coordinate within the shape, compute the equivalent natural coordinate for that shape (if it is not already), and compute the inner product with the strides.
+```cpp
+auto shape  = Shape <_3,Shape<  _2,_3>>{};
+auto stride = Stride<_3,Stride<_12,_1>>{};
+print(crd2idx(   16, shape, stride));       // 17
+print(crd2idx(_16{}, shape, stride));       // _17
+print(crd2idx(make_coord(   1,   5), shape, stride));  // 17
+print(crd2idx(make_coord(_1{},   5), shape, stride));  // 17
+print(crd2idx(make_coord(_1{},_5{}), shape, stride));  // _17
+print(crd2idx(make_coord(   1,make_coord(   1,   2)), shape, stride));  // 17
+print(crd2idx(make_coord(_1{},make_coord(_1{},_2{})), shape, stride));  // _17
+```
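
The index mapping can be modeled the same way. The sketch below is plain C++ rather than CuTe code: `natural_to_index` and `index_of_1d` are hypothetical names, and the shape and stride are assumed to be given in flattened form (`{3,2,3}` and `{3,12,1}` for the layout above).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inner product of a flattened natural coordinate with
// the flattened strides.
int natural_to_index(const std::vector<int>& coord,
                     const std::vector<int>& flat_stride) {
  assert(coord.size() == flat_stride.size());
  int index = 0;
  for (std::size_t m = 0; m < coord.size(); ++m) {
    index += coord[m] * flat_stride[m];
  }
  return index;
}

// Hypothetical helper: convert a 1-D coordinate to its natural coordinate one
// mode at a time and accumulate the inner product along the way.
int index_of_1d(int idx, const std::vector<int>& flat_shape,
                const std::vector<int>& flat_stride) {
  assert(flat_shape.size() == flat_stride.size());
  int index = 0;
  for (std::size_t m = 0; m < flat_shape.size(); ++m) {
    index += (idx % flat_shape[m]) * flat_stride[m];
    idx /= flat_shape[m];
  }
  return index;
}
```

Both `natural_to_index({1, 1, 2}, {3, 12, 1})` and `index_of_1d(16, {3, 2, 3}, {3, 12, 1})` evaluate to `17`, matching the `crd2idx` results listed above.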
+
+## Layout Manipulation
+
+### Sublayouts
+
+Sublayouts can be retrieved with `layout<I...>`
+```cpp
+Layout a   = Layout<Shape<_4,Shape<_3,_6>>>{}; // (4,(3,6)):(1,(4,12))
+Layout a0  = layout<0>(a);                     // 4:1     
+Layout a1  = layout<1>(a);                     // (3,6):(4,12)
+Layout a10 = layout<1,0>(a);                   // 3:4
+Layout a11 = layout<1,1>(a);                   // 6:12
+```
+or `select<I...>`
+```cpp
+Layout a   = Layout<Shape<_2,_3,_5,_7>>{};     // (2,3,5,7):(1,2,6,30)
+Layout a13 = select<1,3>(a);                   // (3,7):(2,30)
+Layout a013 = select<0,1,3>(a);                // (2,3,7):(1,2,30)
+Layout a2  = select<2>(a);                     // (5):(6)
+```
+or `take<ModeBegin, ModeEnd>`
+```cpp
+Layout a   = Layout<Shape<_2,_3,_5,_7>>{};     // (2,3,5,7):(1,2,6,30)
+Layout a13 = take<1,3>(a);                     // (3,5):(2,6)
+Layout a14 = take<1,4>(a);                     // (3,5,7):(2,6,30)
+// take<1,1> not allowed. Empty layouts not allowed.
+```
+
+### Concatenation
+
+A `Layout` can be provided to `make_layout` to wrap and concatenate
+```cpp
+Layout a = Layout<_3,_1>{};                     // 3:1
+Layout b = Layout<_4,_3>{};                     // 4:3
+Layout row = make_layout(a, b);                 // (3,4):(1,3)
+Layout col = make_layout(b, a);                 // (4,3):(3,1)
+Layout q   = make_layout(row, col);             // ((3,4),(4,3)):((1,3),(3,1))
+Layout aa  = make_layout(a);                    // (3):(1)
+Layout aaa = make_layout(aa);                   // ((3)):((1))
+Layout d   = make_layout(a, make_layout(a), a); // (3,(3),3):(1,(1),1)
+```
+or can be combined with `append`, `prepend`, or `replace`
+```cpp
+Layout a = Layout<_3,_1>{};                     // 3:1
+Layout b = Layout<_4,_3>{};                     // 4:3
+Layout ab = append(a, b);                       // (3,4):(1,3)
+Layout ba = prepend(a, b);                      // (4,3):(3,1)
+Layout c  = append(ab, ab);                     // (3,4,(3,4)):(1,3,(1,3))
+Layout d  = replace<2>(c, b);                   // (3,4,4):(1,3,3)
+```
+
+### Grouping
+
+Layout modes can be grouped with `group<ModeBegin, ModeEnd>` and flattened with `flatten`
+```cpp
+Layout a = Layout<Shape<_2,_3,_5,_7>>{}; // (_2,_3,_5,_7)
+Layout b = group<0,2>(a);                // ((_2,_3),_5,_7)
+Layout c = group<1,3>(b);                // ((_2,_3),(_5,_7))
+Layout f = flatten(c);                   // (_2,_3,_5,_7)
+```
+
+### Slicing
+
+`Layout`s can be sliced, but slicing is more appropriate to perform on `Tensor`s. See the [`Tensor` section](./03_tensor.md) for slicing details.

 ## Summary

 * The `Shape` of a `Layout` defines its coordinate space(s).

    * Every `Layout` has a 1-D coordinate space.
-      This can be used to iterate in a "generalized-column-major" order.
+      This can be used to iterate over the coordinate spaces in a colexicographical order.

    * Every `Layout` has a R-D coordinate space,
      where R is the rank of the layout.
-      These spaces are ordered _colexicographically_
-      (reading right to left, instead of "lexicographically,"
-      which reads left to right).
-      The enumeration of that order
-      corresponds to the 1-D coordinates above.
+      The colexicographical enumeration of the R-D coordinates
+      corresponds to the 1-D coordinates above.

-    * Every `Layout` has an h-D coordinate space where h is "hierarchical." These are ordered colexicographically and the enumeration of that order corresponds to the 1-D coordinates above. An h-D coordinate is congruent to the `Shape` so that each element of the coordinate has a corresponding element of the `Shape`.
+    * Every `Layout` has an h-D (natural) coordinate space where h is "hierarchical." These are ordered colexicographically and the enumeration of that order corresponds to the 1-D coordinates above. A natural coordinate is *congruent* to the `Shape` so that each element of the coordinate has a corresponding element of the `Shape`.

 * The `Stride` of a `Layout` maps coordinates to indices.

-    * In general, this could be any function from 1-D coordinates (integers) to indices (integers).
+    * The inner product of the elements of the natural coordinate with the elements of the `Stride` produces the resulting index.

-    * In `CuTe` we use an inner product of the h-D coordinates with the `Stride` elements.
+For each `Layout` there exists an integral `Shape` that is compatible with that `Layout`. Namely, that integral shape is `size(layout)`. We can then observe that
+
+> Layouts are functions from integers to integers.
+
+If you're familiar with the C++23 feature `mdspan`,
+this is an important difference between
+`mdspan` layout mappings and CuTe `Layout`s. In CuTe, `Layout` is a first class citizen, is natively hierarchical to naturally represent functions beyond row-major and column-major, and can similarly be indexed with a hierarchy of coordinates.
+(`mdspan` layout mappings can represent hierarchical functions as well,
+but this requires defining a custom layout.)
+Input coordinates for an `mdspan` must have the same shape as the `mdspan`;
+a multidimensional `mdspan` does not accept 1-D coordinates.
--- a/media/docs/cute/02_layout_algebra.md
+++ b/media/docs/cute/02_layout_algebra.md
@ -0,0 +1,572 @@
+# CuTe Layout Algebra
+
+CuTe provides an "algebra of `Layout`s" to support combining layouts in different ways.  This algebra includes operations such as
+
+* `Layout` functional composition,
+* a notion of `Layout` "product" to reproduce one layout according to another, and
+* a notion of `Layout` "divide" to split one layout according to another. 
+
+Common utilities for building complicated layouts from simpler ones depend on the `Layout` product. Common utilities for partitioning layouts (of data, for example) across other layouts (of threads, for example) depend on the `Layout` divide. All of these utilities rely on the functional composition of `Layout`s.
+
+In this section, we'll build up the tools of the `Layout` algebra and explain some of these core operations in detail.
+
+## Coalesce
+
+In the previous section, we summarized `Layout`s with
+> Layouts are functions from integers to integers.
+
+The `coalesce` operation is a "simplify" on functions from integers to integers. If we only care about input integers, then we can manipulate the shape and number of modes of the `Layout` without changing it as a function. The only thing `coalesce` can't change is the `Layout`'s `size`.
+
+More specifically, you can find the checked post-conditions in [the `coalesce` unit test](../../../test/unit/cute/core/coalesce.cpp), which we'll reproduce here:
+```cpp
+// @post size(@a result) == size(@a layout)
+// @post depth(@a result) <= 1
+// @post for all i, 0 <= i < size(@a layout), @a result(i) == @a layout(i)
+Layout coalesce(Layout const& layout)
+```
+
+For example,
+
+```cpp
+auto layout = Layout<Shape <_2,Shape <_1,_6>>,
+                     Stride<_1,Stride<_6,_2>>>{};
+auto result = coalesce(layout);    // _12:_1
+```
+
+where we can see the result has fewer modes and is "simpler." Indeed, this could save us a few operations in the coordinate mapping and index mapping (if those are performed dynamically).
+
+So, how do we get there? 
+
+* We've already seen that column-major `Layout`s like `(_2,_4):(_1,_2)` act identically to `_8:_1` for 1-D coordinates.
+* Modes with size static-1 will always produce a natural coordinate of static-0. They can be ignored no matter the stride.
+
+Generalizing, consider a layout with just two integral modes, `s0:d0` and `s1:d1`.  Denote the result of coalescing this layout as `s0:d0 ++ s1:d1`. Then, there are four cases:
+
+1. `s0:d0  ++  _1:d1  =>  s0:d0`. Ignore modes with size static-1.
+2. `_1:d0  ++  s1:d1  =>  s1:d1`. Ignore modes with size static-1.
+3. `s0:d0  ++  s1:s0*d0  =>  s0*s1:d0`. If the second mode's stride is the product of the first mode's size and stride, then they can be combined.
+4. `s0:d0  ++  s1:d1  =>  (s0,s1):(d0,d1)`. Else, nothing can be done and they must be treated separately.
+
+That's it! We can flatten any layout and apply the above binary operation to each pair of adjacent modes in order to "coalesce" the modes of the layout.
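
The four cases can be sketched as a single left-to-right pass over the modes of an already-flattened layout. This is an illustrative model in plain C++, not CuTe's implementation: `Mode` and `coalesce_flat` are hypothetical names, and because every integer here is dynamic, the sketch drops any mode whose size happens to be 1 (CuTe can only drop static-1 modes).

```cpp
#include <cassert>
#include <vector>

// Hypothetical representation of one integral mode s:d.
struct Mode {
  int shape;
  int stride;
};

// Hypothetical helper: apply the binary "++" rule to adjacent modes of a
// flattened layout, from left to right.
std::vector<Mode> coalesce_flat(const std::vector<Mode>& flat) {
  std::vector<Mode> out;
  for (const Mode& m : flat) {
    assert(m.shape > 0);
    if (m.shape == 1) {
      continue;  // cases 1 and 2: size-1 modes contribute nothing
    }
    if (!out.empty() && m.stride == out.back().shape * out.back().stride) {
      out.back().shape *= m.shape;  // case 3: merge into the previous mode
    } else {
      out.push_back(m);  // case 4: keep as a separate mode
    }
  }
  if (out.empty()) {
    out.push_back({1, 0});  // everything was size-1
  }
  return out;
}
```

Applied to the flattened example `(2,1,6):(1,6,2)`, the size-1 mode is skipped and `6:2` merges into `2:1`, which leaves the single mode `12:1`.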
+
+### By-mode Coalesce
+
+Obviously, sometimes we do care about the shape of our `Layout`, but would still like to coalesce. For example, I have a 2-D `Layout` and I would like the result to remain 2-D.
+
+For this reason, there's an overload of `coalesce` that takes an additional parameter
+```cpp
+// Apply coalesce at the terminals of trg_profile
+Layout coalesce(Layout const& layout, IntTuple const& trg_profile)
+```
+
+which can be used as follows
+
+```cpp
+auto a = Layout<Shape <_2,Shape <_1,_6>>,
+                Stride<_1,Stride<_6,_2>>>{};
+auto result = coalesce(a, Step<_1,_1>{});   // (_2,_6):(_1,_2)
+// Identical to 
+auto same_r = make_layout(coalesce(layout<0>(a)), 
+                          coalesce(layout<1>(a)));
+```
+
+This function is recursing into `Step<_1,_1>{}` and applying `coalesce` to the corresponding sublayout whenever it sees an integer (the values don't matter, they're just flags) rather than a tuple.
+
+> This theme of defining an operation that treats a `Layout` as a "1-D" function from integers to integers and then generalizing to use it for an arbitrarily shaped layout will be a common one!
+
+## Composition
+
+Functional composition of `Layout`s is the core of CuTe and is used in just about every higher-level operation. 
+
+Starting again from the observation that `Layout`s are just functions from integers to integers, we can define functional composition that results in another `Layout`. First, an example.
+
+```text
+Functional composition, R := A o B
+R(c) := (A o B)(c) := A(B(c))
+
+Example
+A = (6,2):(8,2)
+B = (4,3):(3,1)
+
+R( 0) = A(B( 0)) = A(B(0,0)) = A( 0) = A(0,0) =  0
+R( 1) = A(B( 1)) = A(B(1,0)) = A( 3) = A(3,0) = 24
+R( 2) = A(B( 2)) = A(B(2,0)) = A( 6) = A(0,1) =  2
+R( 3) = A(B( 3)) = A(B(3,0)) = A( 9) = A(3,1) = 26
+R( 4) = A(B( 4)) = A(B(0,1)) = A( 1) = A(1,0) =  8
+R( 5) = A(B( 5)) = A(B(1,1)) = A( 4) = A(4,0) = 32
+R( 6) = A(B( 6)) = A(B(2,1)) = A( 7) = A(1,1) = 10
+R( 7) = A(B( 7)) = A(B(3,1)) = A(10) = A(4,1) = 34
+R( 8) = A(B( 8)) = A(B(0,2)) = A( 2) = A(2,0) = 16
+R( 9) = A(B( 9)) = A(B(1,2)) = A( 5) = A(5,0) = 40
+R(10) = A(B(10)) = A(B(2,2)) = A( 8) = A(2,1) = 18
+R(11) = A(B(11)) = A(B(3,2)) = A(11) = A(5,1) = 42
+```
+
+The absolutely amazing observation is that the function `R(c) = k` defined above can be written down as another `Layout`
+
+```
+R = ((2,2),3):((24,2),8)
+```
+
+AND 
+
+```
+compatible(B, R)
+```
+
+That is, every coordinate of `B` can also be used as a coordinate of `R`. This is an expected property of functional composition because `B` defines the *domain* of `R`.
+
+You can find many examples and checked post-conditions in [the `composition` unit test](../../../test/unit/cute/core/composition.cpp). The post-conditions are precisely as we just stated.
+```cpp
+// @post compatible(@a layout_b, @a result)
+// @post for all i, 0 <= i < size(@a layout_b), @a result(i) == @a layout_a(@a layout_b(i))
+Layout composition(LayoutA const& layout_a, LayoutB const& layout_b)
+```
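
The second post-condition is easy to check by brute force. The sketch below is plain C++ with hypothetical names (`FlatLayout`, `eval`, `is_composition`); it assumes layouts are supplied in flattened form, which does not change their values on 1-D coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened layout: shape[m]:stride[m] for each mode m.
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

int size_of(const FlatLayout& l) {
  int n = 1;
  for (int s : l.shape) n *= s;
  return n;
}

// Evaluate a layout at a 1-D coordinate (colexicographical order).
int eval(const FlatLayout& l, int idx) {
  assert(l.shape.size() == l.stride.size());
  int offset = 0;
  for (std::size_t m = 0; m < l.shape.size(); ++m) {
    offset += (idx % l.shape[m]) * l.stride[m];
    idx /= l.shape[m];
  }
  return offset;
}

// True if r(i) == a(b(i)) for every 1-D coordinate i of b.
bool is_composition(const FlatLayout& r, const FlatLayout& a,
                    const FlatLayout& b) {
  if (size_of(r) != size_of(b)) return false;
  for (int i = 0; i < size_of(b); ++i) {
    if (eval(r, i) != eval(a, eval(b, i))) return false;
  }
  return true;
}
```

With `A = (6,2):(8,2)`, `B = (4,3):(3,1)`, and `R` flattened to `(2,2,3):(24,2,8)`, `is_composition(R, A, B)` holds for all 12 coordinates in the table above.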
+
+### Computing Composition
+
+First, a few observations:
+
+* `B = (B_0, B_1, ...)`. A layout can be expressed as the concatenation of its sublayouts.
+
+* `A o B = A o (B_0, B_1, ...) = (A o B_0, A o B_1, ...)`. When `B` is injective, composition is left-distributive with concatenation.
+
+With the above, we can assume without loss of generality that `B = s:d` is a layout with integral shape and stride. We can also assume that `A` is a flattened, coalesced layout.
+
+When `A` is integral, `A = a:b`, the result is rather trivial: `R = A o B = a:b o s:d = s:(b*d)`. But when `A` is multimodal, we need to be more careful. 
+
+Put into words, `A o B = A o s:d`, for integral `s` and `d` means that we want (1) every `d`th element of `A`, and then (2) keep the first `s` of those strided elements.
+
+1. Every `d`th element of `A` can be computed by "dividing out" the first `d` elements from the shape of `A`. For an array of integers representing the shape, this is computed as
+```cpp
+void shape_div(int* shapeA, int N, int& strideB) {
+   for (int i = 0; i < N; ++i) {
+      assert(shapeA[i] %   strideB == 0 or 
+               strideB % shapeA[i] == 0);
+      int new_shape  = ceil_div(shapeA[i], strideB);
+      int new_stride = ceil_div(strideB, shapeA[i]);
+      shapeA[i] = new_shape;
+      strideB   = new_stride;
+   }
+}
+```
+which progressively "removes" the first `strideB` elements from `shapeA` starting from the left. For example,
+* `(6,2) /  2 => (3,2)`
+* `(6,2) /  3 => (2,2)`
+* `(6,2) /  6 => (1,2)`
+* `(6,2) / 12 => (1,1)`
+* `(3,6,2,8) / 6 => (1,3,2,8)`
+* `(3,6,2,8) / 9 => (1,2,2,8)`
+* `(42,16,3) / 2 => (21,16,3)`
+* `(42,16,3) / 6 => ( 7,16,3)`
+
+As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **divisibility condition** and is enforced by the `assert` in the above code and statically checked in CuTe when possible.
+
+2. The first `s` elements of the strided `A` layout can be computed by "modding out" the first `s` elements from the shape of `A`. For an array of integers representing the shape, this is computed as
+```cpp
+void shape_mod(int* shapeA, int N, int& shapeB) {
+   for (int i = 0; i < N; ++i) {
+      assert(shapeA[i] %    shapeB == 0 or 
+                shapeB % shapeA[i] == 0);
+      int new_shapeA =      min(shapeA[i], shapeB);
+      int new_shapeB = ceil_div(shapeB, shapeA[i]);
+      shapeA[i] = new_shapeA;
+      shapeB    = new_shapeB;
+   }
+}
+```
+which progressively "keeps" the first `shapeB` elements from `shapeA` starting from the left. For example,
+* `(6,2) %  2 => (2,1)`
+* `(6,2) %  3 => (3,1)`
+* `(6,2) %  6 => (6,1)`
+* `(6,2) % 12 => (6,2)`
+* `(3,6,2,8) %  6 => (3,2,1,1)`
+* `(3,6,2,8) %  9 => (3,3,1,1)`
+* `(1,2,2,8) %  2 => (1,2,1,1)`
+* `(1,2,2,8) % 16 => (1,2,2,4)`
+
+Again, this operation must satisfy the divisibility condition to yield a sensible result. This is enforced by the `assert` in the above code and statically checked in CuTe when possible.
+
+Clearly, CuTe does not use arrays to store shapes or strides and the above code is for explication only. CuTe works with shapes and strides as `IntTuple`s and the implementation is expressed as algorithmic `fold`s which carefully account for static and dynamic integers.
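
For experimenting with the examples listed above, the two loops can be restated in a form that compiles on its own. This is a sketch under stated assumptions: `ceil_div` is defined locally, shapes are `std::vector<int>` values rather than `IntTuple`s, and the functions return a new shape instead of modifying their arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int ceil_div(int a, int b) { return (a + b - 1) / b; }

// "Divide out" the first `divisor` elements of the shape, left to right.
std::vector<int> shape_div(std::vector<int> shape, int divisor) {
  for (int& extent : shape) {
    // Divisibility condition.
    assert(extent % divisor == 0 || divisor % extent == 0);
    int new_extent = ceil_div(extent, divisor);
    int new_divisor = ceil_div(divisor, extent);
    extent = new_extent;
    divisor = new_divisor;
  }
  return shape;
}

// "Keep" only the first `keep` elements of the shape, left to right.
std::vector<int> shape_mod(std::vector<int> shape, int keep) {
  for (int& extent : shape) {
    // Divisibility condition.
    assert(extent % keep == 0 || keep % extent == 0);
    int new_extent = std::min(extent, keep);
    int new_keep = ceil_div(keep, extent);
    extent = new_extent;
    keep = new_keep;
  }
  return shape;
}
```

For instance, `shape_div({3, 6, 2, 8}, 9)` gives `{1, 2, 2, 8}` and `shape_mod({3, 6, 2, 8}, 9)` gives `{3, 3, 1, 1}`, as in the lists above.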
+
+#### Example 1 -- Reshape a layout into a matrix
+
+`20:2  o  (5,4):(4,1)`.
+
+This describes interpreting the layout `20:2`
+as a 5x4 matrix in a row-major order.
+
+1. ` = 20:2 o (5:4,4:1)`. Concatenation of sublayouts.
+
+2. ` = (20:2 o 5:4, 20:2 o 4:1)`. Left distributivity.
+
+    * `20:2 o 5:4  =>  5:8`. Trivial case.
+    * `20:2 o 4:1  =>  4:2`. Trivial case.
+
+3. ` = (5:8, 4:2)`.
+
+4. ` = (5,4):(8,2)`. Concatenation of sublayouts.
+
+#### Example 2 -- Reshape a layout into a matrix
+
+`(10,2):(16,4)  o  (5,4):(1,5)`
+
+This describes interpreting the layout `(10,2):(16,4)`
+as a 5x4 matrix in a col-major order.
+
+1. ` = (10,2):(16,4) o (5:1,4:5)`. Concatenation of sublayouts.
+
+2. ` = ((10,2):(16,4) o 5:1, (10,2):(16,4) o 4:5)`. Left distributivity.
+
+    * `(10,2):(16,4) o 5:1 => (5,1):(16,4)`. Mod out the shape `5`.
+    * `(10,2):(16,4) o 4:5 => (2,2):(80,4)`. Div out the stride `5`.
+
+3. ` = ((5,1):(16,4), (2,2):(80,4))`. Collect results.
+
+4. ` = (5:16, (2,2):(80,4))`. By-mode coalesce.
+
+5. ` = (5,(2,2)):(16,(80,4))`. Concatenation of sublayouts.
+
+We get exactly this result with CuTe
+if we use compile-time shapes and strides.
+The following C++ code prints `(_5,(_2,_2)):(_16,(_80,_4))`.
+
+```cpp
+Layout a = make_layout(make_shape (Int<10>{}, Int<2>{}), 
+                       make_stride(Int<16>{}, Int<4>{}));
+Layout b = make_layout(make_shape (Int< 5>{}, Int<4>{}), 
+                       make_stride(Int< 1>{}, Int<5>{}));
+Layout c = composition(a, b);
+print(c);
+```
+
+If we use dynamic integers, the following C++ code prints `((5,1),(2,2)):((16,4),(80,4))`.
+
+```cpp
+Layout a = make_layout(make_shape (10, 2), 
+                       make_stride(16, 4));
+Layout b = make_layout(make_shape ( 5, 4), 
+                       make_stride( 1, 5));
+Layout c = composition(a, b);
+print(c);
+```
+
+The results may _look_ different but are mathematically the same. The 1s in the shape don't affect the layout as a mathematical function from 1-D coordinates to integers or as a function from 2-D coordinates to integers. In the dynamic case, CuTe cannot coalesce the dynamic size-1 modes to "simplify" the layout due to the static rank and type of the tuples containing them.
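
The "div out the stride, then mod out the shape" procedure used in Example 2 can be sketched for a flattened `A` and an integral `B = s:d`. This is a plain-C++ model with hypothetical names (`Mode`, `compose_with_integral`), not CuTe's implementation. It assumes the divisibility condition holds and that `B` stays within the size of `A`, and, like the dynamic case above, it leaves size-1 modes in the result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Mode {
  int shape;
  int stride;
};

int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper: compute A o (s:d) mode by mode, left to right.
std::vector<Mode> compose_with_integral(const std::vector<Mode>& a, int s,
                                        int d) {
  std::vector<Mode> result;
  int rest_stride = d;  // what is still left to "divide out"
  int rest_shape = s;   // what is still left to "keep"
  for (const Mode& m : a) {
    // Step 1: divide out the stride of B from this mode.
    assert(m.shape % rest_stride == 0 || rest_stride % m.shape == 0);
    int divided_shape = ceil_div(m.shape, rest_stride);
    int new_stride = m.stride * std::min(rest_stride, m.shape);
    rest_stride = ceil_div(rest_stride, m.shape);
    // Step 2: keep only as many elements as the shape of B still needs.
    assert(divided_shape % rest_shape == 0 || rest_shape % divided_shape == 0);
    int kept_shape = std::min(divided_shape, rest_shape);
    rest_shape = ceil_div(rest_shape, divided_shape);
    result.push_back({kept_shape, new_stride});
  }
  return result;
}
```

For `A = (10,2):(16,4)`, composing with `5:1` gives `(5,1):(16,4)` and composing with `4:5` gives `(2,2):(80,4)`, the two intermediate results of Example 2.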
+
+### By-mode Composition
+
+Similar to by-mode `coalesce` and building up to a generic tiling operation, sometimes we do care about the shape of the `A` layout and would still like to apply `composition` to individual modes. For example, I have a 2-D `Layout` and would like some sublayout of the elements down the columns and another sublayout of elements across the rows.
+
+For this reason, `composition` also works when its second parameter -- the `B` -- is a `Tiler`. In general, a tiler is a layout or a tuple-of-layouts (note the generalization on `IntTuple`), which can be used as follows
+```cpp
+// (12,(4,8)):(59,(13,1))
+auto a = make_layout(make_shape (12,make_shape ( 4,8)),
+                     make_stride(59,make_stride(13,1)));
+// <3:4, 8:2>
+auto tiler = make_tile(Layout<_3,_4>{},  // Apply 3:4 to mode-0
+                       Layout<_8,_2>{}); // Apply 8:2 to mode-1                 
+
+// (_3,(2,4)):(236,(26,1))
+auto result = composition(a, tiler);
+// Identical to
+auto same_r = make_layout(composition(layout<0>(a), get<0>(tiler)),
+                          composition(layout<1>(a), get<1>(tiler)));
+```
+We often use the `<LayoutA, LayoutB, ...>` notation to distinguish `Tiler`s from the concatenation-of-sublayouts notation `(LayoutA, LayoutB, ...)` that we used previously.
+
+The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
+<p align="center">
+  <img src="../../images/cute/composition1.png" alt="composition1.png" height="250"/>
+</p>
+
+For convenience, CuTe also interprets a `Shape` as a tiler. A `Shape` is interpreted as a tuple-of-layouts-with-stride-1:
+```cpp
+// (12,(4,8)):(59,(13,1))
+auto a = make_layout(make_shape (12,make_shape ( 4,8)),
+                     make_stride(59,make_stride(13,1)));
+// (3, 8)
+auto tiler = make_shape(Int<3>{}, Int<8>{});
+// Equivalent to <3:1, 8:1>
+// auto tiler = make_tile(Layout<_3,_1>{},  // Apply 3:1 to mode-0
+//                        Layout<_8,_1>{}); // Apply 8:1 to mode-1
+
+// (_3,(4,2)):(59,(13,1))
+auto result = composition(a, tiler);     
+```
+where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
+<p align="center">
+  <img src="../../images/cute/composition2.png" alt="composition2.png" height="250"/>
+</p>
+
+## Composition Tilers
+
+In summary, a `Tiler` is one of the following objects.
+1. A `Layout`.
+2. A tuple of `Tiler`s.
+3. A `Shape`, which will be interpreted as a tiler of `Layout`s with stride-1.
+
+Any of the above can be used as the second argument in `composition`. With (1), we think of the `composition` as between two functions from integers to integers, no matter the ranks of the layouts. With (2) and (3), the `composition` is performed on each pair of corresponding modes of `A` and `B`, until case (1) is found.
+
+This allows composition to be applied by-mode to retrieve arbitrary sublayouts of specified modes of a tensor ("Give me the 3x5x8 subblock of this MxNxL tensor") but also allows entire tiles of data to be reshaped and reordered as if they were 1-D vectors ("Reorder this 8x16 block of data into a 32x4 block using this weird order of elements"). We will see the by-mode cases appear often when we are tiling for threadblocks in examples that follow. We will see 1-D reshaping and reordering when we want to apply arbitrary partitioning patterns for threads and values in MMAs in examples that follow. 
+
+## Complement
+
+Before getting to "product" and "divide," we need one more operation. We can think of `composition` as a layout `B` that is "selecting" certain coordinates from another layout `A`. But what about the coordinates that aren't "selected"? To implement generic tiling, we want to be able to select arbitrary elements -- the tile -- and to describe the layout of those tiles -- the leftovers, or the "rest."
+
+The `complement` of a layout attempts to find another layout that represents the "rest" -- the elements that aren't touched by the layout. 
+
+You can find many examples and checked post-conditions in [the `complement` unit test](../../../test/unit/cute/core/complement.cpp). The post-conditions include
+```cpp
+// @post cosize(make_layout(@a layout_a, @a result)) >= @a cosize_hi
+// @post cosize(@a result) >= round_up(@a cosize_hi, cosize(@a layout_a))
+// @post for all i, 1 <= i < size(@a result), 
+//         @a result(i-1) < @a result(i)
+// @post for all i, 1 <= i < size(@a result),
+//         for all j, 0 <= j < size(@a layout_a),
+//           @a result(i) != @a layout_a(j)
+Layout complement(LayoutA const& layout_a, Integral const& cosize_hi)
+```
+That is, the complement `R` of a layout `A` with respect to an integer `M` satisfies the following properties.
+1. The size (and cosize) of `R` is bounded by `M`.
+2. `R` is *ordered*.  That is, the strides of `R` are positive and increasing.  This means that `R` is unique.
+3. `A` and `R` have *disjoint* codomains. `R` attempts to "complete" the codomain of `A`.
+
+### Complement Examples
+
+`complement` is most effective on static shapes and strides, so consider all integers below to be static. Similar examples for dynamic shapes and strides can be found in the unit test.
+
+* `complement(4:1, 24)` is `6:4`. Note that `(4,6):(1,4)` has cosize `24`. The layout `4:1` is effectively repeated 6 times with `6:4`.
+
+* `complement(6:4, 24)` is `4:1`. Note that `(6,4):(4,1)` has cosize `24`. The "hole" in `6:4` is filled with `4:1`.
+
+* `complement((4,6):(1,4), 24)` is `1:0`. Nothing needs to be appended.
+
+* `complement(4:2, 24)` is `(2,3):(1,8)`. Note that `(4,(2,3)):(2,(1,8))` has cosize `24`. The "hole" in `4:2` is filled with `2:1` first, then everything is repeated 3 times with `3:8`.
+
+* `complement((2,4):(1,6), 24)` is `3:2`. Note that `((2,4),3):((1,6),2)` has cosize `24` and produces unique indices.
+
+* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices.
+
+<p align="center">
+  <img src="../../images/cute/complement1.png" alt="complement1.png" height="75"/>
+</p>
+As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."
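
One way to arrive at such a "layout of the repetition" is to sort the modes by stride, fill each gap between what is already covered and the next stride, and finally repeat everything up to the requested cosize. The following is a rough plain-C++ sketch of that idea, not CuTe's implementation: `Mode` and `complement_flat` are hypothetical names, the input is assumed to be flattened with positive strides, and each stride is assumed to be divisible by the extent covered so far.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Mode {
  int shape;
  int stride;
};

// Hypothetical helper: complement of a flattened layout with respect to
// cosize_hi, returned as a list of modes with increasing strides.
std::vector<Mode> complement_flat(std::vector<Mode> a, int cosize_hi) {
  std::sort(a.begin(), a.end(),
            [](const Mode& x, const Mode& y) { return x.stride < y.stride; });
  std::vector<Mode> result;
  int covered = 1;  // indices [0, covered) are reachable so far
  for (const Mode& m : a) {
    assert(m.stride > 0 && m.stride % covered == 0);
    int gap = m.stride / covered;
    if (gap > 1) {
      result.push_back({gap, covered});  // fill the "hole" before this mode
    }
    covered = m.shape * m.stride;
  }
  int repeats = (cosize_hi + covered - 1) / covered;
  if (repeats > 1) {
    result.push_back({repeats, covered});  // repeat everything
  }
  if (result.empty()) {
    result.push_back({1, 0});  // nothing needs to be appended
  }
  return result;
}
```

For the layout in the figure, `complement_flat({{2, 1}, {2, 6}}, 24)` produces the modes `3:2` and `2:12`, that is, `(3,2):(2,12)`.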
+
+## Division (Tiling)
+
+Finally, we can define the division of a `Layout` by another `Layout`. Functions that divide a layout into components are useful as a basis for tiling and partitioning layouts.
+
+In this section, we'll define `logical_divide(Layout, Layout)`, which again considers all `Layout`s as 1-D functions from integers to integers, and then use that definition to create multidimensional `Layout` divides.
+
+Informally, `logical_divide(A, B)` splits a layout `A` into two modes -- in the first mode are all elements pointed to by `B` and in the second mode are all elements not pointed to by `B`.
+
+Formally, this can be written as
+
+$A \oslash B := A \circ (B,B^*)$
+
+and implemented as
+```cpp
+template <class LShape, class LStride,
+          class TShape, class TStride>
+auto logical_divide(Layout<LShape,LStride> const& layout,
+                    Layout<TShape,TStride> const& tiler)
+{
+  return composition(layout, make_layout(tiler, complement(tiler, size(layout))));
+}
+```
+Note that this is defined only in terms of concatenation, composition, and complement.
+
+So what is that?
+
+> in the first mode are all elements pointed to by `B`
+
+This is clearly composition, `A o B`.
+
+> in the second mode are all elements not pointed to by `B`
+
+The elements NOT pointed to by `B` sounds like a complement, `B*`, up to the size of `A`. As we've seen above in the `complement` section, this can be described as the "layout of the repetition of `B`." If `B` is the "tiler", then `B*` is the layout of the tiles.
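
To make the "tile mode, then rest mode" reading concrete, the sketch below tabulates `A o (B,B*)` by brute force. It is plain C++ with hypothetical names (`FlatLayout`, `eval`, `divide_offsets`), and the complement is passed in by the caller rather than computed. As an illustration it assumes the identity layout `A = 24:1`, the tiler `B = 4:2`, and `B* = (2,3):(1,8)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened layout: shape[m]:stride[m] for each mode m.
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

// Evaluate a layout at a 1-D coordinate (colexicographical order).
int eval(const FlatLayout& l, int idx) {
  assert(l.shape.size() == l.stride.size());
  int offset = 0;
  for (std::size_t m = 0; m < l.shape.size(); ++m) {
    offset += (idx % l.shape[m]) * l.stride[m];
    idx /= l.shape[m];
  }
  return offset;
}

// Tabulate A o (tiler, rest) for every 1-D coordinate of (tiler, rest).
std::vector<int> divide_offsets(const FlatLayout& a, const FlatLayout& tiler,
                                const FlatLayout& rest) {
  FlatLayout cat = tiler;  // concatenation (tiler, rest)
  cat.shape.insert(cat.shape.end(), rest.shape.begin(), rest.shape.end());
  cat.stride.insert(cat.stride.end(), rest.stride.begin(), rest.stride.end());
  int n = 1;
  for (int s : cat.shape) n *= s;
  std::vector<int> offsets;
  offsets.reserve(n);
  for (int i = 0; i < n; ++i) {
    offsets.push_back(eval(a, eval(cat, i)));
  }
  return offsets;
}
```

With the assumed inputs, the first four offsets are `0, 2, 4, 6` (the tile itself), the next four are `1, 3, 5, 7` (the tile shifted by the first stride of `B*`), and the third group starts at `8`. Each consecutive group of four is one tile, and stepping between groups walks the "rest" mode.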
+
+### Logical Divide 1-D Example
+
+Consider tiling the 1-D layout `A = (4,2,3):(2,1,8)` with the tiler `B = 4:2`. Informally, this means that we have a 1-D vector of 24 elements in some storage order defined by `A` and we want to extract tiles of 4 elements strided by 2.
+
+This is computed in the three steps described in the implementation above.
+* Complement of `B = 4:2` under `size(A) = 24` is `B* = (2,3):(1,8)`.
+* Concatenation of `(B,B*) = (4,(2,3)):(2,(1,8))`.
+* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.
+
+<p align="center">
+  <img src="../../images/cute/divide1.png" alt="divide1.png" height="150"/>
+</p>
+
+The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A` shown by each of the colors. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
+
+### Logical Divide 2-D Example
+
+Using the `Tiler` concept defined above, this immediately generalizes to multidimensional tiling. The below example simply applies `logical_divide` by-mode to the cols and rows of a 2-D layout using a `Tiler`.
+
+Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` to which we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.
+
+<p align="center">
+  <img src="../../images/cute/divide2.png" alt="divide2.png" height="450"/>
+</p>
+
+The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A` shown by each of the colors. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or as simply a permutation on the rows and cols.
+
+Note that the first mode of each mode of the result is the sublayout `(3,(2,4)):(177,(13,2))` and is precisely the result we would have received if we had applied `composition` instead of `logical_divide`.
+
+### Zipped, Tiled, Flat Divides
+
+It's easy to see the tiles when they are highlighted in the images above, but working with them can still be awkward. How would you slice out the `3`rd tile or the `7`th tile or the `(1,2)`th tile so you could continue working on it?
+
+Enter the convenience flavors of `logical_divide`. Suppose we have a `Layout` and a `Tiler` of some shape, then each operation will apply `logical_divide`, but potentially rearrange the modes into more convenient forms.
+```text
+Layout Shape : (M, N, L, ...)
+Tiler Shape  : <TileM, TileN>
+
+logical_divide : ((TileM,RestM), (TileN,RestN), L, ...)
+zipped_divide  : ((TileM,TileN,...), (RestM,RestN,L,...))
+tiled_divide   : ((TileM,TileN,...), RestM, RestN, L, ...)
+flat_divide    : (TileM, TileN, ..., RestM, RestN, L, ...)
+```
+
+For example, the `zipped_divide` function applies `logical_divide`, and then gathers the "subtiles" into a single mode and the "rest" into a single mode.
+```cpp
+// A: shape is (9,32)
+auto layout_a = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
+                            make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
+// B: shape is (3,8)
+auto tiler = make_tile(Layout<_3,_3>{},           // Apply     3:3     to mode-0           
+                       Layout<Shape <_2,_4>,      // Apply (2,4):(1,8) to mode-1
+                              Stride<_1,_8>>{});
+
+// ((TileM,RestM), (TileN,RestN)) with shape ((3,3), (8,4))
+auto ld = logical_divide(layout_a, tiler);                   
+// ((TileM,TileN), (RestM,RestN)) with shape ((3,8), (3,4))
+auto zd = zipped_divide(layout_a, tiler);
+```
+Then, the offset to the `3`rd tile is `zd(0,3)`. The offset to the `7`th tile is `zd(0,7)`. The offset to the `(1,2)`th tile is `zd(0,make_coord(1,2))`. The tile itself always has layout `layout<0>(zd)`. Indeed, it is always the case that 
+
+`layout<0>(zipped_divide(a, b)) == composition(a, b)`.
+
+We note that `logical_divide` preserves the *semantics* of the modes while permuting the elements within those modes -- the `M`-mode of layout `A` is still the `M`-mode of the result and the `N`-mode of layout `A` is still the `N`-mode of the result.
+
+This is not the case with `zipped_divide`. The mode-0 in the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we still can plot the resulting layout as 2-D as shown below.
+
+<p align="center">
+  <img src="../../images/cute/divide3.png" alt="divide3.png" height="450"/>
+</p>
+
+We've kept each tile as its color in the previous images for clarity. Clearly, iterating across tiles is now equivalent to iterating across a row of this layout and iterating over elements within a tile is equivalent to iterating down a column of this layout. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data.
+
+## Product (Tiling)
+
+Finally, we can define the product of a Layout by another Layout. In this section, we'll define `logical_product(Layout, Layout)`, which again considers all `Layout`s as 1-D functions from integers to integers, and then use that definition to create multidimensional `Layout` products.
+
+Informally, `logical_product(A, B)` results in a two mode layout where the first mode is the layout `A` and the second mode is the layout `B` but with each element replaced by a "unique replication" of layout `A`.
+
+Formally, this can be written as
+
+$A \otimes B := (A, A^* \circ B)$
+
+and implemented in CuTe as
+```cpp
+template <class LShape, class LStride,
+          class TShape, class TStride>
+auto logical_product(Layout<LShape,LStride> const& layout,
+                     Layout<TShape,TStride> const& tiler)
+{
+  return make_layout(layout, composition(complement(layout, size(layout)*cosize(tiler)), tiler));
+}
+```
+Note that this is defined only in terms of concatenation, composition, and complement.
+
+So what is that?
+
+> where the first mode is the layout `A`
+
+This is clearly just a copy of `A`.
+
+> the second mode is the layout `B` but with each element replaced by a "unique replication" of layout `A`.
+
+The "unique replication" of layout `A` sounds like complement, `A*`, up to the cosize of `B`. As we've seen in the `complement` section, this can be described as the "layout of the repetition of `A`". If `A` is the "tile", then `A*` is the layout of repetitions that are available for `B`.
+
+### Logical Product 1-D Example
+
+Consider reproducing the 1-D layout `A = (2,2):(4,1)` according to `B = 6:1`. Informally, this means that we have a 1-D layout of 4 elements defined by `A` and we want to reproduce it 6 times.
+
+This is computed in the three steps described in the implementation above.
+* Complement of `A = (2,2):(4,1)` under `6*4 = 24` is `A* = (2,3):(2,8)`.
+* Composition of `A* = (2,3):(2,8)` with `B = 6:1` is then `(2,3):(2,8)`.
+* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`.
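
The result of the three steps above can be checked by evaluating it tile by tile. The following is a minimal sketch in plain C++, not CuTe code: `FlatLayout`, `offset`, and `product_tile` are hypothetical names, and the nested layout `((2,2),(2,3)):((4,1),(2,8))` is written in its flattened form, which yields the same offsets for 1-D coordinates.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ model of a flat layout (illustrative only, not the CuTe API):
// mode i has shape[i] elements spaced stride[i] apart.
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

// Map a 1-D coordinate to an offset: split idx so that the leftmost mode
// varies fastest, then take the inner product with the strides.
int offset(FlatLayout const& l, int idx) {
  int result = 0;
  for (std::size_t i = 0; i < l.shape.size(); ++i) {
    result += (idx % l.shape[i]) * l.stride[i];
    idx /= l.shape[i];
  }
  return result;
}

// Offsets of the 4 elements of tile j (0 <= j < 6) in the product
// ((2,2),(2,3)):((4,1),(2,8)), written here as (2,2,2,3):(4,1,2,8).
std::vector<int> product_tile(int j) {
  FlatLayout product{{2, 2, 2, 3}, {4, 1, 2, 8}};
  std::vector<int> out;
  for (int i = 0; i < 4; ++i) out.push_back(offset(product, i + 4 * j));
  return out;
}
```

In this sketch, `product_tile(0)` is `0, 4, 1, 5`, which is `A` itself, and `product_tile(1)` is `2, 6, 3, 7`, the same pattern shifted by `A*(1) = 2`.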
+
+<p align="center">
+  <img src="../../images/cute/product1.png" alt="product1.png" height="175"/>
+</p>
+
+The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A` and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
+
+Note that the result is identical to the result of the 1-D Logical Divide example.
+
+Of course, we can change the number and order of the tiles in the product by changing `B`.
+
+<p align="center">
+  <img src="../../images/cute/product2.png" alt="product2.png" height="175"/>
+</p>
+
+For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order.
+
+### Logical Product 2-D Example
+
+We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well.
+
+<p align="center">
+  <img src="../../images/cute/product2d.png" alt="product2d.png" height="250"/>
+</p>
+
+The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block that is tiled across a 3x4 col-major arrangement.
+
+The reason **this is not the recommended approach** is that the `tiler B` in the above expression is highly unintuitive. In fact, it requires perfect knowledge of the shape and strides of `A` in order to construct. We would like to express "Tile Layout `A` according to Layout `B`" in a way that makes `A` and `B` independent and is much more intuitive.
+
+#### Blocked and Raked Products
+
+The `blocked_product(LayoutA, LayoutB)` and `raked_product(LayoutA, LayoutB)` functions are rank-sensitive transformations built on top of the 1-D `logical_product`. They let us express the intuitive Layout products that we most often want.
+
+A key observation in the implementation of these functions is the set of compatibility post-conditions of `logical_product`:
+```
+// @post rank(result) == 2
+// @post compatible(layout_a, layout<0>(result))
+// @post compatible(layout_b, layout<1>(result))
+```
+
+Because `A` is always compatible with mode-0 of the result and `B` is always compatible with mode-1 of the result, if we made `A` and `B` the same rank then we could "reassociate" like-modes after the product. That is, the "col" mode in `A` could be combined with the "col" mode in `B` and the "row" mode in `A` could be combined with the "row" mode in `B`, etc.
+
+This is exactly what `blocked_product` and `raked_product` do and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.
+
+<p align="center">
+  <img src="../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
+</p>
+
+The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 col-major arrangement. Also note that `blocked_product` went ahead and `coalesced` mode-0 for us.
+
+Similarly, `raked_product` combines the modes slightly differently. Instead of the resulting "col" mode being constructed from the `A` "col" mode then the `B` "col" mode, the resulting "col" mode is constructed from the `B` "col" mode then the `A` "col" mode.
+
+<p align="center">
+  <img src="../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
+</p>
+
+This results in the "tile" `A` now being interleaved or "raked" with the "layout-of-tiles" `B` instead of appearing as blocks. Other references call this a "cyclic distribution."
+
+### Zipped and Tiled Products
+
+Similar to `zipped_divide` and `tiled_divide`, the `zipped_product` and `tiled_product` simply rearrange the modes that result from a by-mode `logical_product`.
+
+```text
+Layout Shape : (M, N, L, ...)
+Tiler Shape  : <TileM, TileN>
+
+logical_product : ((M,TileM), (N,TileN), L, ...)
+zipped_product  : ((M,N), (TileM,TileN,L,...))
+tiled_product   : ((M,N), TileM, TileN, L, ...)
+flat_product    : (M, N, TileM, TileN, L, ...)
+```
--- a/media/docs/cute/02_layout_operations.md
+++ b/media/docs/cute/02_layout_operations.md
@ -1,833 +0,0 @@
-# CuTe Layout Operations
-
-CuTe provides an "algebra of `Layout`s."
-`Layout`s can be combined and manipulated
-to construct more complicated `Layout`s.
-This includes tiling and partitioning `Layout`s across other `Layout`s.
-In this section, we explain some of these core operations in detail.
-
-## How do I print CuTe objects on host or device?
-
-CuTe comes with different ways to print CuTe objects.
-You can print human-readable text,
-or you can print LaTeX commands for generating
-a beautifully formatted and colored table
-describing the CuTe object.
-Both of these can be helpful for reasoning about or debugging
-layouts, copy atoms, or matrix multiply atoms
-(don't worry, we'll explain all of these things in this tutorial).
-
-CuTe's print functions work on either host or device.
-Note that on device, printing is expensive.
-Even just leaving print code in place on device,
-even if it is never called
-(e.g., printing in an `if` branch that is not taken at run time),
-may generate slower code.
-Thus, be sure to remove code that prints on device after debugging.
-
-The following code examples assume that you have a
-`using namespace cute;` statement in scope.
-
-### Printing human-readable text
-
-The `cute::print` function has overloads for almost all CuTe types, including Pointers, Layout, Shape, Stride, and Tensors.  When in doubt, try calling `print` on it.  You might also only want to print on thread 0 of each thread block, or block 0 of the grid.  The `thread0()` function returns true only for global thread 0 of the kernel.  A typical idiom is to print CuTe objects only on thread 0 of block 0.
-
-```c++
-if (thread0()) {
-  print(some_cute_object);
-}
-```
-
-Some algorithms do different things on different threads or blocks,
-so you might sometimes need to print on threads or blocks other than zero.
-The header file
-[`cute/util/debug.hpp`](../../../include/cute/util/debug.hpp),
-among other utilities,
-includes the function `bool thread(int tid, int bid)`
-that returns `true` if running on thread `tid` and block `bid`.
-
-Some CuTe types have special printing functions that use a different output format.
-For example, `print_layout` can display a rank-2 layout in a table
-(using plain text formatting).
-It has an overload taking a rank-2 matrix layout and a thread layout,
-that displays a table with the mapping between threads and values.
-
-### Printing LaTeX output
-
-The `cute::print_latex` function works like `cute::print`,
-but prints LaTeX commands that you can use
-to generate a nicely formatted and colored table.
-
-## Fundamental types
-
-### Layout and its components
-
-This directory includes
-[an overview of CuTe's fundamental types for describing layouts](./01_layout.md).
-
-#### Tuple
-
-CuTe starts with a Tuple, which is a finite ordered list of zero or more elements.
-In C++, we identify a Tuple with the
-[`cute::tuple` class](../../../include/cute/container/tuple.hpp).
-`cute::tuple` behaves like `std::tuple`, but it works on device or host,
-and it imposes restrictions on its template arguments for performance and simplicity.
-
-
-#### IntTuple
-
-CuTe then defines an IntTuple as either an integer, or a Tuple of IntTuple.
-This recursive definition lets us build arbitrarily nested layouts.
-In C++, we identify an IntTuple with [`IntTuple`](../../../include/cute/int_tuple.hpp),
-which is just an alias of `cute::tuple`.
-Any of the following are thus valid template arguments of IntTuple.
-
-1. "Run-time integers" (or "dynamic integers")
-    are just ordinary integral types like `int` or `size_t`.
-
-2. "Compile-time integers" include `std::integral_constant`
-    or subclasses of it that CuTe defines,
-    such as `Int<Value>` (see below).
-    These types all have in common
-    that the value is encoded in the type itself
-    (as a public `static constexpr value` member).
-    CuTe defines aliases `_1`, `_2`, `_3` etc.
-    to the types `Int<1>`, `Int<2>`, `Int<3>` etc.
-
-3. `IntTuple` with any valid template arguments.
-
-CuTe reuses IntTuple for many different things,
-including Shape, Stride, Step, and Coord
-(see [`include/cute/layout.hpp`](../../../include/cute/layout.hpp)).
-In C++, Shape, Stride, Step, and Coord are all aliases for IntTuple.
-
-### Layout
-
-A Layout is a tuple of (Shape, Stride).
-Semantically, it implements a mapping from
-a "logical" Shape-shaped (multidimensional) index,
-to a "physical" 1-D index into an array.
-Here is an example of a 2 x 3 array with static strides (3, 1).
-
-```c++
-Layout layout = make_layout(make_shape (_2{}, _3{}),
-                            make_stride(_3{}, _1{}));
-print_layout(layout);
-for (int i = 0; i < size(layout); ++i) {
-  print(layout(i));
-  print(", ");
-}
-print("\n");
-print(layout(1, 1));
-print("\n");
-```
-
-This code produces the following text output.
-
-```text
-(_2,_3):(_3,_1)
-      0   1   2
-    +---+---+---+
- 0  | 0 | 1 | 2 |
-    +---+---+---+
- 1  | 3 | 4 | 5 |
-    +---+---+---+
-0, 3, 1, 4, 2, 5,
-4
-```
-
-`print(layout(1, 1))` prints the mapping of
-the logical 2-D coordinate (1,1) to the 1-D index, which is 4.
-You can see that from the table,
-which shows the left logical index as the "row,"
-and the right logical index as the "column."
-
-### Underscore (`_`)
-
-An Underscore is a special type used for array slices.  The underscore punctuation `_` is a constant instance of Underscore.  It acts like `:` (the colon punctuation) in Python or Fortran array slices.  See [`include/cute/underscore.hpp`](../../../include/cute/underscore.hpp).
-
-### Tile
-
-"A Tile is not a Layout, it's a tuple of Layouts or Tiles or Underscores."
-See [`include/cute/tile.hpp`](../../../include/cute/tile.hpp).
-
-The algebraic layout operations discussed below are defined on `Layout`s, but `Tile` allows these operations to recurse and to be applied to sublayouts or particular modes of a given Layout. These are referred to as by-mode operations.
-
-See the section on "Logical Divide" to see an example of using `Tile` to extract portions of a row-mode and portions of a column-mode independently.
-
-## Layout definitions and operations
-
-### Layouts are functions from integers (logical 1-D coordinate) to integers (1-D index)
-
-The `for` loop in the above print example shows how CuTe identifies 1-D coordinates with a column-major layout of logical 2-D coordinates.  Iterating from `i = 0` to `size(layout)` (which is 6), and indexing into our layout with the single integer coordinate `i`, traverses the layout in column-major fashion, even though this is a row-major layout.  You can see this from the output of the `for` loop (0, 3, 1, 4, 2, 5).  CuTe calls this index `i` a "1-D coordinate," versus the "natural coordinate," which would be the logical 2-D coordinate.
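
The traversal order can be reproduced outside CuTe. The following is a minimal sketch in plain C++ (the names `Layout2D`, `offset_1d`, and `traverse` are hypothetical and are not part of CuTe) that splits a 1-D coordinate into a 2-D natural coordinate in column-major order and then applies the strides.

```cpp
#include <vector>

// Hypothetical stand-in for a rank-2 layout (rows,cols):(row_stride,col_stride).
struct Layout2D {
  int rows, cols;
  int row_stride, col_stride;
};

// Split the 1-D coordinate column-major (the row index varies fastest),
// then take the inner product of the natural coordinate with the strides.
int offset_1d(Layout2D const& l, int i) {
  int row = i % l.rows;
  int col = i / l.rows;
  return row * l.row_stride + col * l.col_stride;
}

// Offsets visited when i runs over 0, 1, ..., size - 1.
std::vector<int> traverse(Layout2D const& l) {
  std::vector<int> out;
  for (int i = 0; i < l.rows * l.cols; ++i) out.push_back(offset_1d(l, i));
  return out;
}
```

For the layout `(2,3):(3,1)` from the print example, `traverse(Layout2D{2, 3, 3, 1})` yields `0, 3, 1, 4, 2, 5`, matching the output of the `for` loop.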
-
-If you're familiar with the C++23 feature `mdspan`,
-this is an important difference between
-`mdspan` layout mappings and CuTe `Layout`s.
-`mdspan` layout mappings are *one way*:
-they always take a multidimensional logical coordinate,
-and they return an integer offset.
-Depending on the strides,
-the offset may skip over elements of the physical 1-D array.
-Thus, `mdspan`'s offset does NOT mean the same thing as
-the 1-D logical coordinate `i` in the `for` loop above.
-You can iterate correctly over any CuTe `Layout`
-by using the 1-D logical coordinate.
-`mdspan` doesn't have an idea of a 1-D logical coordinate.
-
-### Rank, depth, size, cosize
-
-*Rank*: the tuple size of the layout's shape.
-
-*Depth*: the depth of the layout's shape.  A single integer has depth 0.  A tuple has depth 1 + the max depth of its components.
-
-*Size*: Size of the shape; size of the domain of the function. This is the product of all extents in the layout's shape.
-
-*Cosize*: Size of the function's codomain (not necessarily the range); for a layout A, A(size(A) - 1) + 1.  (Here, we use size(A) - 1 as a 1-D logical coordinate input.)
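
As an illustration of the size and cosize definitions, here is a minimal sketch in plain C++ for flat layouts. It is not the CuTe implementation: `FlatLayout`, `offset`, `size_of`, and `cosize_of` are hypothetical names, and the sketch assumes a non-empty layout with non-negative strides.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ model of a flat layout (illustrative only, not the CuTe API).
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

// Evaluate the layout at a 1-D logical coordinate (leftmost mode fastest).
int offset(FlatLayout const& l, int idx) {
  int result = 0;
  for (std::size_t i = 0; i < l.shape.size(); ++i) {
    result += (idx % l.shape[i]) * l.stride[i];
    idx /= l.shape[i];
  }
  return result;
}

// Size: product of all extents, i.e. the number of valid 1-D coordinates.
int size_of(FlatLayout const& l) {
  int n = 1;
  for (int s : l.shape) n *= s;
  return n;
}

// Cosize as defined above: A(size(A) - 1) + 1.
int cosize_of(FlatLayout const& l) {
  return offset(l, size_of(l) - 1) + 1;
}
```

With these definitions, `(2,3):(3,1)` has size 6 and cosize 6, while `4:2` has size 4 and cosize 7 because its last element sits at offset 6.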
-
-### Layout compatibility
-
-We say that layouts A and B are *compatible* if their shapes are compatible.  Shape A is compatible with shape B if any natural coordinate of A is also a valid coordinate for B.
-
-### Flatten
-
-The `flatten` operation "un-nests" a potentially nested Layout.  For example,
-
-```c++
-Layout layout = Layout<Shape <Shape <_4, _3>, _1>,
-                     Stride<Stride<_3, _1>, _0>>{};
-Layout flat_layout = flatten(layout);
-```
-
-results in `flat_layout` having the following type
-
-```text
-Layout<Shape<_4, _3, _1>, Stride<_3, _1, _0>>
-```
-
-and
-
-```c++
-Layout layout = Layout<Shape <_4, Shape <_4,  _2>>,
-                     Stride<_4, Stride<_1, _16>>>{};
-Layout flat_layout = flatten(layout);
-```
-
-results in `flat_layout` having the following type
-
-```text
-Layout<Shape<_4, _4, _2>, Stride<_4, _1, _16>>
-```
-
-Hierarchical Layouts and flattening let us reinterpret tensors in place as matrices, matrices as vectors, vectors as matrices, etc.  This lets us implement arbitrary tensor contractions as batched matrix multiply, by combining the contraction modes into a single mode, and combining the A, B, C, and "batch" modes as needed to reach the desired form.
-
-### Coalesce
-
-The `coalesce` operation first flattens the layout, then combines all the modes that are possible to combine, starting with mode 0 (the leftmost mode) and moving right.  If all the modes can be combined, then this results in a 1-D layout expressing what array elements the original layout accesses.
-
-For example,
-
-```text
-layout: (_2,(_1,_6)):(_1,(_6,_2))
-coalesce(layout): _12:_1
-```
-
-What does it mean to "combine" modes?  In the above example, the flattened layout is (2, 1, 6) : (1, 6, 2).
-
-1. If we look at the leftmost two modes, this is just a vector of length 2 and stride 1.  The middle mode has extent 1, so the corresponding stride 6 would not be observed anyway.  This leaves us with (2, 6) : (1, 2).
-
-2. The intermediate result (2, 6) : (1, 2) is just a 2 x 6 column-major matrix, which can be coalesced into a vector of length 12 and stride 1.
-
-More formally, "combining all the modes" means a left fold, where the binary operation that combines two modes has three cases.
-
-1. If the leftmost layout is s1:d1, and the next layout is 1:d0, then combine into s1:d1.  This generalizes Step 1 above.  If a mode has extent 1, we can't observe its stride, so we can skip the mode.
-
-2. If the leftmost layout is 1:d1, and the next layout is s0:d0, then combine into s0:d0.  Again, if a mode has extent 1, we can't observe its stride, so we can skip the mode.
-
-3. If the leftmost layout is s1:d1, and the next layout is s0 : s1*d1, then combine into s0 * s1 : d1.  This generalizes Step 2 above.  One can call this "noticing a column-major layout sequence."
-
-That's it!  For example, the result of coalescing the row-major layout (2, 2) : (2, 1) is (2, 2) : (2, 1), the same layout, because none of the above three cases applies.
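
The left fold described above can be sketched in a few lines of plain C++. This is an illustration under stated assumptions rather than CuTe's implementation: the layout is assumed to be already flattened, `Mode` and `coalesce_flat` are hypothetical names, and returning `1:0` when every mode has extent 1 is a choice made for this sketch.

```cpp
#include <utility>
#include <vector>

// One mode of a flat layout as a (shape, stride) pair.
// Plain C++ model for illustration, not the CuTe API.
using Mode = std::pair<int, int>;

// Left fold over an already-flattened layout using the three cases above.
std::vector<Mode> coalesce_flat(std::vector<Mode> const& modes) {
  std::vector<Mode> out;
  for (Mode const& m : modes) {
    if (m.first == 1) continue;                   // cases 1 and 2: skip extent-1 modes
    if (out.empty()) { out.push_back(m); continue; }
    Mode& last = out.back();
    if (m.second == last.first * last.second) {   // case 3: s0 : s1*d1 follows s1 : d1
      last.first *= m.first;                      // combine into s0*s1 : d1
    } else {
      out.push_back(m);                           // no case applies: keep the mode
    }
  }
  if (out.empty()) out.push_back({1, 0});         // every mode had extent 1
  return out;
}
```

Applied to `(2,1,6):(1,6,2)` this sketch returns the single mode `12:1`, and applied to `(2,2):(2,1)` it returns the input unchanged.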
-
-### Complement
-
-#### Definition
-
-The complement B of a layout A with respect to an integer M satisfies the following properties.
-
-1. $A$ and $B$ are *disjoint*: $A(x) \neq B(x)$ for all $x \neq 0$ in the domain of $A$.
-
-2. B is *ordered*: $B(x-1) \lt B(x)$ for all $x$ in $\{1, 2, \dots, size(B) - 1\}$.
-
-3. B is *bounded* by M: $size(B) \geq M / size(A)$, and $cosize(B) \leq floor(M / cosize(A)) * cosize(A)$.
-
-Regarding disjointness: we need to specify $x \neq 0$ because CuTe layouts are linear.  That is, if the domain is nonempty, the range always contains zero.
-
-Regarding the ordered property: CuTe layouts are hierarchically strided, so this implies that if size(B) is nonzero, then the strides of B are all positive.
-
-#### Examples
-
-complement(4:1, 24) is 6:4.
-
-1. The result is disjoint of 4:1, so it must have a stride of at least 4 (since it includes 0, but must skip over 1, 2, 3).
-
-2. The size of the result is $\geq 24 / 4 = 6$.  (This plus Step (1) means that the cosize is at least 24.)
-
-3. The cosize of the result is $\leq (24 / 4) * 4 = 24$.  (This plus Step (2) means that the cosize is exactly 24.)
-
-4. The only (1-D) layout with size 6 and cosize 24 is 6:4.
-
-complement(6:4, 24) is 4:1.
-
-1. 4:1 is disjoint of 6:4, but so is s:d
-   for any s > 0 and d > 20.
-
-2. The size of the result is $\geq 24 / 6 = 4$.
-
-3. The cosize of the result is $\leq (24 / 21) * 21 = 21$.
-
-4. The stride cannot be greater than 20
-   (else (2) would contradict (3)),
-   so it must be less than 4.
-
-5. This leaves 4:1 by elimination.
-
-### Composition
-
-Layouts are functions, so composition of layouts is just composition of functions.  The composition $A \circ B$ means "apply the layout B first, then treat the result as a 1-D logical coordinate input to the layout A, and apply A to it."  Very often, this composition can be represented as another Layout.
-
-#### Rules for computing composition
-
-Both humans and CuTe compute composition using the following rules.
-
-1. $A \circ B$ has a shape that is compatible with B. In function composition, the rightmost function defines the domain. For `Layout`s this means that any valid coordinate for $B$ can also be used as a coordinate for $A \circ B$.
-
-2. Concatenation: A layout can be expressed as the concatenation of its sublayouts.  We denote concatenation with parentheses: $B = (B_0,B_1,...)$.  The CuTe function `make_layout`, when given zero or more `Layout`s, concatenates them.
-
-3. Composition is (left-)distributive with concatenation: $A \circ B = A \circ (B_0, B_1, ...) = (A \circ B_0, A \circ B_1, ...)$.
-
-4. "Base case": For layouts $A = a : b$ and $B = c : d$ with integral shape and stride, $A \circ B = R = c : (b * d)$.
-
-5. By-mode composition: Let $\langle B, C \rangle$ (angle brackets, not parentheses)
-   denote a tuple of two layouts B and C, not their concatenation.  Let $A = (A_0, A_1)$.
-   Then, $A \circ \langle B, C \rangle = (A_0, A_1) \circ \langle B, C \rangle = (A_0 \circ B, A_1 \circ C)$.
-   This allows the application of composition independently to sublayouts of $A$.
-
-#### Examples: Reshape a vector into a matrix
-
-This section gives two composition examples.  Both start with a vector with layout $20:2$ (that is, the vector has 20 elements, and the stride between each is 2).  They compose this vector with a 4 x 5 matrix layout.  This effectively "reshapes" the vector in place into a matrix.
-
-##### Example 1
-
-$20:2 \circ (4,5) : (1,4)$.
-
-This describes interpreting the vector $20:2$
-as a 4 x 5 column-major matrix.
-
-The resulting layout has shape $(4,5)$,
-because in function composition,
-the rightmost function defines the domain.
-What are the strides?
-
-1. A layout can be expressed as the concatenation of its sublayouts,
-   so $(4,5) : (1,4)$ is $(4:1, 5:4)$.
-
-2. Composition is distributive, so
-   $20:2 \circ (4:1, 5:4)$ is $(20:2 \circ 4:1, 20:2 \circ 5:4)$.
-
-3. $20:2 \circ 4:1$ has shape 4 (rightmost function defines the domain)
-   and stride $2 = 2 \cdot 1$.
-
-4. $20:2 \circ 5:4$ has shape 5 and stride $8 = 2 \cdot 4$.
-
-5. Result: (4:2, 5:8), which by concatenation is (4,5) : (2,8).
-
-##### Example 2
-
-$20:2 \circ (4,5) : (5,1)$.
-
-This describes interpreting the vector 20:2
-as a 4 x 5 row-major matrix.
-
-The resulting layout has shape $(4,5)$, just as before.  What are the strides?
-
-1. By deconcatenation, $(4,5) : (5,1)$ is $(4:5, 5:1)$.
-
-2. Composition is distributive, so $20:2 \circ (4:5, 5:1)$ is $(20:2 \circ 4:5, 20:2 \circ 5:1)$.
-
-3. $20:2 \circ 4:5$ has shape $4$ and stride $10 = 2 \cdot 5$.
-
-4. $20:2 \circ 5:1$ has shape $5$ and stride $2 = 2 \cdot 1$.
-
-5. Result: (4:10, 5:2), which by concatenation is (4,5) : (10,2).
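
Both results can be checked against the definition of function composition by brute force. The following is a minimal sketch in plain C++, not CuTe code: `FlatLayout`, `offset`, `compose_at`, and `is_composition` are hypothetical names, and the sketch assumes that every offset produced by `B` is a valid 1-D coordinate of `A`.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ model of a flat layout (illustrative only, not the CuTe API).
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

// Evaluate the layout at a 1-D logical coordinate (leftmost mode fastest).
int offset(FlatLayout const& l, int idx) {
  int result = 0;
  for (std::size_t i = 0; i < l.shape.size(); ++i) {
    result += (idx % l.shape[i]) * l.stride[i];
    idx /= l.shape[i];
  }
  return result;
}

// (A o B)(idx) by definition: apply B first, then feed the result to A.
int compose_at(FlatLayout const& a, FlatLayout const& b, int idx) {
  return offset(a, offset(b, idx));
}

// True if r agrees with A o B on every 1-D coordinate in the domain of B.
bool is_composition(FlatLayout const& r, FlatLayout const& a, FlatLayout const& b) {
  int size_b = 1;
  for (int s : b.shape) size_b *= s;
  for (int idx = 0; idx < size_b; ++idx) {
    if (offset(r, idx) != compose_at(a, b, idx)) return false;
  }
  return true;
}
```

With `a = FlatLayout{{20}, {2}}`, the check passes for `(4,5):(2,8)` against `(4,5):(1,4)` and for `(4,5):(10,2)` against `(4,5):(5,1)`, in agreement with Examples 1 and 2.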
-
-#### Example: Reshape a matrix into another matrix
-
-The composition $((20,2):(16,4) \circ (4,5):(1,4))$
-expresses reshaping the matrix with layout (20,2):(16,4),
-into a 4 x 5 matrix in a column-major way.
-
-1. By deconcatenation, $(4,5) : (1,4)$ is $(4:1, 5:4)$.
-
-2. Composition is distributive, so $(20,2):(16,4) \circ (4:1, 5:4)$ is $((20,2):(16,4) \circ 4:1, (20,2):(16,4) \circ 5:4)$.
-
-3. $(20,2):(16,4) \circ 4:1$ has shape $4$ and stride $16$.  (4:1 expresses picking the first 4 consecutive elements of (20,2):(16,4).  These elements run down the 0th column (leftmost mode) of the layout, whose stride is 16.)
-
-4. $(20,2):(16,4) \circ 5:4$ has shape $5$ and stride $64 = 4 \cdot 16$.
-
-5. Result: $(4:16, 5:64)$, which by concatenation is $(4,5) : (16,64)$.
-
-We get exactly this result with CuTe
-if we use compile-time shapes and strides.
-The following C++ code prints `(_4,_5):(_16,_64)`.
-
-```c++
-using namespace cute;
-auto a = make_layout(make_shape(Int<20>{}, _2{}), make_stride(_16{}, _4{}));
-auto b = make_layout(make_shape(     _4{}, _5{}), make_stride( _1{}, _4{}));
-auto c = composition(a, b);
-printf("\n");
-print(c);
-```
-
-Results may _look_ different (but are the same mathematically)
-if we use run-time integers.
-The following C++ code prints `((4,1),(5,1)):((16,4),(64,4))`.
-
-```c++
-using namespace cute;
-auto a = make_layout(make_shape(20, 2), make_stride(16, 4));
-auto b = make_layout(make_shape( 4, 5), make_stride( 1, 4));
-auto c = composition(a, b);
-printf("\n");
-print(c);
-```
-
-((4,1),(5,1)):((16,4),(64,4)) is effectively the same layout
-as (4,5) : (16,64), because the 1s in the shape don't affect the layout
-(as a mathematical function from one integer to one integer).
-CuTe chooses not to simplify layout computations
-with run-time values in them as much as it could,
-because simplifications involving run-time values have a run-time cost.
-
-### Product
-
-CuTe includes four different kinds of layout products.
-
-1. `logical_product`
-
-2. `blocked_product`
-
-3. `raked_product`
-
-4. `tiled_product`
-
-`logical_product(A, B)` results in a layout where each element of layout B
-has been replaced by a "copy" of layout A.
-The other three products offer variations of this idea.
-
-#### Example: Tiled matrix
-
-Suppose that I want to make a matrix consisting of 3 x 4 tiles
-in a row-major arrangement,
-where each tile is a 2 x 2 column-major matrix.
-
-The Layout of each tile (`tile`) has Shape (2,2) and Stride (1,2).
-
-The Layout of the "matrix of tiles" (`matrix_of_tiles`)
-has Shape (3,4) and Stride (4,1).
-
-##### Blocked product: the intuitive tiling
-
-If I were to deduce by hand what the layout of the tiled matrix should be,
-it would look like this.
-
-|       | (0,0) | (1,0) | (0,1) | (1,1) | (0,2) | (1,2) | (0,3) | (1,3) |
-| ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   |
-| (0,0) |  0    |  2    |  4    |  6    |  8    | 10    | 12    | 14    |
-| (1,0) |  1    |  3    |  5    |  7    |  9    | 11    | 13    | 15    |
-| (0,1) | 16    | 18    | 20    | 22    | 24    | 26    | 28    | 30    |
-| (1,1) | 17    | 19    | 21    | 23    | 25    | 27    | 29    | 31    |
-| (0,2) | 32    | 34    | 36    | 38    | 40    | 42    | 44    | 46    |
-| (1,2) | 33    | 35    | 37    | 39    | 41    | 43    | 45    | 47    |
-
-The row and column labels use the equivalence of 1-D logical coordinates and 2-D column-major coordinates.  The left index in each pair is the row resp. column coordinate of the tile, while the right index in each pair is the row resp. column coordinate of the matrix-of-tiles.  The resulting layout has Shape ((2, 3), (2, 4)), and Stride ((1, 16), (2, 4)), and the second mode can be coalesced.  The Shape ((2, 3), (2, 4)) is hierarchical, but it is still rank-2 and can be drawn in 2D as above.  Note how the row mode of the tile remains part of the row mode of the product, and the column mode of the tile remains a column mode of the product.
-
-The above layout is what `blocked_product(tile, matrix_of_tiles)` produces.
-A critical use case for blocked product is "tiling" an "atom"
-(some tile that relates to a hardware feature) over a matrix.
-
-```c++
-Layout tile            = Layout<Shape <_2,_2>,
-                                Stride<_1,_2>>{};
-Layout matrix_of_tiles = Layout<Shape <_3,_4>,
-                                Stride<_4,_1>>{};
-
-print_layout(blocked_product(tile, matrix_of_tiles));
-```
-
-##### Logical product
-
-The logical product `logical_product(tile, matrix_of_tiles)`
-results in Shape ((2, 2), (3, 4)) and Stride ((1, 2), (16, 4)).
-
-|       | (0,0) | (1,0) | (2,0) | (0,1) | (1,1) | (2,1) | (0,2) | (1,2) | (2,2) | (0,3) | (1,3) | (2,3) |
-| ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   |
-| (0,0) |  0    | 16    | 32    |  4    | 20    | 36    |  8    | 24    | 40    | 12    | 28    | 44    |
-| (1,0) |  1    | 17    | 33    |  5    | 21    | 37    |  9    | 25    | 41    | 13    | 29    | 45    |
-| (0,1) |  2    | 18    | 34    |  6    | 22    | 38    | 10    | 26    | 42    | 14    | 30    | 46    |
-| (1,1) |  3    | 19    | 35    |  7    | 23    | 39    | 11    | 27    | 43    | 15    | 31    | 47    |
-
-Note how the tile appears in the leftmost column and is reproduced
-in each column in the same order as the matrix-of-tiles. That is,
-the tile can be indexed through the first mode of the result and the
-matrix-of-tiles can be indexed through the second mode.
-
-```c++
-Layout tile            = Layout<Shape <_2,_2>,
-                                Stride<_1,_2>>{};
-Layout matrix_of_tiles = Layout<Shape <_3,_4>,
-                                Stride<_4,_1>>{};
-
-print_layout(logical_product(tile, matrix_of_tiles));
-```
-
-##### Raked product
-
-The raked product `raked_product(tile, matrix_of_tiles)` results in
-Shape ((3, 2), (4, 2)) and Stride ((16, 1), (4, 2)).
-
-|       | (0,0) | (1,0) | (2,0) | (3,0) | (0,1) | (1,1) | (2,1) | (3,1) |
-| ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   |
-| (0,0) |  0    |  4    |  8    | 12    |  2    |  6    | 10    | 14    |
-| (1,0) | 16    | 20    | 24    | 28    | 18    | 22    | 26    | 30    |
-| (2,0) | 32    | 36    | 40    | 44    | 34    | 38    | 42    | 46    |
-| (0,1) |  1    |  5    |  9    | 13    |  3    |  7    | 11    | 15    |
-| (1,1) | 17    | 21    | 25    | 29    | 19    | 23    | 27    | 31    |
-| (2,1) | 33    | 37    | 41    | 45    | 35    | 39    | 43    | 47    |
-
-The tile is now interleaved or "raked" with the other 3x4 matrix-of-tiles
-instead of appearing as blocks. Other references call this a "cyclic
-distribution."
-
-This might look familiar if you have ever used ScaLAPACK.
-It expresses a 2-D block cyclic distribution of a 6 x 8 matrix
-over 4 processes in a 2 x 2 "process grid."  See
-["The Two-dimensional Block-Cyclic Distribution"](https://netlib.org/scalapack/slug/node75.html#sec2dbcd)
-and
-["Local Storage Scheme and Block-Cyclic Mapping"](https://netlib.org/scalapack/slug/node76.html#seclocalstorage)
-in the ScaLAPACK Users' Guide.
-
-In general, `logical_product` and these variations can produce any interleaving,
-including blocked, cyclic, by-mode blocked/cyclic, and intermediate interleavings
-that don't have common names.
-
-```c++
-Layout tile            = Layout<Shape <_2,_2>,
-                                Stride<_1,_2>>{};
-Layout matrix_of_tiles = Layout<Shape <_3,_4>,
-                                Stride<_4,_1>>{};
-
-print_layout(raked_product(tile, matrix_of_tiles));
-```
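
To connect the shapes and strides quoted above with the blocked and raked tables, the two layouts can be evaluated by hand. The following is a minimal sketch in plain C++ that hard-codes the stated results `((2,3),(2,4)):((1,16),(2,4))` and `((3,2),(4,2)):((16,1),(4,2))`. The function names are hypothetical, and the code does not call CuTe.

```cpp
// Offset of element (row, col), 0 <= row < 6 and 0 <= col < 8, in the
// blocked product ((2,3),(2,4)):((1,16),(2,4)).
int blocked_offset(int row, int col) {
  int tile_row = row % 2, block_row = row / 2;  // row mode (2,3):(1,16)
  int tile_col = col % 2, block_col = col / 2;  // column mode (2,4):(2,4)
  return tile_row * 1 + block_row * 16 + tile_col * 2 + block_col * 4;
}

// Offset of element (row, col) in the raked product
// ((3,2),(4,2)):((16,1),(4,2)). The tile index is now the slow part of
// each mode, so elements of one tile are interleaved with the other tiles.
int raked_offset(int row, int col) {
  int block_row = row % 3, tile_row = row / 3;  // row mode (3,2):(16,1)
  int block_col = col % 4, tile_col = col / 4;  // column mode (4,2):(4,2)
  return block_row * 16 + tile_row * 1 + block_col * 4 + tile_col * 2;
}
```

For example, `blocked_offset(2, 0)` and `raked_offset(1, 0)` are both 16, the first element of the second tile row, but that element sits in row 2 of the blocked table and in row 1 of the raked table.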
-
-### Division
-
-The previous section covered layout products,
-that reproduce one layout over another.
-This section covers layout *division*.
-Functions that divide a layout into components are useful
-as a basis for tiling and partitioning layouts.
-
-For example, consider folding a vector into a matrix.
-We could imagine an operation, called `logical_divide`,
-
-```c++
-Layout vec = Layout<_16,_3>{};           //  16 : 3
-Layout col = Layout< _4,_1>{};           //   4 : 1
-Layout mat = logical_divide(vec, col);   // (4,4) : (3,12)
-```
-
-that "takes" the first 4 elements of the vector into the first mode
-and leaves the "rest" in the second mode. This is a column-major matrix
-view of the data in `vec`.
-What if we want a row-major matrix view?
-
-```c++
-Layout vec = Layout<_16,_3>{};           //  16 : 3
-Layout col = Layout< _4,_4>{};           //   4 : 4
-Layout mat = logical_divide(vec, col);   // (4,4) : (12,3)
-```
-
-Now, every fourth element of the vector is in the first mode and
-the "rest" are in the second mode.
-Multidimensional, hierarchical indices let us extend this operation
-to any layout that "divides" the vector.
-
-```c++
-Layout vec = Layout<_16,_3>{};           //  16 : 3
-Layout col = Layout< _4,_2>{};           //   4 : 2
-Layout mat = logical_divide(vec, col);   // (4,(2,2)) : (6,(3,24))
-```
-
-```c++
-Layout vec = Layout<_16,_3>{};           //  16 : 3
-Layout col = Layout<Shape <_2,_2>,
-                    Stride<_4,_1>>{};    // (2,2) : (4,1)
-Layout mat = logical_divide(vec, col);   // ((2,2),(2,2)) : ((12,3),(6,24))
-```
-
-All of the above examples produce a 4x4 matrix
-that can be indexed and treated like a normal 4x4 matrix,
-but each has a different underlying layout.
-Thus, our algorithms can be written using logical coordinates,
-without needing to address the detailed indexing that each layout requires.
-
-CuTe includes 3 different kinds of layout division operations.
-
-1. `logical_divide`
-
-2. `zipped_divide`
-
-3. `tiled_divide`
-
-We will summarize these in the sections that follow.
-
-#### Logical divide
-
-##### Example worked in detail
-
-This section will work the following logical divide example in detail.
-
-```c++
-Layout a = make_layout(24, 2);
-Layout b = make_layout( 4, 2);
-Layout c = logical_divide(a, b);
-```
-
-Logical divide produces a rank-2 `Layout`,
-where mode 0 (the leftmost mode) corresponds to the divisor `b`,
-and mode 1 (the rightmost mode) corresponds to the "remainder."
-Intuitively, dividing 24 elements into groups of 4 leaves 6 groups,
-so we know that mode 1 has 6 elements.
-We just don't know its shape yet.
-
-CuTe defines `logical_divide(a, b)` as
-`composition(a, make_layout(b, complement(b, size(a))))`.
-Here, `size(a)` is 24.
-What is `complement(b, 24)`?
-Intuitively, it means "the remainder,"
-what's left over after applying `b` to 0, 1, 2, $\dots$, 23.
-
-The layout 4:2 means "take 4 elements at even-numbered indices."
-The following table overlays the range of 4:2
-atop the complement's codomain 0, 1, $\dots$, 23.
-
-| Range of 4:2  | 0     |       | 2     |       | 4     |       | 6     |     |     |     |         |     |
-| ---           | ---   | ---   | ---   | ---   | ---   | ---   | ---   | --- | --- | --- | ---     | --- |
-| Codomain      | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7   | 8   | 9   | $\dots$ | 23  |
-
-Layouts are linear, so their range must include zero.
-The complement of 4:2 with respect to 24 is thus a layout whose range
-
-* includes zero;
-
-* does not include any other elements of the range of 4:2
-    (i.e., satisfies the disjoint property; see above); and
-
-* includes as much of 0, 1, $\dots$, 23 as possible
-    (so that it forms the "remainder" of 4:2 with respect to 24).
-
-Intuitively, the range of the complement must look like this:
-0, 1, 8, 9, 16, 17.
-The resulting layout is ordered.
-It has size 6 and cosize 18,
-so it satisfies the bounded property (see above).
-This is the layout (2, 3) : (1, 8).
-(Going from this intuitive sense of the complement
-to knowing how to compute it directly
-is out of scope for this part of the tutorial.)
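-
-To check the claimed range, size, and cosize by hand,
-the layout (2, 3) : (1, 8) can be written out as a function.
-The function name $f$ and the coordinate names $i$ and $j$ are introduced here only for illustration.
-
-$$
-\begin{aligned}
-f(i, j) &= 1 \cdot i + 8 \cdot j, \qquad i \in \{0, 1\},\ j \in \{0, 1, 2\} \\
-\text{range of } f &= \{0, 1, 8, 9, 16, 17\} \\
-\text{size} &= 2 \cdot 3 = 6 \\
-\text{cosize} &= f(1, 2) + 1 = 17 + 1 = 18
-\end{aligned}
-$$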
-
-The following table shows 4:2 with its complement (2, 3) : (1, 8).
-
-| Range of 4:2        | 0     |       | 2     |       | 4     |       | 6     |     |     |     |     |     |     |     |     |     |     |     |         |     |
-| ---                 | ---   | ---   | ---   | ---   | ---   | ---   | ---   | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---     | --- |
-| Codomain            | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  | 16  | 17  | $\dots$ | 23  |
-| Range of complement | 0     | 1     |       |       |       |       |       |     | 8   | 9   |     |     |     |     |     |     | 16  | 17  |         |     |
-
-Now we know that `logical_divide`(24:2, 4:2) is
-`composition`(24:2, `make_layout`(4:2, (2,3):(1,8))).
-The composition of two layouts has the shape of the second (rightmost) layout,
-so the resulting shape is (4, (2, 3)).
-We see that the leftmost mode 4 corresponds to the divisor 4:2,
-and the rightmost mode (2, 3) describes what's "left over"
-from the original shape 24.
-
-What are the strides?
-We can start from the leftmost mode.
-4:2 takes every other element (the even-numbered elements) of 24:2.
-Composing a stride-2 layout with a stride-2 layout multiplies the strides,
-so the resulting stride is 4.
-Similarly, the stride 2 of 24:2
-doubles the two strides (1 and 8) of the rightmost mode to 2 and 16.
-The resulting layout is (4, (2, 3)) : (4, (2, 16)).
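-
-The stride argument above can be written out as a short calculation.
-This is a sketch in which the coordinates $i$, $j$, $k$ and the name $b'$
-(all introduced only for illustration) index the modes of shape 4, 2, and 3
-and denote the divisor concatenated with its complement.
-
-$$
-\begin{aligned}
-b'(i, (j, k)) &= 2i + j + 8k && \text{where } b' = \big(4, (2, 3)\big) : \big(2, (1, 8)\big) \\
-a(x) &= 2x && \text{where } a = 24 : 2 \\
-c(i, (j, k)) &= a\big(b'(i, (j, k))\big) = 4i + 2j + 16k
-\end{aligned}
-$$
-
-Reading off the coefficients gives the strides (4, (2, 16)).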
-
-##### Tiling example
-
-Suppose we have the 6 x 8 matrix from the Raked Product section
-and want to "collect" the `tile`, turning the Raked Product into
-the Blocked Product.
-
-To do this, we would like to gather two elements from the column
-and leave the rest, then gather two elements from the row and leave the rest.
-Thus, we want to apply `logical_divide` independently to the rows and cols
-in order to retrieve the appropriate elements.
-
-In code, we copy the Layout from the result of the Raked Product section, then
-specify the elements in the rows and cols we would like to gather.
-
-```c++
-Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
-                           Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
-Tile   subtile    = make_tile(Layout<_2,_3>{},    // Gather elements 2 : 3 from mode 0
-                              Layout<_2,_4>{});   // Gather elements 2 : 4 from mode 1
-
-print_layout(logical_divide(raked_prod, subtile));
-```
-
-Indeed, this does produce the result from the Blocked Product section.
-
-|       | (0,0) | (1,0) | (0,1) | (1,1) | (0,2) | (1,2) | (0,3) | (1,3) |
-| ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   |
-| (0,0) |  0    |  2    |  4    |  6    |  8    | 10    | 12    | 14    |
-| (1,0) |  1    |  3    |  5    |  7    |  9    | 11    | 13    | 15    |
-| (0,1) | 16    | 18    | 20    | 22    | 24    | 26    | 28    | 30    |
-| (1,1) | 17    | 19    | 21    | 23    | 25    | 27    | 29    | 31    |
-| (0,2) | 32    | 34    | 36    | 38    | 40    | 42    | 44    | 46    |
-| (1,2) | 33    | 35    | 37    | 39    | 41    | 43    | 45    | 47    |
-
-Of course, any other rearrangement of the rows and cols is also valid.
-
-#### Zipped divide
-
-The `zipped_divide` function applies `logical_divide`, and then gathers the
-"subtiles" into a single mode and the "rest" into a single mode.
-
-For example, if we apply `zipped_divide` instead of `logical_divide` in the example above,
-
-```c++
-Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
-                           Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
-Tile   subtile    = make_tile(Layout<_2,_3>{},    // Gather elements 2 : 3 from mode 0
-                              Layout<_2,_4>{});   // Gather elements 2 : 4 from mode 1
-
-print_layout(zipped_divide(raked_prod, subtile));
-```
-
-then we get the result
-
-|       | (0,0) | (1,0) | (2,0) | (0,1) | (1,1) | (2,1) | (0,2) | (1,2) | (2,2) | (0,3) | (1,3) | (2,3) |
-| ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   | ---   |
-| (0,0) |  0    | 16    | 32    |  4    | 20    | 36    |  8    | 24    | 40    | 12    | 28    | 44    |
-| (1,0) |  1    | 17    | 33    |  5    | 21    | 37    |  9    | 25    | 41    | 13    | 29    | 45    |
-| (0,1) |  2    | 18    | 34    |  6    | 22    | 38    | 10    | 26    | 42    | 14    | 30    | 46    |
-| (1,1) |  3    | 19    | 35    |  7    | 23    | 39    | 11    | 27    | 43    | 15    | 31    | 47    |
-
-Note that this is the same layout as the result in the Logical Product section.
-That is, the first mode is our original tile (and can be interpreted as a 2x2 matrix itself)
-and the second mode is its logical layout within the raked layout.
-
-#### More Examples of Divide
-
-For brevity, shapes can be used with `logical_divide` and `zipped_divide` to quickly split and tile modes of a tensor. For example, this C++ code
-
-```c++
-Layout layout     = Layout<Shape <_12, _32,_6>,
-                           Stride< _1,_128,_0>>{};
-Shape  tile_shape = make_shape(_4{},_8{});
-Layout logical_divided_tile = logical_divide(layout, tile_shape);
-Layout zipped_divided_tile  =  zipped_divide(layout, tile_shape);
-
-print("full_layout          :  "); print(layout);               print("\n");
-print("tile_shape           :  "); print(tile_shape);           print("\n");
-print("logical_divided_tile :  "); print(logical_divided_tile); print("\n");
-print("zipped_divided_tile  :  "); print(zipped_divided_tile);  print("\n\n");
-```
-
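-produces the following output when we vary `layout`: the first block is for the layout in the code above, and the second is for a layout whose second mode is hierarchical.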
-
-```text
-full_layout          :  (_12,_32,_6):(_1,_128,_0)
-tile_shape           :  (_4,_8)
-logical_divided_tile :  ((_4,_3),(_8,_4),_6):((_1,_4),(_128,_1024),_0)
-zipped_divided_tile  :  ((_4,_8),(_3,_4,_6)):((_1,_128),(_4,_1024,_0))
-
-full_layout          :  (_12,(_4,_8),_6):(_1,(_32,_512),_0)
-tile_shape           :  (_4,_8)
-logical_divided_tile :  ((_4,_3),((_4,_2),_4),_6):((_1,_4),((_32,_512),_1024),_0)
-zipped_divided_tile  :  ((_4,(_4,_2)),(_3,_4,_6)):((_1,(_32,_512)),(_4,_1024,_0))
-```
-
-This code
-
-```c++
-Layout layout = make_layout(Shape<_8,_8>{},
-                            Stride<_8,_1>{});
-auto   tile = make_tile(make_layout(Shape<_4>{}),
-                        make_layout(Shape<_2>{}));
-print("layout: ");
-print_layout(layout);
-print("\n");
-print("tile: ");
-print(tile);
-print("\n");
-print("logical_divide: ");
-print_layout(logical_divide(layout, tile));
-print("zipped_divide: ");
-print_layout(zipped_divide(layout, tile));
-```
-
-results in the following layouts.
-
-<p align="center">
-  <img src="../../images/cute/logical_divide-and-zipped_divide.png" alt="logical_divide-and-zipped_divide" height="400"/>
-</p>
-
-This code
-
-```c++
-Layout layout = make_layout(Shape<_8,_8>{},
-                            Stride<_8,_1>{});
-auto   tile = make_tile(make_layout(Shape<_2>{}),
-                        make_layout(Shape<_4>{}));
-print("layout: ");
-print_layout(layout);
-print("\n");
-print("tile: ");
-print(tile);
-print("\n");
-print("logical_divide: ");
-print_layout(logical_divide(layout, tile));
-print("zipped_divide: ");
-print_layout(zipped_divide(layout, tile));
-```
-
-results in the following layouts.
-
-<p align="center">
-  <img src="../../images/cute/logical_divide-and-zipped_divide-2.png" alt="logical_divide-and-zipped_divide-2" height="400"/>
-</p>
-
-#### Tiled divide
-
-The `tiled_divide` function works like `zipped_divide`,
-except that it unpacks the second mode. This is useful when a `Tile` describes all of the elements for a particular operation and you want to gather those elements together while retaining the logical shape of the tiles within the original layout. That is,
-
-```text
-Layout Shape : (M, N, L, ...)
-Tile Shape   : <M', N'>
-Tiled Result : ((M', N'), m, n, L, ...)
-```
-
-where `m` is `M / M'` and `n` is `N / N'`.
-We can consider `m` as the "number of `Tile`s in `M`" and `n` as the "number of `Tile`s in `N`". This style of operation is common when applying MMA Atoms and Copy Atoms.
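-
-As a small worked instance of the shape rule above (a sketch of the arithmetic, not output copied from CuTe),
-take a layout of shape (8, 8) and a tile of shape <4, 2>.
-
-$$
-\begin{aligned}
-m &= M / M' = 8 / 4 = 2 \\
-n &= N / N' = 8 / 2 = 4 \\
-\text{Tiled Result shape} &= ((4, 2), 2, 4)
-\end{aligned}
-$$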
--- a/media/docs/cute/0t_mma_atom.md
+++ b/media/docs/cute/0t_mma_atom.md
@ -142,13 +142,13 @@ directory, in header files starting with `mma_traits`.

 An `MMA_Traits` specialization defines the following public type aliases.

-* `ElementDVal`: Compute type of the D matrix
+* `ValTypeD`: Compute type of the D matrix

-* `ElementAVal`: Compute type of the A matrix
+* `ValTypeA`: Compute type of the A matrix

-* `ElementBVal`: Compute type of the B matrix
+* `ValTypeB`: Compute type of the B matrix

-* `ElementCVal`: Compute type of the C matrix
+* `ValTypeC`: Compute type of the C matrix

 * `Shape_MNK`: Logical MxNxK shape of the MMA operation

@ -172,10 +172,10 @@ It looks like this.
 template <>
 struct MMA_Traits<SM70_8x8x4_F32F16F16F32_NT>
 {
-  using ElementDVal = float;
-  using ElementAVal = half_t;
-  using ElementBVal = half_t;
-  using ElementCVal = float;
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;

  using Shape_MNK = Shape<_8,_8,_4>;
  using ThrID   = SM70_QuadPair;
@ -207,10 +207,10 @@ We first take a look at how we would take the ISA semantics of thread and data p
 The HMMA NT above uses types:

 ```cpp
-  using ElementDVal = float;
-  using ElementAVal = half_t;
-  using ElementBVal = half_t;
-  using ElementCVal = float;
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
 ```

 The rest of the `MMA_Traits` will be described in units of these types.
--- a/media/images/cute/complement1.png
+++ b/media/images/cute/complement1.png
--- a/media/images/cute/composition1.png
+++ b/media/images/cute/composition1.png
--- a/media/images/cute/composition2.png
+++ b/media/images/cute/composition2.png
--- a/media/images/cute/divide1.png
+++ b/media/images/cute/divide1.png
--- a/media/images/cute/divide2.png
+++ b/media/images/cute/divide2.png
--- a/media/images/cute/divide3.png
+++ b/media/images/cute/divide3.png
--- a/media/images/cute/product1.png
+++ b/media/images/cute/product1.png
--- a/media/images/cute/product2.png
+++ b/media/images/cute/product2.png
--- a/media/images/cute/product2d.png
+++ b/media/images/cute/product2d.png
--- a/media/images/cute/productblocked2d.png
+++ b/media/images/cute/productblocked2d.png
--- a/media/images/cute/productraked2d.png
+++ b/media/images/cute/productraked2d.png