fix: fig link in cute docs (#2216)

Zhang_kg
2025-04-11 02:51:41 +08:00
committed by GitHub
parent b3f3c7758c
commit 5e497243f7
4 changed files with 25 additions and 25 deletions

@@ -250,7 +250,7 @@ We often use the `<LayoutA, LayoutB, ...>` notation to distinguish `Tiler`s from
The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
-<img src="../../images/cute/composition1.png" alt="composition1.png" height="250"/>
+<img src="../../../images/cute/composition1.png" alt="composition1.png" height="250"/>
</p>
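As a minimal sketch of by-mode composition (the layout and tiler here are hypothetical stand-ins for the figure's values, chosen only to yield a 3x8 sublayout):
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto a      = Layout<Shape<_6,_8>, Stride<_8,_1>>{};  // a hypothetical 6x8 row-major layout
auto tiler  = make_tile(Layout<_3,_2>{},              // apply 3:2 to mode-0
                        Layout<_8,_1>{});             // apply 8:1 to mode-1
auto result = composition(a, tiler);                  // (3,8):(16,1), a 3x8 sublayout of a
```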
For convenience, CuTe also interprets `Shape`s as tilers. A `Shape` is interpreted as a tuple of stride-1 layouts:
@@ -269,7 +269,7 @@ auto result = composition(a, tiler);
```
where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below.
<p align="center">
-<img src="../../images/cute/composition2.png" alt="composition2.png" height="250"/>
+<img src="../../../images/cute/composition2.png" alt="composition2.png" height="250"/>
</p>
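The same sketch driven by a `Shape` tiler instead (again with hypothetical values):
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto a      = Layout<Shape<_6,_8>, Stride<_8,_1>>{};  // a hypothetical 6x8 row-major layout
auto result = composition(a, Shape<_3,_8>{});         // same as make_tile(3:1, 8:1)
// result is (3,8):(8,1) -- the "top-left" 3x8 sublayout of a
```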
## Composition Tilers
@@ -324,7 +324,7 @@ The `cotarget` parameter above is most commonly an integer -- you can see we onl
* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices.
<p align="center">
-<img src="../../images/cute/complement1.png" alt="complement1.png" height="75"/>
+<img src="../../../images/cute/complement1.png" alt="complement1.png" height="75"/>
</p>
As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."
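This example can be checked directly with a small sketch:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto layout = Layout<Shape<_2,_2>, Stride<_1,_6>>{};  // (2,2):(1,6)
auto comp   = complement(layout, Int<24>{});          // (3,2):(2,12)
// Concatenating the layout with its complement covers [0,24) with unique indices
auto full   = make_layout(layout, comp);              // ((2,2),(3,2)):((1,6),(2,12))
```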
@@ -372,7 +372,7 @@ This is computed in the three steps described in the implementation above.
* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
-<img src="../../images/cute/divide1.png" alt="divide1.png" height="150"/>
+<img src="../../../images/cute/divide1.png" alt="divide1.png" height="150"/>
</p>
The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A`, each shown in a different color. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
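A sketch reproducing this example; the tiler `B = 4:2` is an assumption recovered from the stated composition result, not spelled out in this excerpt:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto A = Layout<Shape<_4,_2,_3>, Stride<_2,_1,_8>>{};  // (4,2,3):(2,1,8)
auto B = Layout<_4,_2>{};                              // 4:2 (assumed tiler)
auto R = logical_divide(A, B);                         // ((2,2),(2,3)):((4,1),(2,8))
```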
@@ -384,7 +384,7 @@ Using the `Tiler` concept defined above, this immediately generalizes to multidi
Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` and suppose we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.
<p align="center">
-<img src="../../images/cute/divide2.png" alt="divide2.png" height="450"/>
+<img src="../../../images/cute/divide2.png" alt="divide2.png" height="450"/>
</p>
The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A`, each shown in a different color. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or as simply a permutation on the rows and columns.
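The same computation as a sketch, using the `A` and `B` given above:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto A = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
                     make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
auto B = make_tile(Layout<_3,_3>{},                         // 3:3 down the columns
                   Layout<Shape<_2,_4>, Stride<_1,_8>>{});  // (2,4):(1,8) across the rows
auto R = logical_divide(A, B);  // each mode becomes (tile-of-that-mode, rest-of-that-mode)
```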
@@ -430,7 +430,7 @@ We note that `logical_divide` preserves the *semantics* of the modes while permu
This is not the case with `zipped_divide`. The mode-0 in the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we still can plot the resulting layout as 2-D as shown below.
<p align="center">
-<img src="../../images/cute/divide3.png" alt="divide3.png" height="450"/>
+<img src="../../../images/cute/divide3.png" alt="divide3.png" height="450"/>
</p>
We've kept each tile's color from the previous images for clarity. Clearly, iterating across tiles is now equivalent to iterating across a row of this layout and iterating over elements within a tile is equivalent to iterating down a column of this layout. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data.
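A sketch of `zipped_divide` with the same `A` and `B` as the 2-D example above:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto A = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
                     make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
auto B = make_tile(Layout<_3,_3>{},
                   Layout<Shape<_2,_4>, Stride<_1,_8>>{});
auto R = zipped_divide(A, B);  // ((Tile),(Rest)): mode-0 is one tile, mode-1 iterates over tiles
```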
@@ -477,7 +477,7 @@ This is computed in the three steps described in the implementation above.
* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`.
<p align="center">
-<img src="../../images/cute/product1.png" alt="product1.png" height="175"/>
+<img src="../../../images/cute/product1.png" alt="product1.png" height="175"/>
</p>
The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A`, and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
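A sketch of this product; `A = (2,2):(4,1)` follows from the concatenation above, while `B = 6:1` is an assumption consistent with the stated result:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto A = Layout<Shape<_2,_2>, Stride<_4,_1>>{};  // the tile
auto B = Layout<_6,_1>{};                        // 6:1 (assumed layout of repetitions)
auto R = logical_product(A, B);                  // ((2,2),(2,3)):((4,1),(2,8))
```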
@@ -487,7 +487,7 @@ Note that the result is identical to the result of the 1-D Logical Divide exampl
Of course, we can change the number and order of the tiles in the product by changing `B`.
<p align="center">
-<img src="../../images/cute/product2.png" alt="product2.png" height="175"/>
+<img src="../../../images/cute/product2.png" alt="product2.png" height="175"/>
</p>
For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order.
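The corresponding sketch with the `B` given above:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto A = Layout<Shape<_2,_2>, Stride<_4,_1>>{};  // the same tile as before
auto B = Layout<Shape<_4,_2>, Stride<_2,_1>>{};  // (4,2):(2,1): 8 tiles, reordered
auto R = logical_product(A, B);
```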
@@ -497,7 +497,7 @@ For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated til
We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well.
<p align="center">
-<img src="../../images/cute/product2d.png" alt="product2d.png" height="250"/>
+<img src="../../../images/cute/product2d.png" alt="product2d.png" height="250"/>
</p>
The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block tiled across a 3x4 column-major arrangement.
@@ -520,7 +520,7 @@ Because `A` is always compatible with mode-0 of the result and `B` is always com
This is exactly what `blocked_product` and `raked_product` do and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.
<p align="center">
-<img src="../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
+<img src="../../../images/cute/productblocked2d.png" alt="productblocked2d.png" height="250"/>
</p>
The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 column-major arrangement. Also note that `blocked_product` went ahead and `coalesced` mode-0 for us.
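A sketch of this call, assuming the 2x5 row-major tile and 3x4 column-major arrangement from the image:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto tile   = Layout<Shape<_2,_5>, Stride<_5,_1>>{};  // 2x5 row-major tile
auto matrix = Layout<Shape<_3,_4>, Stride<_1,_3>>{};  // 3x4 column-major arrangement
auto R = blocked_product(tile, matrix);               // rank-2 result with mode-0 coalesced
```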
@@ -528,7 +528,7 @@ The above image shows the same result as the `tiler` approach, but with much mor
Similarly, `raked_product` combines the modes slightly differently. Instead of the resulting "column" mode being constructed from the `A` "column" mode then the `B` "column" mode, the resulting "column" mode is constructed from the `B` "column" mode then the `A` "column" mode.
<p align="center">
-<img src="../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
+<img src="../../../images/cute/productraked2d.png" alt="productraked2d.png" height="250"/>
</p>
This results in the "tile" `A` now being interleaved or "raked" with the "layout-of-tiles" `B` instead of appearing as blocks. Other references call this a "cyclic distribution."
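The raked counterpart of the same sketch:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto tile   = Layout<Shape<_2,_5>, Stride<_5,_1>>{};  // 2x5 row-major tile
auto matrix = Layout<Shape<_3,_4>, Stride<_1,_3>>{};  // 3x4 column-major arrangement
auto R = raked_product(tile, matrix);                 // tile interleaved with the layout-of-tiles
```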

@@ -270,7 +270,7 @@ Tensor F = A(make_coord(2,_),make_coord(_,3,_));
```
<p align="center">
-<img src="../../images/cute/slice.png" alt="slice.png" height="300"/>
+<img src="../../../images/cute/slice.png" alt="slice.png" height="300"/>
</p>
In the image above, a `Tensor` is sliced in various ways and the subtensors generated by those slices are highlighted within the original tensor. Note that tensors `C` and `D` contain the same elements, but have different ranks and shapes due to the use of `_` versus the use of `make_coord(_,_)`. In each case, the rank of the result is equal to the number of `Underscore`s in the slicing coordinate.
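A minimal slicing sketch; the hierarchical shape here is a hypothetical stand-in rather than the figure's:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
float buf[2*3*4*5];
Tensor A = make_tensor(&buf[0],
                       make_shape(make_shape(Int<2>{}, Int<3>{}),   // shape ((2,3),(4,5))
                                  make_shape(Int<4>{}, Int<5>{})));
Tensor B = A(make_coord(1,_), make_coord(_,2));  // two Underscores => rank-2, shape (3,4)
Tensor C = A(1, _);                              // one Underscore  => rank-1, shape ((4,5))
```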
@@ -328,7 +328,7 @@ Tensor v = tv(threadIdx.x, _); // (4)
```
<p align="center">
-<img src="../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
+<img src="../../../images/cute/tv_layout.png" alt="tv_layout.png" height="300"/>
</p>
The above image is a visual representation of the above code. An arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. The result of the composition is on the right, where each thread's values are arranged across its row. The bottom layout depicts the inverse TV-layout, which shows the thread id and value id that each 4x8 logical coordinate maps to.
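A hedged reconstruction of the code this image describes; the concrete TV-layout is an assumption for illustration:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto tv_layout = Layout<Shape<_8,_4>, Stride<_4,_1>>{};  // assumed (thr,val) -> 1-D coord map
Tensor A  = make_tensor<float>(Shape<_4,_8>{});          // any 4x8 tensor of data
Tensor tv = composition(A, tv_layout);                   // (thr,val) view of A
Tensor v  = tv(threadIdx.x, _);                          // the 4 values owned by this thread
```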

@@ -209,7 +209,7 @@ Volta architecture implements an HMMA instruction where a group of 8 threads cal
We first look at how to take the ISA semantics of thread and data partitioning for the HMMA instruction and encode them in a Traits struct. The HMMA NT instruction has the thread-data layout:
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT.png" alt="HMMA.8x8x4.NT.png" height="400"/>
</p>
### Types
@@ -251,7 +251,7 @@ Again, this layout function maps the logical thread id [0,8) of the MMA operatio
Let us look at exactly how the 8 threads within a QP are mapped to the A, B and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is shown the whole QP level view, and on the right is shown the values owned by just thread 0.
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
</p>
The meta-information of this single-instruction-level view is what we want to encode in CuTe. Specifically, the QP-level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and `(m, n)` coordinates in the C matrix, which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.
@@ -321,7 +321,7 @@ In the case of F16 accumulators, the layout is way less complex. Each row of acc
A and B matrix layouts depend on whether the sources are transposed or not. The diagram below shows the thread ID to data ownership map for A and B matrices in the case of NT and TN transposes.
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.quadpair.AB.png" alt="HMMA.8x8x4.quadpair.AB.png" height="400"/>
</p>
Let's look at the TN layout for the A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each thread owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping `(m, k) -> m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)`, which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is:
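A sketch following directly from the shape and strides derived above:
```cpp
// Assumes using namespace cute;
// (logical_thr_id, logical_val_id) -> m + k * M, with M = 8
using ALayout = Layout<Shape<_8,_4>, Stride<_1,_8>>;
```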
@@ -376,7 +376,7 @@ Accumulators are mapped hierarchically in GMMA, starting from the concept of a c
Each core matrix has the layout shown in the diagram below.
<p align="center">
-<img src="../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
+<img src="../../../images/cute/gmma_coremat_cd_fp16.png" alt="gmma_coremat_cd_fp16.png" height="600"/>
</p>
As in the Volta examples, the thread IDs are logical only, and which of the four warps they belong to in the warpgroup is not important.
@@ -384,7 +384,7 @@ As in the Volta examples, the thread IDs are logical only, and which of the four
Then GMMA tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below.
<p align="center">
-<img src="../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
+<img src="../../../images/cute/gmma_wg_n_slice.png" alt="gmma_wg_n_slice.png" height="600"/>
</p>
With this image, we are again ready to start building the `CLayout` for the `SM90_64x128x16_F16F16F16F16_TN` atom. As before, we are constructing a mapping `(logical_thr_id, logical_val_id) -> (m, n)` between the two coordinate spaces.
@@ -453,7 +453,7 @@ MMA_Atom mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{};
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_Atom.png" alt="HMMA.8x8x4.NT_Atom.png" height="400"/>
</p>
The above is equivalent to
@@ -473,7 +473,7 @@ We can create an object akin to a WMMA by using four of these quadpair MMAs:
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2.png" alt="HMMA.8x8x4.NT_2x2.png" height="400"/>
</p>
This `TiledMMA` replicates the `MMA_Atom` across threads, as we can see from the `T4`, `T8`, and `T12` threads in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the atom's partitioning pattern for a new quadpair, and this replication follows a `(2,2):(2,1)` layout.
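A hedged sketch of how such a `TiledMMA` can be built; the `(2,2):(2,1)` atom layout comes from the text, the rest is an assumption:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto mma = make_tiled_mma(MMA_Atom<SM70_8x8x4_F32F16F16F32_NT>{},
                          Layout<Shape<_2,_2>, Stride<_2,_1>>{});  // replicate across 2x2 quadpairs
print_latex(mma);
```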
@@ -486,7 +486,7 @@ The above represents a 16x16x4 MMA now, but we can immediately expand this "tile
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32x32x4.png" alt="HMMA.8x8x4.NT_2x2_32x32x4.png" height="400"/>
</p>
This `TiledMMA` replicates the previous `TiledMMA` across values instead of threads. We can see the `T0V8`, `T8V8`, and `T16V8` values in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the previous `TiledMMA`'s partitioning pattern for a new set of values.
@@ -514,7 +514,7 @@ which are separate, but we might prefer them to be next to each other. That is w
print_latex(mma);
```
<p align="center">
-<img src="../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
+<img src="../../../images/cute/HMMA.8x8x4.NT_2x2_32Mx32x4.png" alt="HMMA.8x8x4.NT_2x2_32Mx32x4.png" height="400"/>
</p>
That layout `(4,4,2):(1,8,4)` is read like a scatter permutation, telling the m-coords of the original image where to go in the new image.

@@ -335,7 +335,7 @@ These thread layouts are then used to partition the tiles of data in global memo
where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`.
<p align="center">
-<img src="../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
+<img src="../../../images/cute/tC_partitioning.png" alt="tC_partitioning.png" height="300"/>
</p>
This diagram shows a `tC` layout, highlights two threads in green and blue, shows the projections of the `tC` layout, and finally highlights the subtensors within `sA`, `sB`, and `gC` that `tCsA`, `tCsB`, and `tCgC` represent.
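A hedged sketch of that projection-style partitioning, assuming tensors `sA` of shape (BLK_M,BLK_K), `sB` of shape (BLK_N,BLK_K), `gC` of shape (BLK_M,BLK_N), and a rank-2 thread layout `tC`:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
Tensor tCsA = local_partition(sA, tC, threadIdx.x, Step<_1, X>{});  // (THR_M,THR_K): N-mode projected out
Tensor tCsB = local_partition(sB, tC, threadIdx.x, Step< X,_1>{});  // (THR_N,THR_K): M-mode projected out
Tensor tCgC = local_partition(gC, tC, threadIdx.x, Step<_1,_1>{});  // (THR_M,THR_N)
```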
@@ -391,7 +391,7 @@ As a first example, let's look at the `TiledCopy` that `gemm_nt` generates.
```
The easiest way to see what this `TiledCopy` does is to look at the partition pattern in LaTeX.
<p align="center">
-<img src="../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
+<img src="../../../images/cute/TiledCopyA.png" alt="TiledCopyA.png" height="300"/>
</p>
On the left is the source-tensor partitioning and on the right is the destination-tensor partitioning. The partition patterns are the same for this case, but there exist PTX instructions which require different patterns in the source and destination. The diagram shows that each thread reads 4x1 `TA` elements and there are 32x8 threads. The `UniversalCopy<uint128_t>` forces the instruction to use a 128-bit copy instruction. If the partition (of `sA` or `gA` in this case) does not result in 4 `TA` elements that can be vectorized to a 128-bit load/store, then CuTe will statically fail with an error message to that effect.
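A hedged sketch of such a `TiledCopy`, assuming `TA` is the tile's element type (e.g. `float`):
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto copyA = make_tiled_copy(Copy_Atom<UniversalCopy<uint128_t>, TA>{},  // one 128-bit copy per thread
                             Layout<Shape<_32,_8>>{},   // 32x8 thread layout
                             Layout<Shape< _4,_1>>{});  // 4x1 values per thread
```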
@@ -422,7 +422,7 @@ As a first example, let's look at the `TiledMMA` that `gemm_nt` generates.
```
The easiest way to see what this `TiledMMA` does is to look at the partition pattern in LaTeX.
<p align="center">
-<img src="../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
+<img src="../../../images/cute/TiledMmaC.png" alt="TiledMmaC.png" height="300"/>
</p>
On the left is the A-tensor partitioning, on the top is the B-tensor partitioning, and in the middle is the C-tensor partitioning. Because the `UniversalFMA` is a 1x1x1 MMA instruction, a 16x16x1 tiling of them results in a 16x16x1 `TiledMMA`. Other MMA instructions will have different threads involved and have different instruction sizes. In this case, each thread reads a single element from `A`, `B`, and `C`.
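A hedged sketch of such a `TiledMMA`, assuming `TA`, `TB`, and `TC` are the element types:
```cpp
// Assumes #include <cute/tensor.hpp> and using namespace cute;
auto mmaC = make_tiled_mma(UniversalFMA<TC,TA,TB>{},      // a 1x1x1 MMA "instruction"
                           Layout<Shape<_16,_16,_1>>{});  // tiled across 16x16x1 threads
```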