v3.9 update (#2203)
* v3.9 update
* voidD

Co-authored-by: yuzhai <yuzhai@nvidia.com>
@ -76,8 +76,8 @@ __device__ clc_dynamic_persistent_kernel(...) {
### Cluster Launch Control Pipeline Class

Please refer to the `PipelineCLCFetchAsync` pipeline class defined in [Cluster launch control pipeline class](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/pipeline/sm100_pipeline.hpp). Cluster launch control queries can be pipelined and managed by an asynchronous pipeline with a producer-consumer relationship (see the
[pipeline](pipeline.md) document). The producer is the scheduler warp of the 0th CTA in the cluster, and the consumers are all warps that need `ClcID`s.

To set up a CLC pipeline correctly, we need to make sure the params are set to the right values:
@ -88,18 +88,18 @@ To setup a CLC pipeline correctly, we need to make sure the params are set to th

### Dynamic tile scheduler class

Please refer to the `PersistentTileSchedulerSm100` class defined in the [sm100 dynamic persistent tile scheduler](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp).

There are two important methods of the CLC scheduler class. The first is `advance_to_next_work`, which is intended to be executed by one elected thread from the scheduler warp. It effectively issues the CLC query, and the CLC query response is broadcast to the same shared memory address in all CTAs of the cluster.

The other method is `get_current_work`. It simply loads the CLC response from the shared memory buffer indexed by a pipeline state.
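
The fragment below sketches how these two methods are typically combined with the CLC pipeline inside a persistent loop. It is a simplified, hypothetical illustration, not the actual kernel code: the variable names (`scheduler`, `clc_pipeline`, `clc_producer_state`, `clc_consumer_state`, `is_scheduler_warp`, `cluster_shape`) and the exact call signatures are placeholders; see the kernel and unit test referenced below for the real implementation.

```cpp
// Hypothetical persistent loop; names and signatures are illustrative only.
auto work_tile_info = scheduler.initial_work_tile_info(cluster_shape);
while (work_tile_info.is_valid()) {
  // Scheduler warp (CTA 0 of the cluster): one elected thread issues the next
  // CLC query, filling the pipeline stage selected by the producer state.
  if (is_scheduler_warp) {
    scheduler.advance_to_next_work(clc_pipeline, clc_producer_state);
    ++clc_producer_state;
  }

  // ... mainloop and epilogue work for the current output tile ...

  // All warps that need the next ClcID: wait for the broadcast response,
  // read it from shared memory, and release the pipeline stage.
  clc_pipeline.consumer_wait(clc_consumer_state);
  work_tile_info = scheduler.get_current_work(clc_consumer_state);
  clc_pipeline.consumer_release(clc_consumer_state);
  ++clc_consumer_state;
}
```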

The CLC pipeline and scheduler classes are used together to ensure correct functionality and the necessary synchronization of the CLC feature. Please refer to the [cluster launch control pipeline unit test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/pipeline/pipeline_cluster_launch_control_async_warp_specialized_blackwell.cu).

## Blackwell Warp-specialized Persistent Kernel

Now, let's take a look at how the CLC feature is used in our [Blackwell dense GEMM kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp).

This particular warp-specialized kernel has the following warp assignment:
@ -77,10 +77,10 @@ All four layouts (TT, NN, NT, TT) are supported for all legacy data types.

| | A Type | B Type | AB Layout | A Alignment | B Alignment | Target tcgen05.mma.kind | Unit Test |
|-------------------------------|------------|------------|----------------|-------------|-------------|-------------------------|-----------|
|1 | tfloat32_t | tfloat32_t | TN, NN, NT, TT | 4 | 4 | tf32 | |
|2 | half_t | half_t | TN, NN, NT, TT | 8 | 8 | f16 | [Unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f16_f16_void_f32.cu)|
|3 | bfloat16_t | bfloat16_t | TN, NN, NT, TT | 8 | 8 | f16 | [Similar to half_t unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f16_f16_void_f32.cu)|
|4 | int8_t | int8_t | TN, NN, NT, TT | 16 | 16 | i8 | [Unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/s8_s8_void_s32.cu)|
|5 | uint8_t | uint8_t | TN, NN, NT, TT | 16 | 16 | i8 | [Similar to int8_t unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/s8_s8_void_s32.cu)|

For narrow precision MMAs, not all A/B type and A/B layout combinations are supported by every `tcgen05.mma` instruction.
Furthermore, tensor copy instructions for subbyte types impose additional alignment requirements while loading narrow-precision
@ -93,31 +93,31 @@ instructions supported by CUTLASS.

**Table 2: Valid Data Type, Alignment, and Layout Combinations For Narrow Precision MMAs Without Block Scaling** <a id="non_bs_gemm_table" name="non_bs_gemm_table"></a>

| | A Type | B Type | AB Layout | A Alignment | B Alignment | Target tcgen05.mma.kind | Unit Test |
|-------------------------------|----------|----------|----------------|-------------|-------------|-------------------------|-----------|
|[1](#nonbs_rows_1_2_3_6) | float4_t | float4_t | TN, NN, NT, TT | 128 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nt_layout.cu) <br> [NN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nn_layout.cu) <br> [TT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tt_layout.cu) |
|[2](#nonbs_rows_1_2_3_6) | float4_t | float6_t | TN, NN, NT, TT | 128 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nt_layout.cu) <br> [NN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nn_layout.cu) <br> [TT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tt_layout.cu) |
|[3](#nonbs_rows_1_2_3_6) | float6_t | float4_t | TN, NN, NT, TT | 128 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nt_layout.cu) <br> [NN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nn_layout.cu) <br> [TT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tt_layout.cu) |
|[4](#nonbs_rows_4_7) | float4_t | float8_t | TN, NN, NT, TT | 128 | 16 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f8_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f8_void_f32_nt_layout.cu) |
|[5](#nonbs_rows_5_8) | float8_t | float4_t | TN, NN, NT, TT | 16 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f8_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f8_f6f4_void_f32_nt_layout.cu) |
|[6](#nonbs_rows_1_2_3_6) | float6_t | float6_t | TN, NN, NT, TT | 128 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nt_layout.cu) <br> [NN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_nn_layout.cu) <br> [TT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f6f4_void_f32_tt_layout.cu) |
|[7](#nonbs_rows_4_7) | float6_t | float8_t | TN, NN, NT, TT | 128 | 16 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f8_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f6f4_f8_void_f32_nt_layout.cu) |
|[8](#nonbs_rows_5_8) | float8_t | float6_t | TN, NN, NT, TT | 16 | 128 | f8f6f4 | [TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f8_f6f4_void_f32_tn_layout.cu) <br> [NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/narrow_precision/f8_f6f4_void_f32_nt_layout.cu) |
|[9](#nonbs_rows_9) | float8_t | float8_t | TN, NN, NT, TT | 16 | 16 | f8f6f4 | [Unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f8_f8_void_f32.cu)|

**Table 3: Valid Data Type, Alignment, and Layout Combinations for Block Scaled Narrow Precision MMAs** <a id="bs_gemm_table" name="bs_gemm_table"></a>

| | A Type | B Type | AB Layout | A Alignment | B Alignment | Target tcgen05.mma.kind |Unit Test|
|-------------------------|-------------|-------------|----------------|-------------|-------------|-------------------------|------|
|[1](#bs_rows_1) | nv_float4_t | nv_float4_t | TN | 32 | 32 | mxf4nvf4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/nvf4_nvf4_bf16_bf16.cu)|
|[2](#bs_rows_2) | mx_float4_t | mx_float4_t | TN | 32 | 32 | mxf4, mxf4nvf4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf4_void_f16_tn_layout.cu)|
|[3](#bs_rows_3) | mx_float4_t | mx_float4_t | TN, NN, NT, TT | 128 | 128 | mxf8f6f4 |[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf4_void_f16_nt_layout.cu)|
|[4](#bs_rows_4_5_7_8_10) | mx_float4_t | mx_float6_t | TN, NN, NT, TT | 128 | 128 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf6_f32_f16_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf6_f32_f16_nt_layout.cu)|
|[5](#bs_rows_4_5_7_8_10) | mx_float6_t | mx_float4_t | TN, NN, NT, TT | 128 | 128 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf4_f16_f16_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf4_f16_f16_nt_layout.cu)|
|[6](#bs_rows_6_9_11) | mx_float4_t | mx_float8_t | TN, NN, NT, TT | 128 | 16 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf8_bf16_bf16_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf4_mxf8_bf16_bf16_nt_layout.cu)|
|[7](#bs_rows_4_5_7_8_10) | mx_float8_t | mx_float4_t | TN, NN, NT, TT | 16 | 128 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf4_f16_bf16_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf4_f16_bf16_nt_layout.cu)|
|[8](#bs_rows_4_5_7_8_10) | mx_float6_t | mx_float6_t | TN, NN, NT, TT | 128 | 128 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf6_void_bf16_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf6_void_bf16_nt_layout.cu)|
|[9](#bs_rows_6_9_11) | mx_float6_t | mx_float8_t | TN, NN, NT, TT | 128 | 16 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf8_void_f32_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf6_mxf8_void_f32_nt_layout.cu)|
|[10](#bs_rows_4_5_7_8_10)| mx_float8_t | mx_float6_t | TN, NN, NT, TT | 16 | 128 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf6_f16_f8_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf6_f16_f8_nt_layout.cu)|
|[11](#bs_rows_6_9_11) | mx_float8_t | mx_float8_t | TN, NN, NT, TT | 16 | 16 | mxf8f6f4 |[TN unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf8_void_f8_tn_layout.cu)<br>[NT unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_tensorop_gemm/mxf8_mxf8_void_f8_nt_layout.cu)|

## MMA tile shapes supported
@ -327,18 +327,18 @@ Similarly for epilogues, we can use `cutlass::epilogue::collective::EpilogueSche
## Building a Block Scaled Kernel <a id="detailed_blockscale_example" name="detailed_blockscale_example"></a>

For non-blockscaled dense GEMM, refer to the [quick start page](quickstart.md#instantiating-a-blackwell-sm100-gemm-kernel). An example dense GEMM can be found here:
1. [Blackwell FP16 GEMM example](https://github.com/NVIDIA/cutlass/tree/main/examples/70_blackwell_gemm/).

Narrow precision and block scaled narrow precision kernels can be built using the CUTLASS 3.x collective builder interface
(as described in [CUTLASS 3.0 GEMM API](gemm_api_3x.md#cutlass-30-gemm-api)). However, special attention needs to be given to
the A and B matrix layouts, alignment requirements, and dispatch policies listed above to obtain a functionally correct and performant kernel.

Several examples of block scaled kernels can be found in the [examples/72_blackwell_narrow_precision_gemm](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/) directory:
1. [NVF4 GEMM with block scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu)
2. [NVF4 GEMM with block scaling and NVF4 output matrix](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu)
3. [Mixed precision NVF4 x MXF8 GEMM with block scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm.cu)

The collective builder interface expects the same arguments as any other CUTLASS 3.x kernel, as described
[here](gemm_api_3x.md#collective-builder-for-collectivemmas), with a small difference for the collective MMA builder interface.
@ -508,30 +508,24 @@ Typically, GmemLayoutSFD would be same as the GmemLayoutD.
```

The above example is a gentle introduction to using the fusion operations in the epilogue. For a more detailed example, see
[Blackwell GEMM with collective builder](https://github.com/NVIDIA/cutlass/tree/main/examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu).

Note that we have first discussed the CollectiveMainloop, then the CollectiveEpilogue, for clarity.
However, the CollectiveMainloop needs to know the SMEM utilization of the epilogue, so the CollectiveEpilogue needs to be set up before the CollectiveMainloop. See the [examples/72_blackwell_narrow_precision_gemm](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/) directory for the full kernel and run setup.

### Scale Factor Layouts

The scale factor layout consists of a 512B basic-block structure, as illustrated in the diagram below. Each block holds scale factors for 128 rows (or columns) of the M/N dimension and 4 scale factors (SF) along the K dimension.
The byte order of the basic storage chunk is row-major, meaning that M0SF0 to M0SF3, M32SF0 to M32SF3, M64SF0 to M64SF3, and M96SF0 to M96SF3 are stored consecutively in GMEM.

<p align="center">
  <img src="../images/M128xK4_scalefactor_gmem.png" alt="M128xK4_scalefactor_gmem.png"/>
</p>

If the scale factor tensor exceeds M128xSF4, there are multiple basic blocks along both the M and SFK dimensions, and the arrangement of these basic blocks follows a K-major order. Here is a diagram illustrating the scenario where M equals 512 and SFK equals 16, i.e. (512/128) x (16/4) = 4 x 4 = 16 basic blocks.

<p align="center">
  <img src="../images/narrow_precison_multiple_block_sf_layout.png" alt="narrow_precison_multiple_block_sf_layout.png"/>
</p>

The creation of the scale factor tensors' layouts is tedious. CUTLASS provides `Sm1xxBlockScaledConfig` to create these layouts easily
(see [sm100_blockscaled_layout.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/detail/sm100_blockscaled_layout.hpp)).
The interface to create SFA and SFB tensor layouts is as follows:

```cpp
@ -548,6 +542,77 @@ auto tensor_sfb = make_tensor(bptr, layout_sfb);
// Access SF for element m,k of A tensor
auto val_a_mk = tensor_sfa(make_coord(m,k,0));
```
# Blackwell SM120 GEMMs
The NVIDIA RTX 5000 Series GPUs introduce support for new narrow precision (4-bit and 6-bit) block-scaled and non-block-scaled tensor cores. The PTX ISA has extended the `mma` instructions to support these data formats, which are 1x to 4x faster than the Ada architecture's FP8 tensor cores. For more detailed information, see the [`mma` PTX documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#multiply-and-accumulate-instruction-mma).
CUTLASS 4.0 has added support for these newly introduced narrow precision GEMMs. Similar to the Blackwell SM100 GEMMs, the SM120 GEMMs can be built using the collective builder interface. See examples in [examples/79_blackwell_geforce_gemm/](../../examples/79_blackwell_geforce_gemm/) and unit tests listed below.
The supported data types and tensor alignment requirements are the same as for the Blackwell SM100 GEMMs. The scale factor layout is also the same as the SM100 layout described above. `OpClassTensorOp` is used for non-blockscaled narrow precision GEMMs and `OpClassBlockScaledTensorOp` is used for blockscaled narrow precision GEMMs.
| PTX Instruction | Throughput | Notes | Unit Test |
|---------------------------------------------------------------------|----------------------------|-------|-----------|
|mma.sync.aligned.kind::f8f6f4 | 1x Ada FP8 Tensor Core (2x for FP32 accumulator) | Mixed precision MMA with A={f4,f6,f8} x B={f4,f6,f8} TN layouts | [unit test](../../test/unit/gemm/device/sm120_tensorop_gemm/) |
|mma.sync.aligned.kind::mxf8f6f4.block_scale | 1x Ada FP8 Tensor Core (2x for FP32 accumulator) | Block scaled mixed precision MMA with A={mxf4,mxf6,mxf8} x B={mxf4,mxf6,mxf8} with TN layouts | [unit test](../../test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/sm120_bs_gemm_mxf6_mxf8_f32_f32.cu) |
|mma.sync.aligned.kind::mxf4.block_scale | 2x Ada FP8 Tensor Core (4x for FP32 accumulator) | Block scaled MMA with A={mxf4} x B={mxf4} with TN layouts | [unit test](../../test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/sm120_bs_gemm_mxf4_mxf4_f32_f32.cu) |
|mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::[2X\|4X] | 2x Ada FP8 Tensor Core (4x for FP32 accumulator) | Block scaled MMA with A={mxf4} x B={mxf4} or A={nvf4} x B={nvf4} with TN layouts | [unit test](../../test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/sm120_bs_gemm_nvf4_nvf4_f32_f32.cu) |
Besides the similarities, there are some key differences from the Blackwell SM100 GEMMs:

## Cluster Size

On GeForce series graphics cards, there is no multicast feature; therefore, the cluster shape is fixed to 1x1x1.

## Tensor Layout

Only the TN layout is supported: matrix A is row-major and matrix B is column-major.
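
In collective builder terms this corresponds to the following layout tags (the alias names `LayoutATag` and `LayoutBTag` are placeholders for whatever names a particular kernel configuration uses):

```cpp
// TN is the only valid combination for SM120 narrow-precision GEMMs.
using LayoutATag = cutlass::layout::RowMajor;     // "T": A is row-major
using LayoutBTag = cutlass::layout::ColumnMajor;  // "N": B is column-major
```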

## Pingpong vs. cooperative kernel schedule

Similar to Hopper's warp-group GEMM, SM120 GEMMs support both pingpong and cooperative kernel schedules. The pingpong kernel schedule has two groups of 4 MMA warps working on different output tiles, overlapping the mainloop and epilogue, while the cooperative kernel schedule has one group of 8 MMA warps working on the same output tile. If `KernelScheduleAuto` is specified, `KernelTmaWarpSpecializedCooperative` is selected by default.

## Epilogue schedule

`EpilogueScheduleAuto` must be used.

## Tile size

Below are tables that summarize the valid tile shapes and dispatch policies for SM120 GEMMs. If the output is `float6_t`, the tile size in the leading dimension of the output tensor must be 128.

**Table 16: Valid Tile Shapes and Dispatch Policies for {float8_t, float6_t, float4_t} x {float8_t, float6_t, float4_t} of SM120 GEMMs**
| Mma Tile Shape | TN | TT | NT | NN | Dispatch Policy |
|----------------|----|----|----|----|------------------------------------|
| 64x64x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
| 64x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
| 128x64x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |

**Table 17: Valid Tile Shapes for nv_float4_t x nv_float4_t of SM120 GEMMs**
| Mma Tile Shape | TN | TT | NT | NN | Dispatch Policy |
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
| 256x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedCooperative` |
| 128x128x256 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |

**Table 18: Valid Tile Shapes and Dispatch Policies for mx_float4_t x mx_float4_t of SM120 GEMMs**
| Mma Tile Shape | TN | TT | NT | NN | Dispatch Policy |
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |
| 256x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedCooperative` |
| 128x128x256 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |

**Table 19: Valid Tile Shapes and Dispatch Policies for mx_float4_t x mx_float4_t of SM120 GEMMs**
| Mma Tile Shape | TN | TT | NT | NN | Dispatch Policy |
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedMxf8f6f4Sm120` or `KernelTmaWarpSpecializedPingpongMxf8f6f4Sm120` |
| 256x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedMxf8f6f4Sm120` |
| 128x128x256 | Y | N | N | N | `KernelTmaWarpSpecializedMxf8f6f4Sm120` or `KernelTmaWarpSpecializedPingpongMxf8f6f4Sm120` |

Specialized policies must be used to generate mixed-input-datatype `mx_float4_t` kernels.

**Table 20: Valid Tile Shapes and Dispatch Policies for {mx_float4_t, mx_float6_t, mx_float8_t} x {mx_float4_t, mx_float6_t, mx_float8_t}**
| Mma Tile Shape | TN | TT | NT | NN | Dispatch Policy |
|----------------|----|----|----|----|------------------------------------|
| 128x128x128 | Y | N | N | N | `KernelTmaWarpSpecializedPingpong` or `KernelTmaWarpSpecializedCooperative` |

# Copyright
@ -1,5 +1,3 @@
# Building on Windows with Visual Studio

CUTLASS 3.2 reintroduces support for the Microsoft Visual Studio compiler on Windows.
@ -1,5 +1,3 @@
# Building with Clang as host compiler

CUTLASS 3.2(.1) reintroduces support for building with
@ -1,6 +1,4 @@
# CUTLASS Code Organization
@ -78,13 +76,13 @@ include/ # Top-level include directory. Client applications
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
```

See [Programming Guidelines](programming_guidelines.md) for further details about
conventions and design patterns used throughout CUTLASS.

## CuTe

CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations. More documentation
for CuTe can be found in [`cute/`](cute/index).

## Tools
@ -138,7 +136,7 @@ and may be built as follows.
$ make cutlass_profiler -j
```

[Further details about the CUTLASS Profiler are described here.](profiler.md)

### CUTLASS Utilities
@ -166,7 +164,7 @@ tools/
*
```

[More details about CUTLASS Utilities may be found here.](utilities.md)

## Examples
@ -30,22 +30,22 @@ and how to launch kernels.

CuTe's tests and examples build and run as part of CUTLASS's normal build process.

CuTe's unit tests live in the [`test/unit/cute`](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute) subdirectory.

CuTe's examples live in the [`examples/cute`](https://github.com/NVIDIA/cutlass/tree/main/examples/cute) subdirectory.

## Library Organization

CuTe is a header-only C++ library, so there is no source code that needs building. Library headers are contained within the top level [`include/cute`](https://github.com/NVIDIA/cutlass/tree/main/include/cute) directory, with components of the library grouped by directories that represent their semantics.

| Directory | Contents |
|------------------------|------------------------|
| [`include/cute`](https://github.com/NVIDIA/cutlass/tree/main/include/cute) | Each header in the top level corresponds to one of the fundamental building blocks of CuTe, such as [`Layout`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/layout.hpp) and [`Tensor`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/tensor.hpp). |
| [`include/cute/container`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/container) | Implementations of STL-like objects, such as tuple, array, and aligned array. |
| [`include/cute/numeric`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/numeric) | Fundamental numeric data types that include nonstandard floating-point types, nonstandard integer types, complex numbers, and integer sequence. |
| [`include/cute/algorithm`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm) | Implementations of utility algorithms such as copy, fill, and clear that automatically leverage architecture-specific features if available. |
| [`include/cute/arch`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch) | Wrappers for architecture-specific matrix-matrix multiply and copy instructions. |
| [`include/cute/atom`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom) | Meta-information for instructions in `arch` and utilities like partitioning and tiling. |

## Tutorial
@ -103,7 +103,7 @@ if (thread0()) {

Some algorithms depend on some thread or threadblock,
so you may need to print on threads or threadblocks other than zero.
The header file
[`cute/util/debug.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/util/debug.hpp),
among other utilities,
includes the function `bool thread(int tid, int bid)`
that returns `true` if running on thread `tid` and threadblock `bid`.
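
For example, to print only from a specific thread and threadblock (a small illustrative kernel; the layout being printed is arbitrary):

```cpp
#include <cute/tensor.hpp>

__global__ void inspect_kernel() {
  auto layout = cute::make_layout(cute::make_shape(cute::Int<4>{}, cute::Int<8>{}));
  // Print only from thread 7 of threadblock 0 to avoid interleaved output.
  if (cute::thread(7, 0)) {
    cute::print(layout);
    cute::print("\n");
  }
}
```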
@ -33,17 +33,17 @@ CuTe provides a number of traits to work with integers.
* `cute::is_static<T>`: Checks whether `T` is an empty type (so instantiations cannot depend on any dynamic information). Equivalent to `std::is_empty`.
* `cute::is_constant<N,T>`: Checks that `T` is a static integer AND its value is equivalent to `N`.

See the [`integral_constant` implementations](https://github.com/NVIDIA/cutlass/tree/main/include/cute/numeric/integral_constant.hpp) for more information.
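
For example (a compile-time check; the particular types are arbitrary):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

static_assert(is_static<_4>::value);        // Int<4> encodes its value in the type
static_assert(!is_static<int>::value);      // a plain int is dynamic
static_assert(is_constant<4, _4>::value);   // _4 is a static integer equal to 4
static_assert(!is_constant<4, int>::value); // a dynamic int is never a static constant
```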
### Tuple

A tuple is a finite ordered list of zero or more elements.
The [`cute::tuple` class](https://github.com/NVIDIA/cutlass/tree/main/include/cute/container/tuple.hpp) behaves like `std::tuple`, but works on device and host. It imposes restrictions on its template arguments and strips down the implementation for performance and simplicity.
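
For example (host-side; the values are arbitrary):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void tuple_example() {
  auto t = make_tuple(int{2}, Int<3>{}, make_tuple(Int<4>{}, 5)); // mixes dynamic and static values
  auto x = get<1>(t);   // Int<3>{}, a static integer
  auto r = rank(t);     // Int<3>{}: three top-level elements
  auto d = depth(t);    // Int<2>{}: one level of tuple nesting below the top
  (void)x; (void)r; (void)d;
}
```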

### IntTuple

CuTe defines the IntTuple concept as either an integer, or a tuple of IntTuples. Note the recursive definition.
In C++, we define [operations on `IntTuple`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/int_tuple.hpp).

Examples of `IntTuple`s include:
* `int{2}`, the dynamic integer 2.
@ -53,7 +53,7 @@ Examples of `IntTuple`s include:

CuTe reuses the `IntTuple` concept for many different things,
including Shape, Stride, Step, and Coord
(see [`include/cute/layout.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/layout.hpp)).

Operations defined on `IntTuple`s include the following.
@ -17,7 +17,7 @@ In the previous section, we summarized `Layout`s with
The `coalesce` operation is a "simplify" on functions from integers to integers. If we only care about input integers, then we can manipulate the shape and number of modes of the `Layout` without changing it as a function. The only thing `coalesce` can't change is the `Layout`'s `size`.
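
For example (the particular layout is arbitrary):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void coalesce_example() {
  // (2,4):(1,2) visits memory contiguously (0,1,...,7), so it simplifies to 8:1.
  auto layout = make_layout(make_shape(Int<2>{}, Int<4>{}), make_stride(Int<1>{}, Int<2>{}));
  auto result = coalesce(layout);   // Layout 8:1 -- same size, same function, fewer modes
  (void)result;
}
```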

More specifically, you can find the checked post-conditions in [the `coalesce` unit test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/core/coalesce.cpp), which we'll reproduce here:
```cpp
// @post size(@a result) == size(@a layout)
// @post depth(@a result) <= 1
@ -116,7 +116,7 @@ compatible(B, R)
That is, every coordinate of `B` can also be used as a coordinate of `R`. This is an expected property of functional composition because `B` defines the *domain* of `R`.
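
For example (a small static example):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void composition_example() {
  auto A = make_layout(Int<20>{}, Int<2>{});  // 20:2
  auto B = make_layout(Int<5>{},  Int<4>{});  // 5:4
  auto R = composition(A, B);                 // 5:8, since R(i) = A(B(i)) = 2 * (4 * i)
  (void)R;
}
```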

You can find many examples and checked post-conditions in [the `composition` unit test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/core/composition.cpp). The post-conditions are precisely as we just stated.
```cpp
// @post compatible(@a layout_b, @a result)
// @post for all i, 0 <= i < size(@a layout_b), @a result(i) == @a layout_a(@a layout_b(i)))
@ -289,7 +289,7 @@ Before getting to "product" and "divide," we need one more operation. We can thi

The `complement` of a layout attempts to find another layout that represents the "rest" -- the elements that aren't touched by the layout.

You can find many examples and checked post-conditions in [the `complement` unit test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/core/complement.cpp). The post-conditions include
```cpp
// @post cosize(make_layout(@a layout_a, @a result))) >= size(@a cotarget)
// @post cosize(@a result) >= round_up(size(@a cotarget), cosize(@a layout_a))
@ -309,7 +309,7 @@ The `cotarget` parameter above is most commonly an integer -- you can see we onl

### Complement Examples

`complement` is most effective on static shapes and strides, so consider all integers below to be static. Similar examples for dynamic shapes and strides as well as IntTuple `cotarget` can be found in [the unit test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/core/complement.cpp).

* `complement(4:1, 24)` is `6:4`. Note that `(4,6):(1,4)` has cosize `24`. The layout `4:1` is effectively repeated 6 times with `6:4`.
@ -4,7 +4,7 @@ This section summarizes the interfaces and implementations

of common numerical algorithms performed on `Tensor`s.

The implementation of these algorithms may be found in the
[include/cute/algorithm/](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/)
directory.

## `copy`
@ -12,7 +12,7 @@ directory.

CuTe's `copy` algorithm copies the elements of a source `Tensor`
into the elements of a destination `Tensor`.
The various overloads of `copy` can be found in
[`include/cute/algorithm/copy.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/copy.hpp).
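
A minimal host-side illustration (the tensors are just views over plain arrays):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void copy_example() {
  float src_data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  float dst_data[8] = {};

  auto src = make_tensor(&src_data[0], make_layout(Int<8>{}));
  auto dst = make_tensor(&dst_data[0], make_layout(Int<8>{}));

  copy(src, dst);   // element-wise copy from src into dst
}
```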

### Interface and specialization opportunities
@ -82,7 +82,7 @@ such as `cp.async`, or its C++ interface `memcpy_async`.

In that case, users will need to perform
the additional synchronization appropriate to that underlying implementation
before they may use the results of the `copy` algorithm.
[The CuTe GEMM tutorial example](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/)
shows one such synchronization method.
More optimized GEMM implementations use pipelining techniques
to overlap asynchronous `copy` operations with other useful work.
@ -129,7 +129,7 @@ CuTe's optimized copy implementations can do all of these.

## `copy_if`

CuTe's `copy_if` algorithm lives in the same header as `copy`,
[`include/cute/algorithm/copy.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/copy.hpp).
The algorithm takes source and destination `Tensor` parameters like `copy`,
but it also takes a "predication `Tensor`"
with the same shape as the input and output.
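
A minimal illustration (the predicate tensor here simply masks out odd indices):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void copy_if_example() {
  float src_data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  float dst_data[8] = {};
  bool  prd_data[8] = {true, false, true, false, true, false, true, false};

  auto src = make_tensor(&src_data[0], make_layout(Int<8>{}));
  auto dst = make_tensor(&dst_data[0], make_layout(Int<8>{}));
  auto prd = make_tensor(&prd_data[0], make_layout(Int<8>{}));

  copy_if(prd, src, dst);   // dst(i) = src(i) only where prd(i) is true
}
```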
@ -195,7 +195,7 @@ for different architectures, please refer to the

## `axpby`

The `axpby` algorithm lives in the header file
[`include/cute/algorithm/axpby.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/axpby.hpp).
It assigns to $y$ the result of $\alpha x + \beta y$,
where $\alpha$ and $\beta$ are scalars and $x$ and $y$ are `Tensor`s.
The name stands for "Alpha times X Plus Beta times Y,"
@ -205,21 +205,21 @@ and is a generalization of the original BLAS "AXPY" routine

## `fill`

The `fill` algorithm lives in the header file
[`include/cute/algorithm/fill.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/fill.hpp).
It overwrites the elements of its `Tensor` output argument
with a given scalar value.

## `clear`

The `clear` algorithm lives in the header file
[`include/cute/algorithm/clear.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/clear.hpp).
It overwrites the elements of its `Tensor` output argument with zeros.
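
For example (host-side; the tensor is a view over a plain array):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void fill_clear_example() {
  float data[8];
  auto t = make_tensor(&data[0], make_layout(Int<8>{}));

  fill(t, 1.5f);   // every element becomes 1.5f
  clear(t);        // every element becomes 0
}
```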

## Other algorithms

CuTe provides other algorithms.
Their header files can be found in the
[`include/cute/algorithm`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm)
directory.

## Copyright
@ -66,7 +66,7 @@ including

#### Location of files

CuTe provides its Operations structs in the
[`include/cute/arch`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch)
directory, in header files starting with `mma`.

#### Operation struct's name
@ -84,7 +84,7 @@ These often include

For example, the Volta section below will refer to the
`SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in
[`include/cute/arch/mma_sm70.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp).

* "SM70" refers to Volta.
@ -111,7 +111,7 @@ An Operation struct has the following members.

An Operation struct has four public type aliases:
`DRegisters`, `ARegisters`, `BRegisters`, and `CRegisters`.
For example, the `SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in
[`include/cute/arch/mma_sm70.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp)
defines these as follows.

```c++
@ -145,7 +145,7 @@ can still compile, even if the PTX instruction is not available.

#### Location of files

CuTe provides its Traits structs in the
[`include/cute/atom`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom)
directory, in header files starting with `mma_traits`.

#### Contents
@ -175,7 +175,7 @@ An `MMA_Traits` specialization defines the following public type aliases.

The specialization of MMA_Traits for the
`SM70_8x8x4_F32F16F16F32_NT` Operation lives in the header file
[`include/cute/atom/mma_traits_sm70.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/mma_traits_sm70.hpp).
It looks like this.

```c++
@ -254,7 +254,7 @@ Let us look at exactly how the 8 threads within a QP are mapped to the A, B and
<img src="../../images/cute/HMMA.8x8x4.quadpair.C.png" alt="HMMA.8x8x4.quadpair.C.png" height="400"/>
</p>

The metainformation of this single instruction level view is what we want to encode in CuTe. Specifically, the QP level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and the `(m, n)` coordinate in the C matrix, which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA.

We can start constructing a `CLayout` from the picture above. As with any CuTe layout, it is a pair of `Shape` and corresponding `Stride`. Let us just look at the shape for now. We know that the HMMA uses 8 threads, each of which owns 8 values. Therefore, the shape of our mapping must have a size of 8 along two modes. With this, we have
@ -1,7 +1,7 @@

# CuTe dense matrix-matrix multiply tutorial

In this section, we review
[these examples](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/),
which demonstrate a few self-contained, single-file dense matrix-matrix multiply implementations using only CuTe.

## `sgemm_1.cu`
@ -535,7 +535,7 @@ gett(int m0, int m1, int n, int k,
```
Note that the only changes are the definition of shape `M`, the definition of strides `dA` and `dC`, and the definition of the CTA Tiler `bM`. The above uses a multimodal problem shape `M = (m0,m1)` and a multimodal CTA Tiler `bM = <_64,_2>` to change which portion of the global memory tensors `A` and `C` each CTA will be responsible for computing.

Similar examples can be found for CUTLASS 3.x kernels that are based on CuTe, such as [this Hopper GETT example](https://github.com/NVIDIA/cutlass/tree/main/examples/51_hopper_gett).

## Copyright

media/docs/cpp/cute/index.rst (new file, 17 lines)
@ -0,0 +1,17 @@
.. _cpp_cute:

CuTe
====================

.. toctree::
    :maxdepth: 2

    00_quickstart<00_quickstart.md>
    01_layout<01_layout.md>
    02_layout_algebra<02_layout_algebra.md>
    03_tensor<03_tensor.md>
    04_algorithms<04_algorithms.md>
    0t_mma_atom<0t_mma_atom.md>
    0x_gemm_tutorial<0x_gemm_tutorial.md>
    0y_predication<0y_predication.md>
    0z_tma_tensors<0z_tma_tensors.md>
@ -1,5 +1,3 @@
|
||||
[README](../../README.md#documentation) > **CUTLASS 3.0 GEMM Backwards Compatibility**
|
||||
|
||||
# CUTLASS 3.0 GEMM Backwards Compatibility
|
||||
|
||||
Although CUTLASS 3.0 restructures the GEMM hierarchy and introduces new types for the
|
||||
@ -16,7 +14,7 @@ The entry point for CUTLASS's Device GEMM API
|
||||
is the class
|
||||
`cutlass::gemm::device::GemmUniversalAdapter`.
|
||||
This class lives in the header file
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
|
||||
`GemmUniversalAdapter` is a "universal adapter"
|
||||
and serves as a common device interface
|
||||
@ -89,7 +87,7 @@ and a collective epilogue.
|
||||
The entry point for CUTLASS's kernel API is the class
|
||||
`cutlass::gemm::kernel::GemmUniversal`.
|
||||
This class' declaration lives in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](/include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
|
||||
```c++
|
||||
/*
|
||||
@ -128,11 +126,11 @@ Each kernel layer schedule is specialized
|
||||
for a GEMM scheduling algorithm and GPU architecture.
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `gemm_*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](../../include/cutlass/gemm/kernel/).
|
||||
[include/cutlass/gemm/kernel/](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/).
|
||||
The specialization to which to dispatch is decided through the dispatch policy's `Schedule` type.
|
||||
|
||||
Specializations for 2.x APIs live in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](../../include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
|
||||
### Kernel API design differences
|
||||
|
||||
@ -204,7 +202,7 @@ if they wish to author custom mainloop code in the 3.x API.
|
||||
|
||||
Similarly, for the GEMM inner loops, `cute::MMA_Atom`s replace the
|
||||
`gemm::warp` and `gemm::thread` layer code. Going forward, all new PTX instructions
|
||||
and associated metadata development will occur directly inside [`cute/arch/*.hpp`](/include/cute/arch/) and [`cute/atom/*.hpp`](/include/cute/atom/).
|
||||
and associated metadata development will occur directly inside [`cute/arch/*.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/) and [`cute/atom/*.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/).
|
||||
|
||||
The desired inner loop MMA iteration order and tiling can be achieved through careful
|
||||
selection of the atom layout, value layout, and permutations of the `cute::TiledMma`.
|
||||
@ -212,7 +210,7 @@ selection of the atom layout, value layout, and permutations of the `cute::Tiled
|
||||
For epilogues, the `cutlass::epilogue::collective` layer replaces `cutlass::threadblock::collective`. However, the thread-level epilogue elementwise operations
|
||||
in `cutlass::epilogue::thread` will continue to be used in 3.x kernels as well, albeit with
|
||||
a more idiomatic epilogue vectorization strategy.
|
||||
[Example 50](/examples/50_hopper_gemm_with_epilogue_swizzle/50_hopper_gemm_with_epilogue_swizzle.cu)
|
||||
[Example 50](https://github.com/NVIDIA/cutlass/tree/main/examples/50_hopper_gemm_with_epilogue_swizzle/50_hopper_gemm_with_epilogue_swizzle.cu)
|
||||
shows how to use 2.x epilogue thread operators with 3.0 API kernels.
|
||||
|
||||
## Porting from 2.x to 3.0 API
|
||||
@ -271,7 +269,7 @@ For the matrix B,
|
||||
CUTLASS 2.x defines "layout tag" classes
|
||||
`cutlass::layout::ColumnMajor` and `cutlass::layout::RowMajor`,
|
||||
that live in the header file
|
||||
[`cutlass/layout/matrix.h`](/include/cutlass/layout/matrix.h).
|
||||
[`cutlass/layout/matrix.h`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/layout/matrix.h).
|
||||
The interpretation of these layouts in GEMM
|
||||
depends on whether they are applied
|
||||
to the input matrix A or B. For the matrix A, "column major" means
|
||||
@ -304,7 +302,7 @@ whether we are talking about the A or B matrix. M and N major inputs always have
|
||||
static size-1 stride in their 0th (outer) mode. Similarly, K major inputs
|
||||
always contain the static size-1 stride in their 1st mode. This uniformity in stride order
|
||||
allows us to represent tensor layouts much more cleanly and treat both A and B equally in our interfaces.
|
||||
See for example the following snippet from our [`kernel/sm70_gemm.hpp`](/include/cutlass/gemm/kernel/sm70_gemm.hpp)
|
||||
See for example the following snippet from our [`kernel/sm70_gemm.hpp`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm70_gemm.hpp)
|
||||
for Ampere kernel schedules.
|
||||
|
||||
```c++
|
||||
@ -352,7 +350,7 @@ dynamic stride modes corresponding to the minor mode and the batch mode. Batch
|
||||
mode is included by default as all CUTLASS 3.0 kernels support packed batch-mode GEMMs
|
||||
out of the box.
|
||||
|
||||
The [`cutlass/gemm/gemm.h#440`](../../include/cutlass/gemm/gemm.h#440)
|
||||
The [`cutlass/gemm/gemm.h#440`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/gemm.h#440)
|
||||
header file includes functions
|
||||
that can be useful for converting
|
||||
from CUTLASS 3.0 `cute::Stride`s back to CUTLASS 2.x layout tags.
|
||||
@ -375,7 +373,7 @@ these 2.x reflective types from an assembled kernel with a more stable API,
|
||||
the specialization of `cutlass::gemm::device::GemmUniversalAdapter`
|
||||
for CUTLASS 3.0 kernels provides aliases for all 2.x type aliases
|
||||
in addition to the layout tags. You can see how they are used in the header file
|
||||
[`cutlass/gemm/device/gemm_universal_adapter.h`](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
[`cutlass/gemm/device/gemm_universal_adapter.h`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
Here is an excerpt.
|
||||
|
||||
```c++
|
||||
@ -1,5 +1,3 @@
|
||||
[README](../../README.md#documentation) > **CUTLASS 3.0 Design and Hierarchy**
|
||||
|
||||
# CUTLASS 3.0 Design
|
||||
|
||||
CUTLASS 3.0 is a major enhancement over the abstractions of CUTLASS 2.x
|
||||
@ -29,7 +27,7 @@ CUTLASS 3.0 has the following design goals, in no particular order.
|
||||
CUTLASS 2.x decomposes the moving parts of a GEMM operation
|
||||
across a hierarchy that closely mirrors the organization of GPU
|
||||
architectures. This discussed in detail within the
|
||||
[CUTLASS 2.x GEMM API documentation](/media/docs/gemm_api.md).
|
||||
[CUTLASS 2.x GEMM API documentation](gemm_api.md).
|
||||
This design, however, sometimes results in a coupling that is too tight
|
||||
to extend to newer GPU features that might not fit into the same architectural
|
||||
hierarchy. For instance, Hopper's warp-group wide instructions do not naturally
|
||||
@ -46,7 +44,7 @@ with a consistent interface to hardware acceleration regardless of
|
||||
the architecture specific details.
|
||||
|
||||
The new conceptual GEMM hierarchy is discussed in detail in the dedicated
|
||||
[CUTLASS 3.0 GEMM API documentation readme](/media/docs/gemm_api_3x.md),
|
||||
[CUTLASS 3.0 GEMM API documentation readme](gemm_api_3x.md),
|
||||
along with code examples of the core concepts and types.
|
||||
|
||||
## Adoption of CuTe Layout and Tensors
|
||||
@ -55,9 +53,9 @@ CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tens
|
||||
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user.
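A small, hedged illustration of that packaging (shapes and values here are arbitrary, not taken from any CUTLASS kernel):

```c++
#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  float data[32] = {};
  // A 4x8 column-major layout; the Tensor couples the pointer with the layout
  // so that logical (m, n) coordinates are mapped to memory offsets for us.
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}));
  auto tensor = make_tensor(&data[0], layout);
  tensor(1, 2) = 3.0f;                // element at logical coordinate (1, 2)
  print(layout); print("\n");
}
```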
|
||||
|
||||
CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates, greatly simplifying the design,
|
||||
improving code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](/media/docs/cute/00_quickstart.md).
|
||||
improving code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](cute/00_quickstart.md).
|
||||
|
||||

|
||||

|
||||
|
||||
Programming massively parallel systems with various layers of logical thread and data hierarchies is not a trivial task.
|
||||
|
||||
@ -1,5 +1,3 @@
|
||||
[README](../../README.md#documentation) > **Dependent kernel launch**
|
||||
|
||||
# Dependent kernel launches
|
||||
|
||||
The Hopper and Blackwell architectures support a new feature through which two kernels in the same stream can
|
||||
@ -37,11 +35,11 @@ gemm.run(
|
||||
```
|
||||
## Model-Aware Optimizations with PDL
|
||||
|
||||
In [example 63](../../examples/63_hopper_gemm_with_weight_prefetch/README.md), we use PDL to explicitly optimize for
|
||||
In [example 63](https://github.com/NVIDIA/cutlass/tree/main/examples/63_hopper_gemm_with_weight_prefetch/README.md), we use PDL to explicitly optimize for
|
||||
performance of kernels where we know that one of the input matrices (our weights) will not be produced by a prior
|
||||
kernel. In that case, we only need to wait on the prior kernel's memory flush in order to load the other input matrix
|
||||
(our activations). During our prologue, we can prefetch our weights to improve performance for memory bandwidth-bound
|
||||
problem sizes. For more information, we refer the reader to [the example](../../examples/63_hopper_gemm_with_weight_prefetch/README.md).
|
||||
problem sizes. For more information, we refer the reader to [the example](https://github.com/NVIDIA/cutlass/tree/main/examples/63_hopper_gemm_with_weight_prefetch/README.md).
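Below is a hedged, stripped-down CUDA sketch of the ordering described above. It is not the CUTLASS implementation, the kernel and parameter names are invented for illustration, and it relies on the device-side programmatic dependent launch intrinsic available on Hopper-class GPUs.

```c++
__global__ void consumer_gemm_sketch(float const* weights, float const* activations) {
  // Weights are not written by the prior kernel, so they can be prefetched
  // (e.g., staged toward L2 or shared memory) before the dependency resolves.
  // ... issue prefetches of `weights` here ...

  // Block until the prior kernel's memory writes (the activations) are visible.
  cudaGridDependencySynchronize();

  // Now it is safe to read `activations` and run the mainloop and epilogue.
}
```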
|
||||
|
||||
## Copyright
|
||||
|
||||
@ -33,7 +33,7 @@ to CUTLASS 3.0, please refer to the
|
||||
For a code example showing how to define
|
||||
a GEMM kernel using CUTLASS, please refer to
|
||||
[the quickstart guide](./quickstart.md).
|
||||
The [`examples` directory](../../examples)
|
||||
The [`examples` directory](https://github.com/NVIDIA/cutlass/tree/main/examples)
|
||||
has a variety of examples.
|
||||
|
||||
# Copyright
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Efficient GEMM in CUDA**
|
||||

|
||||
|
||||
# Efficient GEMM in CUDA
|
||||
|
||||
@ -60,7 +58,7 @@ This is the hierarchical GEMM computation embodied by CUTLASS. Each stage depict
|
||||
nested level of tiling which corresponds to a layer of concurrency within the CUDA execution model and to a
|
||||
level within the memory hierarchy, becoming increasingly finer moving left to right.
|
||||
|
||||

|
||||

|
||||
|
||||
|
||||
### Threadblock-level GEMM
|
||||
@ -154,7 +152,7 @@ following scopes.
|
||||
|
||||
The following diagram illustrates the efficient, pipelined mainloop body used in CUTLASS GEMMs.
|
||||
|
||||

|
||||

|
||||
|
||||
### Threadblock Rasterization
|
||||
|
||||
@ -164,7 +162,7 @@ consecutively launched threadblocks to packed two-dimensional regions of the par
|
||||
problem to increase the probability that these will access the same tiles of global memory at
|
||||
approximately the same time.
|
||||
|
||||
Several functions are defined in [cutlass/gemm/threadblock_swizzle.h](../../include/cutlass/gemm/threadblock/threadblock_swizzle.h).
|
||||
Several functions are defined in [cutlass/gemm/threadblock_swizzle.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/threadblock/threadblock_swizzle.h).
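The sketch below illustrates the idea with a simple grouped mapping from a linear block index to a 2D tile coordinate. It is hedged and deliberately generic, not one of the CUTLASS swizzle functors, and the grouping factor is arbitrary.

```c++
#include <algorithm>
#include <cstdio>

// Map consecutive linear block IDs into group_m-row bands so that neighboring
// blocks reuse the same A/B tiles (grouped rasterization). Illustrative only.
void rasterize(int linear_id, int grid_m, int grid_n, int group_m,
               int& tile_m, int& tile_n) {
  int width     = group_m * grid_n;                     // blocks per band
  int band      = linear_id / width;
  int first_m   = band * group_m;
  int band_rows = std::min(group_m, grid_m - first_m);  // last band may be short
  int in_band   = linear_id % width;
  tile_m = first_m + in_band % band_rows;
  tile_n = in_band / band_rows;
}

int main() {
  for (int id = 0; id < 16; ++id) {
    int m, n;
    rasterize(id, /*grid_m=*/4, /*grid_n=*/4, /*group_m=*/2, m, n);
    std::printf("block %2d -> tile (%d, %d)\n", id, m, n);
  }
}
```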
|
||||
|
||||
|
||||
### Parallelized Reductions
|
||||
@ -226,26 +224,26 @@ to the Hopper kernel design. Blackwell SM100 kernels have a substantially differ
|
||||
however, the concept of separating out producer and consumer agents still applies.
|
||||
|
||||
Starting with Hopper, CUTLASS 3.0 incorporates the concept of [Warp Specialization](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#spatial-partitioning-also-known-as-warp-specialization)
|
||||
as part of the kernel design. A thread block is partitioned into two sets of warps, [*producer* warp group](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [*consumer* warp group](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp). The *producer* warp group loads data from global memory into shared memory buffers using the new [Tensor Memory Accelerator (TMA)](https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/).
|
||||
as part of the kernel design. A thread block is partitioned into two sets of warps, [*producer* warp group](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [*consumer* warp group](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp). The *producer* warp group loads data from global memory into shared memory buffers using the new [Tensor Memory Accelerator (TMA)](https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/).
|
||||
|
||||
[*Producer* warp group (DMA)](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) waits for the shared memory buffers to be signaled as [empty](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) by the *consumer* warp group using the newly added **Async Pipeline class** ([refer](pipeline.md)). Once the data is written into the shared memory, TMA also updates the barrier associated with that stage to notify affected threads that the buffer has been [filled](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp). The [*Consumer* warp group (MMA)](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp), on the other hand, waits for the *producer* warp group to signal that the buffer is [filled](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) and then launches tensor core MMA operations. Finally, the *consumer* warp group [releases](../../include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) the buffers for the next set of TMA loads to happen.
|
||||
[*Producer* warp group (DMA)](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) waits for the shared memory buffers to be signaled as [empty](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) by the *consumer* warp group using the newly added **Async Pipeline class** ([refer](pipeline.md)). Once the data is written into the shared memory, TMA also updates the barrier associated with that stage to notify affected threads that the buffer has been [filled](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp). The [*Consumer* warp group (MMA)](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp), on the other hand, waits for the *producer* warp group to signal that the buffer is [filled](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) and then launches tensor core MMA operations. Finally, the *consumer* warp group [releases](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) the buffers for the next set of TMA loads to happen.
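A hedged structural sketch of this role split follows. The names and the warp-group size of 128 threads are illustrative, and the synchronization itself is only indicated in comments, not implemented.

```c++
__global__ void warp_specialized_gemm_sketch() {
  int warp_group_idx = threadIdx.x / 128;   // 4 warps = 1 warp group

  if (warp_group_idx == 0) {
    // Producer (DMA) warp group:
    //   wait for a shared-memory stage to be signaled "empty",
    //   issue the TMA load for that stage,
    //   let TMA's barrier update mark the stage "full".
  }
  else {
    // Consumer (MMA) warp group(s):
    //   wait for a stage to be signaled "full",
    //   issue tensor core MMAs on it,
    //   release the stage back to the producer ("empty").
  }
}
```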
|
||||
|
||||
**Warp-Specialized Persistent Cooperative kernel design**
|
||||
|
||||
Another flavor of Warp-Specialized kernel design being introduced starting with Hopper is the [*Warp-Specialized Persistent Cooperative*](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp) kernel. Like the Warp-Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the cooperative design.
|
||||
Another flavor of Warp-Specialized kernel design being introduced starting with Hopper is the [*Warp-Specialized Persistent Cooperative*](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp) kernel. Like the Warp-Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the cooperative design.
|
||||
The distinctive features of the Warp-Specialized Persistent Cooperative kernel are the following:
|
||||
* Persistent thread blocks are launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](../../include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
|
||||
* Persistent thread blocks are launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
|
||||
* Presence of two *consumer* warp groups cooperating on the same output tile by splitting the tile in half across the M dimension. This allows larger tile sizes to be enabled - since the register pressure per *consumer* warp group is reduced - and hence improves performance.
|
||||
|
||||
Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks is managed using the new [*Tile Scheduler*](../../include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to launched thread blocks.
|
||||
Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to the thread blocks is managed using the new [*Tile Scheduler*](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to launched thread blocks.
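The persistent pattern can be summarized with the hedged sketch below; the struct and helper are invented placeholders, not the `sm90_tile_scheduler.hpp` interface.

```c++
struct WorkTileSketch { int m_idx, n_idx; bool valid; };

__device__ WorkTileSketch fetch_next_tile_sketch() {
  // Placeholder: a real scheduler derives this from blockIdx, the cluster
  // shape, and an atomically advanced global tile counter.
  return {0, 0, false};
}

__global__ void persistent_gemm_sketch() {
  // Each launched (persistent) thread block keeps asking for output tiles
  // until the scheduler reports that no work remains.
  for (WorkTileSketch work = fetch_next_tile_sketch(); work.valid;
       work = fetch_next_tile_sketch()) {
    // mainloop + epilogue for output tile (work.m_idx, work.n_idx) ...
  }
}
```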
|
||||
|
||||
**Warp-Specialized Persistent Ping-Pong kernel design**
|
||||
|
||||
The third kernel design is the [*Warp-Specialized Persistent Ping-Pong*](../../include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp) kernel.
|
||||
The third kernel design is the [*Warp-Specialized Persistent Ping-Pong*](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp) kernel.
|
||||
Like the Warp-Specialized Persistent Cooperative kernel, the concepts of warp groups, barrier synchronization between warp groups, and the shape of the grid launch remain the same in the persistent ping-pong design.
|
||||
The distinctive features of the Warp-Specialized Persistent Ping-Pong kernel are the following:
|
||||
* The two *consumer* warp groups are assigned a different output tile using the Tile Scheduler. This allows for *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization.
|
||||
* The *producer* warp group synchronizes using the [Ordered Sequence Barrier](../../include/cutlass/pipeline/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order.
|
||||
* The *producer* warp group synchronizes using the [Ordered Sequence Barrier](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/pipeline/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order.
|
||||
|
||||
# Resources
|
||||
|
||||
@ -1,17 +1,15 @@
|
||||

|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Functionality**
|
||||
|
||||
# Functionality
|
||||
|
||||
Note: CUTLASS-3 requires users to use CUDA 11.4 or newer, and SM70 or newer, for the target toolkit and architecture, respectively.
|
||||
Please refer to the [Compatibility](/README.md#Compatibility) section for more details.
|
||||
|
||||
- N - Column Major Matrix
|
||||
- T - Row Major matrix
|
||||
- {N,T} x {N,T} - All combinations, i.e., NN, NT, TN, TT
|
||||
- [NHWC](/include/cutlass/layout/tensor.h#L63-206) - 4 dimension tensor used for convolution
|
||||
- [NCxHWx](/include/cutlass/layout/tensor.h#L290-395) - Interleaved 4 dimension tensor used for convolution
|
||||
- [NHWC](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/layout/tensor.h#L63-206) - 4 dimension tensor used for convolution
|
||||
- [NCxHWx](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/layout/tensor.h#L290-395) - Interleaved 4 dimension tensor used for convolution
|
||||
- f - floating point
|
||||
- s - signed int
|
||||
- b - bit
|
||||
@ -32,48 +30,48 @@ Hyperlinks to relevant unit tests demonstrate how specific template instances ma
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------------|------------------|
|
||||
| **TensorOp** | 90a | 12.0+ | `f16 * f16 + { f16, f32 } => { f16, f32 }` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_cluster_warpspecialized.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `bf16 * bf16 + { f16, f32 } => { bf16, f32 }`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `{f32, tf32} * {f32, tf32} + f32 => f32`| { T } x { N } => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_f32_f32_f32_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/sm90_gemm_s8_s8_s8_tensor_op_s32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `f16 * f16 + { f16, f32 } => { f16, f32 }` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_cluster_warpspecialized.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `bf16 * bf16 + { f16, f32 } => { bf16, f32 }`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `{f32, tf32} * {f32, tf32} + f32 => f32`| { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_f32_f32_f32_tensor_op_f32.cu) |
|
||||
| **TensorOp** | 90a | 12.0+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_s8_s8_s8_tensor_op_s32.cu) |
|
||||
|
||||
|
||||
### CUTLASS 2.x Kernels
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------------|------------------|
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_sgemm_nt_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_dgemm_nt_sm50.cu) |
|
||||
| **Simt** | 60+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_hgemm_nt_sm50.cu) |
|
||||
| **Simt** | 61+ | 11.4+ | `s8 * s8 + s32 => {s32,s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/simt_igemm_nt_sm50.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f16_sm70.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f32_sm70.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s8t_wmma_tensor_op_s32_sm72.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s4t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_b1t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f16_sm70.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_bf16n_bf16t_bf16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf32n_cf32t_cf32t_tensor_op_tf32_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf64 * cf64 + cf64 => cf64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_sm80.cu), [Gaussian 3m](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_gaussian_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s8, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s4, s32}` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **TensorOp** | 90+ | 11.8+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/simt_sgemm_nt_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/simt_dgemm_nt_sm50.cu) |
|
||||
| **Simt** | 60+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/simt_hgemm_nt_sm50.cu) |
|
||||
| **Simt** | 61+ | 11.4+ | `s8 * s8 + s32 => {s32,s8}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/simt_igemm_nt_sm50.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f16_sm70.cu) |
|
||||
| **WmmaTensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16t_f16t_f16n_wmma_tensor_op_f32_sm70.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s8t_wmma_tensor_op_s32_sm72.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s4t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **WmmaTensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_b1t_b1n_b1t_wmma_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f16_sm70.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_volta_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_bf16n_bf16t_bf16t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32`| {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_cf32n_cf32t_cf32t_tensor_op_tf32_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `cf64 * cf64 + cf64 => cf64` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_sm80.cu), [Gaussian 3m](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_gaussian_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `bf16 * bf16 + f32 => {bf16, f32}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f32n_f32n_f32t_tensor_op_f32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s8, s32}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **SpTensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s4, s32}` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu) |
|
||||
| **TensorOp** | 90+ | 11.8+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) |
|
||||
|
||||
|
||||
## Device-level Implicit GEMM convolution
|
||||
@ -84,19 +82,19 @@ One can find and/or create equivalent dgrad and wgrad convolutional operators.
|
||||
|
||||
|**Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|
||||
|-----------------|------------------------|------------------|--------------------------------|------------------|------------------|
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm50.cu) |
|
||||
| **Simt** | 50+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
|
||||
| **TensorOp** | 70+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `f32 * f32 + f32 => f32` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
|
||||
| **Simt** | 80+ | 11.4+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `f16 * f16 + f16 => f16` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
| **TensorOp** | 80+ | 11.4+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |
|
||||
|
||||
|
||||
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Fundamental Types**
|
||||

|
||||
|
||||
# Fundamental Types
|
||||
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **CUTLASS GEMM API**
|
||||

|
||||
|
||||
# CUTLASS GEMM API
|
||||
|
||||
@ -69,7 +67,7 @@ thread-level concurrency.
|
||||
This loop nest is expressed in CUTLASS via the following components which are specialized for data type, layout, and
|
||||
math instruction.
|
||||
|
||||

|
||||

|
||||
|
||||
These components are described in the following sections.
|
||||
|
||||
@ -80,10 +78,10 @@ GEMM computation across the GPU. This operator is intended to be used in host-si
|
||||
has semantics similar to cuBLAS.
|
||||
|
||||
The device-wide GEMM API is embodied by the following operators:
|
||||
- [cutlass::gemm::device::Gemm](/include/cutlass/gemm/device/gemm.h) - basic GEMM operation
|
||||
- [cutlass::gemm::device::GemmArray](/include/cutlass/gemm/device/gemm_array.h) - batched GEMM operation in which input matrices are read from arrays of pointers
|
||||
- [cutlass::gemm::device::GemmBatched](/include/cutlass/gemm/device/gemm_batched.h) - batched GEMM operation in which input matrices are separated by a constant stride
|
||||
- [cutlass::gemm::device::GemmSplitKParallel](/include/cutlass/gemm/device/gemm_splitk_parallel.h) - GEMM operation that partitions the GEMM K dimension then launches a separate reduction kernel
|
||||
- [cutlass::gemm::device::Gemm](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm.h) - basic GEMM operation
|
||||
- [cutlass::gemm::device::GemmArray](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_array.h) - batched GEMM operation in which input matrices are read from arrays of pointers
|
||||
- [cutlass::gemm::device::GemmBatched](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_batched.h) - batched GEMM operation in which input matrices are separated by a constant stride
|
||||
- [cutlass::gemm::device::GemmSplitKParallel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_splitk_parallel.h) - GEMM operation that partitions the GEMM K dimension then launches a separate reduction kernel
|
||||
|
||||
**Example:** launch a mixed-precision GEMM targeting Volta Tensor Cores.
|
||||
```c++
|
||||
@ -127,14 +125,14 @@ GEMMs at this scope are expected to efficiently load tiles of data from global m
|
||||
products with warp-level GEMM operators.
|
||||
|
||||
The threadblock-scoped matrix multiply operation is embodied by
|
||||
[cutlass::gemm::threadblock::MmaPipelined](/include/cutlass/gemm/threadblock/mma_pipelined.h).
|
||||
[cutlass::gemm::threadblock::MmaPipelined](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/threadblock/mma_pipelined.h).
|
||||
This is a class inspired by [std::transform_reduce()](https://en.cppreference.com/w/cpp/algorithm/transform_reduce)
|
||||
which computes the accumulated matrix product of a range of tiles defined by tile iterators.
|
||||
|
||||

|
||||

|
||||
|
||||
In the case of GEMM, the tile iterators are
|
||||
[cutlass::transform::threadblock::PredicatedTileIterator](/include/cutlass/transform/threadblock/predicated_tile_iterator.h)
|
||||
[cutlass::transform::threadblock::PredicatedTileIterator](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/transform/threadblock/predicated_tile_iterator.h)
|
||||
to traverse a sequence of tiles in global memory with appropriate predication to avoid out-of-bounds
|
||||
memory accesses.
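The predication idea itself can be sketched in a few lines; this is a hedged scalar illustration, not the `PredicatedTileIterator` interface.

```c++
// Copy one tile of a 1-D array, guarding each access so that a partial tile
// at the boundary never reads out of bounds (out-of-bounds elements become 0).
void copy_tile_predicated(float const* src, float* dst,
                          int tile_offset, int tile_size, int extent) {
  for (int i = 0; i < tile_size; ++i) {
    int coord = tile_offset + i;
    dst[i] = (coord < extent) ? src[coord] : 0.0f;   // predicated load
  }
}
```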
|
||||
|
||||
@ -213,14 +211,14 @@ The warp-level GEMM API is a generalization of CUDA's WMMA API to achieve the fo
|
||||
|
||||
Defining a warp-level matrix multiply in CUTLASS is similar to WMMA as shown below.
|
||||
|
||||

|
||||

|
||||
|
||||
The usage model is also similar. The following example computes a warp-level GEMM operation,
|
||||
accumulating a series of matrix products in a register-backed array. The input to a warp-level
|
||||
GEMM operation in CUTLASS _must_ be data in shared memory loaded by iterators or in
|
||||
register-backed fragments.
|
||||
|
||||

|
||||

|
||||
|
||||
```c++
|
||||
#include "cutlass/gemm/warp/default_mma_tensor_op.h"
|
||||
@ -513,8 +511,8 @@ column-major GEMM, operands A & B are transposed and swapped.
|
||||
To enable efficient row-major epilogue for both row-major and column-major output layout,
|
||||
CUTLASS' device-level GEMM operators `cutlass::device::Gemm` and `cutlass::device::GemmUniversal`
|
||||
provide two template definitions:
|
||||
- (a) [General definition](/include/cutlass/gemm/device/gemm.h#L217)
|
||||
- (b) [Specialized definition for column-major source/output](/include/cutlass/gemm/device/gemm.h#L545)
|
||||
- (a) [General definition](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm.h#L217)
|
||||
- (b) [Specialized definition for column-major source/output](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm.h#L545)
|
||||
|
||||
Efficient row-major epilogue for:
|
||||
- (i) GEMM operator on row-major source/output uses template (a). It runs row-major GEMM and
|
||||
@ -536,8 +534,8 @@ of input layouts. Thus, CUTLASS supports the following layout combinations for i
|
||||
CUTLASS defines a template-based interface to Tensor Core operations to avoid resorting
|
||||
to inline PTX.
|
||||
|
||||
- [mma_sm70.h](/include/cutlass/arch/mma_sm70.h) - Volta TensorCore operations
|
||||
- [mma_sm75.h](/include/cutlass/arch/mma_sm75.h) - Turing TensorCore operations
|
||||
- [mma_sm70.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/mma_sm70.h) - Volta TensorCore operations
|
||||
- [mma_sm75.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/mma_sm75.h) - Turing TensorCore operations
|
||||
|
||||
|
||||
# Copyright
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **CUTLASS 3.0 GEMM API**
|
||||

|
||||
|
||||
# CUTLASS 3.0 GEMM API
|
||||
|
||||
@ -71,7 +69,7 @@ is implied by CUDA grid launch semantics.
|
||||
However, for persistent kernels,
|
||||
these three loops are expressed in the source code
|
||||
as a single `while` loop that queries the
|
||||
[work tile scheduler](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp)
|
||||
[work tile scheduler](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp)
|
||||
for problem tiles on which to compute.
|
||||
|
||||
Inside the three nested `for` loops,
|
||||
@ -112,7 +110,7 @@ in order to assemble a kernel. This order is
|
||||
|
||||
3. wrap up the kernel with a device layer adapter.
|
||||
|
||||
This order is also reflected in the [CUTLASS 3.0 Hopper kernel examples](/examples/48_hopper_warp_specialized_gemm) as seen in the excerpt below.
|
||||
This order is also reflected in the [CUTLASS 3.0 Hopper kernel examples](https://github.com/NVIDIA/cutlass/tree/main/examples/48_hopper_warp_specialized_gemm) as seen in the excerpt below.
|
||||
|
||||
```c++
|
||||
// Step 1: Generate the required collective layer mainloop specialization
|
||||
@ -208,7 +206,7 @@ Any looping over multiple tiles that
|
||||
the algorithm might need to do would happen here.
|
||||
|
||||
The `CollectiveMma` class is declared in the header
|
||||
[cutlass/gemm/collective/collective_mma.hpp](/include/cutlass/gemm/collective/collective_mma.hpp).
|
||||
[cutlass/gemm/collective/collective_mma.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/collective_mma.hpp).
|
||||
|
||||
```c++
|
||||
namespace cutlass::gemm::collective {
|
||||
@ -328,7 +326,7 @@ all operations that conceptually belong to the same class. This design has the f
|
||||
The primary `CollectiveMma` is intended to be an expert user interface that allows full control over
|
||||
all the properties of the collective's GPU micro-kernel. However, often a user just wants an
|
||||
off-the-shelf GEMM mainloop implementation parameterized on simple configuration parameters. CUTLASS 3.0
|
||||
provides [`cutlass::gemm::collective::CollectiveBuilder`](/include/cutlass/gemm/collective/collective_builder.hpp) for such scenarios.
|
||||
provides [`cutlass::gemm::collective::CollectiveBuilder`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/collective_builder.hpp) for such scenarios.
|
||||
|
||||
```c++
|
||||
namespace cutlass::gemm::collective {
|
||||
@ -382,7 +380,7 @@ may also change in the future as we adopt user feedback.
|
||||
|
||||
If the builder is able to provide a collective mainloop type for the given set of parameters,
|
||||
it will be aliased within as `CollectiveOp`. For more information on how to
|
||||
parameterize kernels conveniently with the collective builder, please see example [49_hopper_gemm_with_collective_builder](/examples/49_hopper_gemm_with_collective_builder).
|
||||
parameterize kernels conveniently with the collective builder, please see example [49_hopper_gemm_with_collective_builder](https://github.com/NVIDIA/cutlass/tree/main/examples/49_hopper_gemm_with_collective_builder).
|
||||
|
||||
### Epilogue
|
||||
|
||||
@ -390,7 +388,7 @@ The collective epilogue implements element-wise operations
|
||||
involving the output matrix. Users can provide a custom
|
||||
epilogue, or use one of the standard epilogues.
|
||||
These live in the directory
|
||||
[include/cutlass/epilogue/collective/](/include/cutlass/epilogue/collective/),
|
||||
[include/cutlass/epilogue/collective/](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/),
|
||||
and include classes like
|
||||
`cutlass::epilogue::collective::DefaultEpilogue`
|
||||
and
|
||||
@ -418,7 +416,7 @@ epilogues, and/or other operations.
|
||||
|
||||
The entry point API for CUTLASS 3.0 kernel is the class
|
||||
`cutlass::gemm::kernel::GemmUniversal`, found in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](/include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
[include/cutlass/gemm/kernel/gemm_universal.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemm_universal.hpp).
|
||||
`GemmUniversal` is a stateless universal device kernel
|
||||
that implements GEMM as the composition of two parts:
|
||||
|
||||
@ -478,24 +476,24 @@ We will explain *collective* in more detail below.
|
||||
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `gemm_*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](/include/cutlass/gemm/kernel/).
|
||||
[include/cutlass/gemm/kernel/](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/).
|
||||
Specializations for 2.x APIs can be found in the header file
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](/include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
[include/cutlass/gemm/kernel/gemm_universal.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemm_universal.h).
|
||||
|
||||
CUTLASS 3.x implements various embodiments of `kernel::GemmUniversal`.
|
||||
Each kernel layer schedule is specialized
|
||||
for a GEMM scheduling algorithm and GPU architecture.
|
||||
Specializations of `kernel::GemmUniversal` for 3.0 APIs live in
|
||||
any of various `include/cutlass/gemm/kernel/{arch_tag}*.hpp` files in the directory
|
||||
[include/cutlass/gemm/kernel/](/include/cutlass/gemm/kernel/).
|
||||
[include/cutlass/gemm/kernel/](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/).
|
||||
Which specialization to dispatch to is decided through the dispatch policy's `Schedule` type.
|
||||
|
||||
For example, the header file
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp)
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp)
|
||||
has a specialization of `kernel::GemmUniversal` for Hopper
|
||||
that uses a warp-specialized mainloop with a persistent scheduling algorithm,
|
||||
while the header file
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp)
|
||||
[include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp)
|
||||
has a specialization of `GemmUniversal` for Hopper
|
||||
that uses a warp-specialized but non-persistent algorithm.
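In practice, the choice is typically made by passing a kernel schedule tag when building the collective. The two aliases below are a hedged sketch of selecting the persistent ping-pong and non-persistent schedules; the tag names are assumed to be those in `cutlass/gemm/dispatch_policy.hpp` and are used here purely for illustration.

```c++
#include "cutlass/gemm/dispatch_policy.hpp"

// Picking a schedule tag steers the CollectiveBuilder (and thus which
// kernel::GemmUniversal specialization is ultimately instantiated).
using PingpongSchedule      = cutlass::gemm::KernelTmaWarpSpecializedPingpong;
using NonPersistentSchedule = cutlass::gemm::KernelTmaWarpSpecialized;
```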
|
||||
|
||||
@ -536,7 +534,7 @@ It serves the same purpose as cuBLAS and behaves similarly.
|
||||
The entry point for the Device GEMM API is the class
|
||||
`cutlass::gemm::device::GemmUniversalAdapter`.
|
||||
This class lives in the header file
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
[include/cutlass/gemm/device/gemm_universal_adapter.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_universal_adapter.h).
|
||||
`GemmUniversalAdapter` is a stateful, reusable handle,
|
||||
which is parameterized on the `cutlass::gemm::kernel` type.
|
||||
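As a rough sketch of the flow through this handle (here `GemmKernel` stands in for a fully specified `cutlass::gemm::kernel` type, for example one composed from the collective layers described above, and the argument values are placeholders rather than a complete configuration):

```c++
// Hedged sketch of driving a kernel through the device-layer handle.
// `GemmKernel` is assumed to be a fully specified cutlass::gemm::kernel type.
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

Gemm gemm_op;                                   // stateful, reusable handle
typename Gemm::Arguments args{ /* problem shape, pointers, epilogue params */ };

cutlass::Status status = gemm_op.can_implement(args);   // feasibility check
if (status == cutlass::Status::kSuccess) {
  status = gemm_op.initialize(args);            // optionally pass a workspace pointer
  status = gemm_op.run();                       // launch on the default stream
}
```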
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Grouped Kernel Schedulers**
|
||||

|
||||
|
||||
# CUTLASS Grouped Kernel Schedulers
|
||||
|
||||
@ -59,12 +57,12 @@ Consider, for example, the threadblock-to-tile mapping that occurs for a group o
|
||||
each consisting of a grid of 2x2 tiles. Suppose that eight threadblocks are launched. The
|
||||
figure below illustrates the threadblock ID assigned to each tile in each GEMM in the group.
|
||||
|
||||

|
||||

|
||||
|
||||
A similar mapping for problems that do not have the same number of tiles
|
||||
is shown below:
|
||||
|
||||

|
||||

|
||||
|
||||
## Computing the schedule for a given block
|
||||
Each threadblock in the grouped GEMM computes its own schedule by calling
|
||||
@ -114,7 +112,7 @@ of a grid of 2x2 tiles. Matrix C in each problem is lower triangular, indicated
|
||||
shaded tiles. Consider that eight threadblocks are launched to compute the grouped
|
||||
problem. The default grouped GEMM scheduler will assign threadblocks to tiles in the following order:
|
||||
|
||||

|
||||

|
||||
|
||||
In this case, threadblocks 1 and 5 are continuously assigned to inactive tiles. In
|
||||
scenarios in which problems within the group have varying size, we have observed
|
||||
@ -129,7 +127,7 @@ lower-triangular problem (and vice-versa for upper-triangular problems).
|
||||
Using the example above, the resulting assignment of threadblocks to tiles from
|
||||
such a scheduler might be:
|
||||
|
||||

|
||||

|
||||
|
||||
Achieving this schedule requires mapping from a threadblock ID to tile coordinates
|
||||
`(i, j)`.
|
||||
@ -139,7 +137,7 @@ first calculate row and column indices assuming one-indexed rows, tiles, and
|
||||
threadblock IDs, and then subtract one to convert to zero-indexed versions. Our
|
||||
description borrows heavily from the mapping described [here](https://stackoverflow.com/a/40954159).
|
||||
|
||||

|
||||

|
||||
|
||||
### Calculating row `i` given threadblock ID `t`
|
||||
For a given row `i`, all threadblock IDs `t` in that row satisfy the following:
|
||||
@ -199,7 +197,7 @@ each of which contains 2 "true tiles." We can thus first map a threadblock ID to
|
||||
using the equations above, and then map it to the "true tile" within its "macro tile." In the example
|
||||
of a 2x4 grid, this mapping would look as follows:
|
||||
|
||||

|
||||

|
||||
|
||||
A zero-indexed threadblock ID `t` is mapped to its "macro tile ID" `t_macro` as:
|
||||
```
|
||||
@ -245,7 +243,7 @@ The only modification needed for upper-triangular matrices is to swap `i_macro`
|
||||
# Scheduler modes
|
||||
The grouped kernel schedulers come with two different modes for finding
|
||||
the next tile for a block to compute. These techniques are controlled by
|
||||
the [`cutlass::gemm::kernel::GroupScheduleMode`](../../include/cutlass/gemm/kernel/grouped_problem_visitor.h) enum.
|
||||
the [`cutlass::gemm::kernel::GroupScheduleMode`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/grouped_problem_visitor.h) enum.
|
||||
We describe each mode in greater detail below.
|
||||
|
||||
## `GroupScheduleMode::kDeviceOnly` (default)
|
||||
@ -340,7 +338,7 @@ Thus, there are 216 tiles across the group.
|
||||
Suppose this grouped GEMM is run on GA100, which has 108 SMs. Suppose that
|
||||
the occupancy given the parameters of the grouped GEMM is one -- one threadblock
|
||||
can be active at a time on an SM. The grouped GEMM will, thus, run with 108
|
||||
persistent threadblocks, each of which computes (216 / 108) = 2 tiles.
|
||||
persistent threadblocks, each of which computes (216 / 108) = 2 tiles.
|
||||
|
||||
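A minimal sketch of how each persistent threadblock visits its tiles in this scheme (the variable names here are illustrative, not the scheduler's actual members):

```c++
// Each persistent threadblock visits every grid_size-th tile in the group.
// With 216 tiles and 108 threadblocks, each block computes exactly two tiles.
for (int tile_idx = block_idx; tile_idx < total_tiles; tile_idx += grid_size) {
  // The problem visitor maps tile_idx to (problem index, tile m, tile n),
  // and the mainloop then computes that output tile.
}
```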
Under the round-robin assignment of tiles to threadblocks employed by
|
||||
the grouped GEMM scheduler, the assignment of tiles to threadblocks
|
||||
@ -379,7 +377,7 @@ scheduling mode by around 30%.
|
||||
|
||||
To ease the process of sorting groups and their associated metadata in this
|
||||
manner, the device-level grouped kernels provide a `sort_problems()` method.
|
||||
An example of how to use this may be found in the [grouped GEMM example](../../examples/24_gemm_grouped/gemm_grouped.cu).
|
||||
An example of how to use this may be found in the [grouped GEMM example](https://github.com/NVIDIA/cutlass/tree/main/examples/24_gemm_grouped/gemm_grouped.cu).
|
||||
|
||||
Finally, while sorting problems can be helpful in certain scenarios, it is
|
||||
not guaranteed to improve performance. In some cases, performance can
|
||||
@ -1,5 +1,3 @@
|
||||
[README](../../README.md#documentation) > **IDE Setup for CUTLASS Development**
|
||||
|
||||
# IDE Setup for CUTLASS Development
|
||||
|
||||
This document outlines instructions and tips for setting up a local editor for CUTLASS development, including support
|
||||
@ -33,7 +31,7 @@ and you might see faster responses and more stable performance with clangd.
|
||||
* ...others, depending on which files you edit
|
||||
1. Edit C++ standard to be `c++17`, `gnu++17`, or equivalent.
|
||||
1. Edit `defines` to define preprocessor variables. See
|
||||
[Global Config below](#Global-Config) for examples. The important
|
||||
[Global Config below](#global-config) for examples. The important
|
||||
ones include `__CUDACC_VER_MAJOR__`, `__CUDA_ARCH__`, `__CUDA_ARCH_FEAT_SM90_ALL__`. But configure
|
||||
them according to your target architecture.
|
||||
1. ...and possibly edit any other fields for your specific setup.
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Implicit GEMM Convolution**
|
||||

|
||||
|
||||
# CUTLASS Convolution
|
||||
|
||||
@ -55,7 +53,7 @@ f(p, r) = p * stride_h + R - r - 1 + pad_h
|
||||
g(q, s) = q * stride_w + S - s - 1 + pad_w
|
||||
```
|
||||
|
||||
A [host](/tools/util/include/cutlass/util/reference/host/convolution.h) and [device](/tools/util/include/cutlass/util/reference/device/convolution.h)
|
||||
A [host](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/reference/host/convolution.h) and [device](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/reference/device/convolution.h)
|
||||
reference implementation are provided in the CUTLASS Utilities.
|
||||
|
||||
This computation may be mapped to the elements of a matrix product as follows.
|
||||
@ -145,7 +143,7 @@ for (int gemm_i = 0; gemm_i < GEMM_M; ++gemm_i) {
|
||||
}
|
||||
}
|
||||
```
|
||||
The [CUTLASS GEMM implementation](/media/docs/efficient_gemm.md) explicitly iterates over tiles. Consequently,
|
||||
The [CUTLASS GEMM implementation](efficient_gemm.md) explicitly iterates over tiles. Consequently,
|
||||
a tile iterator could be implemented to compute these functions analytically and load the appropriate
|
||||
elements. However, the resulting modulo arithmetic would be computationally intensive, and overhead would
|
||||
limit performance of a GEMM kernel targeting Turing Tensor Cores.
|
||||
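To make that cost concrete, a purely analytic mapping from GEMM coordinates back to tensor coordinates looks roughly like the following sketch (NHWC fprop; the variable names are illustrative, and the exact ordering of the r, s, c factors may differ from the iterators' internal convention):

```c++
// Naive analytic mapping for implicit GEMM fprop: every access pays for a
// chain of integer divisions and modulos.
int q = gemm_m % Q;             // output column
int p = (gemm_m / Q) % P;       // output row
int n =  gemm_m / (P * Q);      // batch index

int c = gemm_k % C;             // input channel
int s = (gemm_k / C) % S;       // filter column
int r =  gemm_k / (C * S);      // filter row

int h = p * stride_h - pad_h + r * dilation_h;   // input row
int w = q * stride_w - pad_w + s * dilation_w;   // input column
```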
@ -169,7 +167,7 @@ This enables 128-bit vector memory acceses which lead to efficient CUDA kernels.
|
||||
CUTLASS defines CUDA C++ templates accepting numerous template arguments to specialize the resulting
|
||||
kernel by operation, data type, tile configuration, math instruction, and fused output operation.
|
||||
|
||||
In [turing_tensorop_conv2dfprop.cu](/examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu), a convolution
|
||||
In [turing_tensorop_conv2dfprop.cu](https://github.com/NVIDIA/cutlass/tree/main/examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu), a convolution
|
||||
operation is defined as follows.
|
||||
|
||||
```c++
|
||||
@ -232,7 +230,7 @@ Internal accumulation is performed using 32-bit integers (`int32_t`), and an ele
|
||||
is performed on the output in single-precision floating point (`float`).
|
||||
|
||||
The threadblock and warp-level tile shapes refer to the hierarchically blocked GEMM computation
|
||||
[described here](/media/docs/gemm_api.md). Larger tiles achieve greater reuse of data loaded through shared memory
|
||||
[described here](gemm_api.md). Larger tiles achieve greater reuse of data loaded through shared memory
|
||||
but launch fewer CTAs and may not fully occupy the GPU for small problem sizes. Smaller tile configurations achieve
|
||||
lower peak utilizations but may better match the number of SMs within the GPU for real-world workloads.
|
||||
|
||||
@ -318,7 +316,7 @@ if (status != cutlass::Status::kSuccess) {
|
||||
```
|
||||
|
||||
The example demonstrates how the input and output tensors may be written to a file as CSV using
|
||||
`cutlass::HostTensor<>` defined in the [CUTLASS Utilities](/media/docs/utilities.md).
|
||||
`cutlass::HostTensor<>` defined in the [CUTLASS Utilities](utilities.md).
|
||||
|
||||
```c++
|
||||
std::ofstream output_workspace(ss.str());
|
||||
@ -339,41 +337,41 @@ The example demonstrates how the input and output tensors may be written to a fi
|
||||
CUTLASS defines the following CUDA C++ templates to implement Implicit GEMM Convolution which are described in greater detail in subsequent sections.
|
||||
|
||||
**Activations tile iterators** load the activations tile into registers. Two implementations are provided:
|
||||
- [conv2d_fprop_activation_tile_access_iterator_analytic.h](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h) computes pointer deltas and masks analytically
|
||||
- [conv2d_fprop_activation_tile_access_iterator_optimized.h](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h) optimizes iterating over global memory and
|
||||
- [conv2d_fprop_activation_tile_access_iterator_analytic.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h) computes pointer deltas and masks analytically
|
||||
- [conv2d_fprop_activation_tile_access_iterator_optimized.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h) optimizes iterating over global memory and
|
||||
creating GEMM-A tile in shared memory.
|
||||
|
||||
**Filter tile iterators** load filters into registers. Similarly, two implementations are provided:
|
||||
- [conv2d_fprop_filter_tile_access_iterator_analytic.h](/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h) computes pointer deltas and masks analytically
|
||||
- [conv2d_fprop_filter_tile_access_iterator_optimized.h](/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h) optimizes iterating over global memory and
|
||||
- [conv2d_fprop_filter_tile_access_iterator_analytic.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h) computes pointer deltas and masks analytically
|
||||
- [conv2d_fprop_filter_tile_access_iterator_optimized.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h) optimizes iterating over global memory and
|
||||
creating GEMM-B tile in shared memory.
|
||||
|
||||
The improvements covered by optimized iterators are:
|
||||
|
||||
a. Precomputing kernel-invariant pointer deltas on the host
|
||||
b. Computing cta-invariant mask predicates on device-side iterator ctors
|
||||
c. Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.
|
||||
c. Use of [fast divmod](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.
|
||||
|
||||
For example, an _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ.
|
||||
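As a rough sketch of what that looks like (assuming the `cutlass::FastDivmod` helper from `include/cutlass/fast_math.h`, which precomputes magic numbers so the divisions below avoid hardware integer division; the wrapper struct itself is illustrative):

```c++
#include "cutlass/fast_math.h"

// Decompose a GEMM M index into the output coordinate (n, p, q).
struct NpqDecomposer {
  cutlass::FastDivmod divmod_pq;   // divides by P * Q
  cutlass::FastDivmod divmod_q;    // divides by Q

  NpqDecomposer(int P, int Q) : divmod_pq(P * Q), divmod_q(Q) {}

  void operator()(int gemm_m, int &n, int &p, int &q) const {
    int residual;
    divmod_pq(n, residual, gemm_m);   // n = gemm_m / (P*Q), residual = gemm_m % (P*Q)
    divmod_q(p, q, residual);         // p = residual / Q,   q = residual % Q
  }
};
```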
|
||||
**Pipelined mainloop** loads threadblock-scoped tiles from global memory into shared memory and then applies
|
||||
CUTLASS warp-level GEMM operations to load from Shared Memory and issue instructions to Turing Tensor Cores.
|
||||
- [mma_pipelined.h](/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h)
|
||||
- [mma_pipelined.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h)
|
||||
|
||||
Operations for storing to shared memory and performing warp-wide matrix multiply operations using
|
||||
Turing Tensor Cores are applied directly from the CUTLASS GEMM components. These include the
|
||||
following components.
|
||||
|
||||
**Regular Tile Iterator** implemented in
|
||||
[transform::threadblock::RegularTileIterator](/include/cutlass/transform/threadblock/regular_tile_iterator.h)
|
||||
[transform::threadblock::RegularTileIterator](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/transform/threadblock/regular_tile_iterator.h)
|
||||
stores register-backed fragments to Shared Memory in permuted layouts.
|
||||
|
||||
**Warp-level GEMM** defined in [cutlass::gemm::warp::MmaTensorOp](/include/cutlass/gemm/warp/mma_tensor_op.h)
|
||||
**Warp-level GEMM** defined in [cutlass::gemm::warp::MmaTensorOp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/warp/mma_tensor_op.h)
|
||||
defines tile iterators to load from Shared Memory and issue math instructions to Turing Tensor Cores.
|
||||
Further details are [described in here](/media/docs/gemm_api.md#warp-level-matrix-multiply-api).
|
||||
Further details are [described in here](gemm_api.md#warp-level-matrix-multiply-api).
|
||||
|
||||
**Epilogue** reorders accumulator elements among threads within a threadblock to efficiently update
|
||||
the output tensor. It is implemented in [epilogue::threadblock::Epilogue](/include/cutlass/epilogue/threadblock/epilogue.h).
|
||||
the output tensor. It is implemented in [epilogue::threadblock::Epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/epilogue.h).
|
||||
|
||||
### Loading Activations and Filters
|
||||
|
||||
@ -383,7 +381,7 @@ of channels. After iterating over all filter positions, the convolution algorith
|
||||
next interval of channels and proceeds from filter `r=0, s=0`.
|
||||
|
||||
The matrix product of one threadblock tile is computed per iteration of
|
||||
the mainloop as described in the [CUTLASS GEMM implementation](/media/docs/efficient_gemm.md). To
|
||||
the mainloop as described in the [CUTLASS GEMM implementation](efficient_gemm.md). To
|
||||
summarize, the threadblock tiles of activations and filters are loaded from tensors in global memory
and stored to shared memory. Each thread within the threadblock loads one or more vectors that
collectively span the entire tile.
|
||||
@ -394,9 +392,9 @@ Filters tensors. Each index in the GEMM _M_ dimension corresponds to a unique _(
|
||||
index of the output tensor, and pointers may be computed based on this as well as
|
||||
filter position _(r,s)_.
|
||||
|
||||

|
||||

|
||||
|
||||
The CUTLASS component that embodies this functionality is [Conv2dFpropActivationTileAccessIteratorAnalytic](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h).
The CUTLASS component that embodies this functionality is [Conv2dFpropActivationTileAccessIteratorAnalytic](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h).
Its constructor computes the mapping of GEMM _M_ to _(N, P, Q)_; the `at()` method maps the linear offset into the Activations
tensor for each memory access the thread is to perform. Additionally, the method `valid()` computes the validity of the access
for each filter position and for each memory access to indicate whether the memory access will be within the bounds of the
|
||||
@ -456,11 +454,11 @@ void advance() {
|
||||
}
|
||||
```
|
||||
|
||||
Similar logic holds for [Conv2dFpropFilterTileAccessIteratorAnalytic](/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h).
|
||||
Similar logic holds for [Conv2dFpropFilterTileAccessIteratorAnalytic](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h).
|
||||
|
||||
To reduce computational overhead in the mainloop body, the pointer offsets may be precomputed
|
||||
in host code and provided to the CUDA kernel as a lookup table in its `Params` structure.
|
||||
As shown in [Conv2dFpropActivationTileAccessIteratorOptimized](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h),
As shown in [Conv2dFpropActivationTileAccessIteratorOptimized](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h),
|
||||
the logic to compute offsets from filter position has been extracted to the `Params` constructor.
|
||||
|
||||
```c++
|
||||
@ -535,11 +533,11 @@ threads within a warp. The following operations are supported.
|
||||
Functionally, the Turing 8x8x32 matrix multiply operation distributes the _A_, _B_, and _C_ matrix across 32
|
||||
threads within a warp according to the following illustration.
|
||||
|
||||

|
||||

|
||||
|
||||
This Tensor Core operation is accessible to the CUDA programmer via the PTX instruction
|
||||
[`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-8832).
|
||||
CUTLASS wraps inline PTX with device-side intrinsics defined in [`cutlass/arch/mma_sm75.h`](/include/cutlass/arch/mma_sm75.h)
|
||||
CUTLASS wraps inline PTX with device-side intrinsics defined in [`cutlass/arch/mma_sm75.h`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/mma_sm75.h)
|
||||
as in the following example.
|
||||
|
||||
```c++
|
||||
@ -565,7 +563,7 @@ per row.
|
||||
The arrangement of SMEM pointers and destination registers within threads is illustrated as follows. Thread 0 is highlighted
|
||||
in the illustration to emphasize the mapping.
|
||||
|
||||

|
||||

|
||||
|
||||
The size of the Turing Tensor Core operation computing matrix multiply-accumulate on INT4 data is 8-by-8-by-32
|
||||
elements. `ldmatrix` fetches up to 32 rows (or columns) per operation. Sixteen Tensor Core operations may be issued
|
||||
@ -574,7 +572,7 @@ as shown in the following figure. Larger tiles are possible by increasing the nu
|
||||
and issuing more Tensor Core operations, up to warp-level matrix operations of size 64-by-64-by-32. The limit is
|
||||
the number of registers to hold the accumulator elements.
|
||||
|
||||

|
||||

|
||||
|
||||
### Shared Memory Layouts
|
||||
|
||||
@ -588,7 +586,7 @@ load from Shared Memory using `ldmatrix`. The following figure illustrates the t
|
||||
the loading the activations and filters threadblock tiles from global memory and the permuted layout in
|
||||
Shared Memory.
|
||||
|
||||

|
||||

|
||||
|
||||
In the illustration, one warp-wide memory access is highlighted in blue, with individual threads
|
||||
loading one 128-bit vector. The tile in global memory could correspond either to the activations
|
||||
@ -618,7 +616,7 @@ The following figure shows how the first sixteen threads participating in an `ld
|
||||
logically map to the c=0..31 slice of a matrix in Shared Memory. This slice is known as a "k-group"
|
||||
within the code because it corresponds to the same K-index of a warp-level matrix multiply.
|
||||
|
||||

|
||||

|
||||
|
||||
The lower half of the figure shows the physical arrangement in Shared Memory, with threads offset by row and column
|
||||
according to the XOR function. By inspection, we can observe there are no bank conflicts, as _T0 ... T7_ each access unique
|
||||
@ -632,9 +630,9 @@ the following sequence:
|
||||
- **^3** advances from _k=3_ to _k=0_
|
||||
|
||||
The first of these transitions is shown below.
|
||||

|
||||

|
||||
|
||||
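As a generic illustration of the idea (not the exact permutation function CUTLASS applies), XOR-ing the column index with low bits of the row spreads the accesses of one `ldmatrix` row across distinct banks:

```c++
// Each thread writes its 128-bit vector to a swizzled column so that the
// eight threads of one ldmatrix row touch eight different banks.
int swizzled_column(int row, int col) {
  return col ^ (row & 7);   // illustrative permutation only
}
```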
The [CUTLASS warp-level GEMM API](/media/docs/gemm_api.md#warp-level-matrix-multiply-api) defines templates for
|
||||
The [CUTLASS warp-level GEMM API](gemm_api.md#warp-level-matrix-multiply-api) defines templates for
|
||||
loading slices of data from permuted Shared Memory and issuing operations to Tensor Cores.
|
||||
|
||||
### Updating the Output Tensor
|
||||
@ -647,11 +645,11 @@ needed.
|
||||
The **Epilogue** is the component for exchanging accumulator elements through Shared Memory, loading slices of the output
|
||||
matrix or tensor, applying an elementwise operation such as linear scaling or bias, and storing the result to the output tensor.
|
||||
CUTLASS structures this as several components:
|
||||
- [cutlass::epilogue::threadblock::Epilogue](/include/cutlass/epilogue/threadblock/epilogue.h) - the top-level component for looping over the entire threadblock tile
|
||||
- [cutlass::epilogue::warp::TileIteratorTensorOp](/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h) - a specialized component for storing accumulators for Tensor Core to Shared Memory
|
||||
- [cutlass::epilogue::threadblock::SharedLoadIterator](/include/cutlass/epilogue/threadblock/shared_load_iterator.h) - a component for loading elements from a row-major arrangement in Shared Memory
|
||||
- [cutlass::epilogue::threadblock::PredicatedTileIterator](/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h) - a component for loading or storing matrix fragments to Global Memory (with bounds checks)
|
||||
- [cutlass::epilogue::thread::LinearCombination](/include/cutlass/epilogue/thread/linear_combination.h) - an element-wise function computing `alpha * AB + beta * C` to compute the final output
|
||||
- [cutlass::epilogue::threadblock::Epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/epilogue.h) - the top-level component for looping over the entire threadblock tile
|
||||
- [cutlass::epilogue::warp::TileIteratorTensorOp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h) - a specialized component for storing accumulators for Tensor Core to Shared Memory
|
||||
- [cutlass::epilogue::threadblock::SharedLoadIterator](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/shared_load_iterator.h) - a component for loading elements from a row-major arrangement in Shared Memory
|
||||
- [cutlass::epilogue::threadblock::PredicatedTileIterator](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h) - a component for loading or storing matrix fragments to Global Memory (with bounds checks)
|
||||
- [cutlass::epilogue::thread::LinearCombination](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/linear_combination.h) - an element-wise function computing `alpha * AB + beta * C` to compute the final output
|
||||
|
||||
## Unit Tests
|
||||
|
||||
@ -663,13 +661,13 @@ b. showcase instantiation of use of these templates in device code, and
|
||||
c. assert functional correctness.
|
||||
|
||||
**Convolution unit tests**
|
||||
- Device-wide convolution operator: [conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu)
|
||||
- Device-wide convolution operator: [conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu)
|
||||
|
||||
**GEMM unit tests**
|
||||
- Warp-scoped matrix multiply for Turing Tensor Cores: [gemm_sm75.cu](/test/unit/gemm/warp/gemm_sm75.cu)
|
||||
- Warp-scoped matrix multiply for Turing Tensor Cores: [gemm_sm75.cu](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/warp/gemm_sm75.cu)
|
||||
|
||||
**Epilogue unit tests**
|
||||
- Epilogue for Turing Tensor Cores: [epilogue_tensor_op.cu](/test/unit/epilogue/threadblock/epilogue_tensor_op.cu)
|
||||
- Epilogue for Turing Tensor Cores: [epilogue_tensor_op.cu](https://github.com/NVIDIA/cutlass/tree/main/test/unit/epilogue/threadblock/epilogue_tensor_op.cu)
|
||||
|
||||
|
||||
# Convolution Example
|
||||
@ -681,10 +679,10 @@ of Implicit GEMM Convolution.
|
||||
|
||||
Example `09_turing_tensorop_conv2dfprop` computes a forward convolutional layer in which inputs and
|
||||
outputs are 4-b integers. The example source is visible in
|
||||
[examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu](/examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu).
|
||||
[examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu](https://github.com/NVIDIA/cutlass/tree/main/examples/09_turing_tensorop_conv2dfprop/turing_tensorop_conv2dfprop.cu).
|
||||
|
||||
|
||||
Before building the example, first perform the prerequisite steps for building any CUTLASS component [described here](/media/docs/quickstart.md).
|
||||
Before building the example, first perform the prerequisite steps for building any CUTLASS component [described here](quickstart.md).
|
||||
Compute capability 7.5 refers to the Turing architecture, and this work requires CUDA 10.2 Toolkit or later to target
|
||||
Turing Tensor Cores using the native `mma` [PTX instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-8832).
|
||||
|
||||
@ -708,7 +706,7 @@ initialize them to random values, and compute the result of a convolutional laye
|
||||
tensors may be saved to .csv files, and the CUTLASS host-side reference check may be executed to verify correctness.
|
||||
|
||||
The complete usage statement is visible by running with `--help`:
|
||||
```bash
|
||||
```
|
||||
$ ./examples/09_turing_tensorop_conv2dfprop/09_turing_tensorop_conv2dfprop --help
|
||||
09_turing_tensorop_conv2dfprop example
|
||||
|
||||
@ -1,6 +1,4 @@
|
||||

|
||||
|
||||
[README](../../README.md#documentation) > **Layouts and Tensors**
|
||||

|
||||
|
||||
Note: This document talks about CUTLASS 2.x layout tag types.
|
||||
CUTLASS 3.0 deprecates all legacy 2.x layout tags in favour of a single `cute::Layout<Shape, Stride>`
|
||||
media/docs/cpp/overview.md (new file, 619 lines)
@ -0,0 +1,619 @@
|
||||

|
||||
|
||||
# Overview
|
||||
|
||||
# CUTLASS 3.9.0
|
||||
|
||||
_CUTLASS 3.9.0 - March 2025_
|
||||
|
||||
CUTLASS is a collection of CUDA C++ template abstractions for implementing
|
||||
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
|
||||
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
|
||||
data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes
|
||||
these "moving parts" into reusable, modular software components abstracted by C++ template
|
||||
classes. Primitives for different levels of a conceptual parallelization hierarchy
|
||||
can be specialized and tuned via custom tiling sizes, data types,
|
||||
and other algorithmic policy. The resulting flexibility simplifies their use
|
||||
as building blocks within custom kernels and applications.
|
||||
|
||||
To support a wide variety of applications, CUTLASS provides extensive support for
|
||||
mixed-precision computations, providing specialized data-movement and
|
||||
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
|
||||
[FP32 emulation via tensor core instruction](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
|
||||
8b floating point types (e5m2 and e4m3),
|
||||
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
|
||||
narrow integer types (4 and 8b signed and unsigned integers),
|
||||
and binary 1b data types (where architectures allow for the
|
||||
native support of such data types).
|
||||
CUTLASS demonstrates optimal matrix multiply operations
|
||||
targeting the programmable, high-throughput _Tensor Cores_ implemented by
|
||||
NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
|
||||
|
||||
In addition to GEMMs, CUTLASS implements high-performance convolution via
|
||||
the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
|
||||
operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
|
||||
This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
|
||||
|
||||
See the [Quick Start Guide](quickstart.md) to get started quickly.
|
||||
|
||||
See the [functionality docs](functionality.md) for a more comprehensive
|
||||
list of kernel level features, data types, instructions, and minimum supported by CUTLASS on each GPU
|
||||
architecture.
|
||||
|
||||
# What's New in CUTLASS 3.9
|
||||
|
||||
* Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
|
||||
- Collective mainloops that target for:
|
||||
* [Blockscaled datatypes with support for dense GEMM](../../../include/cutlass/gemm/collective/sm120_blockscaled_mma_tma.hpp)
|
||||
* [Blockscaled datatypes with support for sparse GEMM](../../../include/cutlass/gemm/collective/sm120_blockscaled_sparse_mma_tma.hpp)
|
||||
- New [GEMM](../../../include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](../../../include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
|
||||
- [Blackwell SM120 epilogue](../../../include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp) and [full set of EVT fusions](../../../include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp).
|
||||
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
|
||||
- [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](../../../examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
|
||||
- [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](../../../examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
|
||||
- [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](../../../examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
|
||||
* Set of unit tests that demonstrate the usage of both [sparse](../../../test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](../../../test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
|
||||
* Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
|
||||
- Enhancement of [blockwise GEMM](../../../examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.
|
||||
- Enhancement of [groupwise GEMM](../../../examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.
|
||||
- Support for [grouped GEMM with blockwise and groupwise scaling](../../../examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
|
||||
- Support for [blockwise GEMM](../../../examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture.
|
||||
- Support for [groupwise GEMM](../../../examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture.
|
||||
- Support for [grouped GEMM with blockwise](../../../examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](../../../examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.
|
||||
* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
|
||||
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
|
||||
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
|
||||
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
|
||||
- More detailed introductions and examples to leverage this feature can be found in [profiler.md](./profiler.md#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
|
||||
|
||||
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
|
||||
|
||||
**See the [CHANGELOG](../release_notes.md) for details of all past releases and updates.**
|
||||
|
||||
# Performance
|
||||
|
||||
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
|
||||
they exhibit nearly optimal utilization of peak theoretical throughput. The figure below
|
||||
shows CUTLASS 3.8's performance as a % of theoretical peak utilization
|
||||
on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
|
||||
|
||||

|
||||
|
||||
The two figures below show the continual CUTLASS performance improvements
|
||||
on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since
|
||||
CUTLASS 3.1.
|
||||
CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads).
|
||||
Tensor Core operations are implemented using CUDA's
|
||||
[mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and
|
||||
[wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions.
|
||||
|
||||

|
||||

|
||||
|
||||
# CuTe
|
||||
|
||||
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
|
||||
CuTe is a collection of C++ CUDA template abstractions for
|
||||
defining and operating on hierarchically multidimensional layouts of threads and data.
|
||||
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
|
||||
shape, memory space, and layout of data, while performing the complicated indexing for the user.
|
||||
This lets programmers focus on the logical descriptions of their algorithms while
|
||||
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
|
||||
implement, and modify all dense linear algebra operations.
|
||||
|
||||
The core abstractions of CuTe are hierarchically multidimensional layouts
|
||||
which can be composed with data arrays to represent tensors.
|
||||
The representation of layouts is powerful enough to represent nearly
|
||||
everything we need to implement efficient dense linear algebra.
|
||||
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
|
||||
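A small illustrative snippet of these abstractions: building a 4x8 column-major layout and using it to translate a logical coordinate into a linear offset (the function wrapper is only there to keep the example self-contained).

```c++
#include <cute/tensor.hpp>

// Build a 4x8 column-major layout and use it to map a logical coordinate
// to a linear offset.
int example_offset() {
  using namespace cute;
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<1>{}, Int<4>{}));
  return layout(2, 3);   // 2 * 1 + 3 * 4 = 14
}
```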
|
||||
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
|
||||
This greatly simplifies the design and improves code composability and readability.
|
||||
More documentation specific to CuTe can be found in its
|
||||
[dedicated documentation directory](cute/00_quickstart.md).
|
||||
|
||||
# Compatibility
|
||||
|
||||
Minimum requirements:
|
||||
|
||||
- Architecture: Volta (compute capability 7.0)
|
||||
- Compiler: Must support at least C++17
|
||||
- CUDA Toolkit version: 11.4
|
||||
|
||||
CUTLASS requires a C++17 host compiler and
|
||||
performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads).
|
||||
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.
|
||||
|
||||
## Operating Systems
|
||||
|
||||
We have tested the following environments.
|
||||
|
||||
|**Operating System** | **Compiler** |
|
||||
|-----------------|----------|
|
||||
| Ubuntu 18.04 | GCC 7.5.0 |
|
||||
| Ubuntu 20.04 | GCC 10.3.0 |
|
||||
| Ubuntu 22.04 | GCC 11.2.0 |
|
||||
|
||||
Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
|
||||
|
||||
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
|
||||
|
||||
## Hardware
|
||||
|
||||
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.
|
||||
|
||||
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
|
||||
|---|---|---|
|
||||
|NVIDIA V100 Tensor Core GPU |7.0|11.4|
|
||||
|NVIDIA TitanV |7.0|11.4|
|
||||
|NVIDIA GeForce RTX 20x0 series |7.5|11.4|
|
||||
|NVIDIA T4 |7.5|11.4|
|
||||
|NVIDIA A100 Tensor Core GPU |8.0|11.4|
|
||||
|NVIDIA A10 |8.6|11.4|
|
||||
|NVIDIA GeForce RTX 30x0 series |8.6|11.4|
|
||||
|NVIDIA GeForce RTX 40x0 series |8.9|11.8|
|
||||
|NVIDIA L40 |8.9|11.8|
|
||||
|NVIDIA H100 Tensor Core GPU |9.0|11.8|
|
||||
|NVIDIA H200 Tensor Core GPU |9.0|11.8|
|
||||
|NVIDIA B200 Tensor Core GPU |10.0|12.8|
|
||||
|NVIDIA GeForce RTX 50x0 series |10.0|12.8|
|
||||
|
||||
## Target Architecture
|
||||
|
||||
In general, PTX code generated for one target architecture can be run on future architectures
|
||||
(i.e., it is forward compatible).
|
||||
However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose
|
||||
PTX does not have forward compatibility guarantees.
|
||||
Several Hopper and Blackwell PTX instructions fall under this category of
|
||||
architecture-accelerated features, and thus require a `sm_90a` or `sm_100a` target architecture
|
||||
(note the "a" appended). For more details on this and other architecture-accelerated instructions,
|
||||
please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
|
||||
|
||||
The target architecture information is passed on to CUTLASS via the cmake flag
|
||||
`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,
|
||||
users are required to build CUTLASS with `90a` as the target architecture.
|
||||
If a user accidentally builds a kernel which uses SM90a features
|
||||
(e.g. Hopper Tensor Core Instructions), using the SM90 target
|
||||
(note the lack of "a"), with either CUDA Toolkit 12 or 11.8,
|
||||
the kernel is expected to fail with a runtime error.
|
||||
|
||||
```
|
||||
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
|
||||
```
|
||||
Or
|
||||
|
||||
```
|
||||
cmake .. -DCUTLASS_NVCC_ARCHS="100a"
|
||||
```
|
||||
|
||||
Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
|
||||
products has a different compute capability than the one underpinning
|
||||
NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
|
||||
compiled for Blackwell SM100 architecture with arch conditional features
|
||||
(using `sm100a`) are not compatible with RTX 50 series GPUs.
|
||||
|
||||
Please refer to the [functionality documentation](functionality.md)
|
||||
for details on which kernels require which target architectures.
|
||||
|
||||
# Documentation
|
||||
|
||||
CUTLASS is described in the following documents and the accompanying
|
||||
[Doxygen documentation](https://nvidia.github.io/cutlass).
|
||||
|
||||
- [Quick Start Guide](quickstart.md) - basics of building and running CUTLASS
|
||||
- [Functionality](functionality.md) - summarizes functionality available in CUTLASS
|
||||
- [Efficient GEMM in CUDA](efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
|
||||
- [CUTLASS 3.x Design](cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
|
||||
- [GEMM API 3.x](gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
|
||||
- [GEMM API 2.x](gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
|
||||
- [Implicit GEMM Convolution](implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
|
||||
- [Code Organization](code_organization.md) - describes the organization and contents of the CUTLASS project
|
||||
- [Terminology](terminology.md) - describes terms used in the code
|
||||
- [Programming Guidelines](programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
|
||||
- [Fundamental types](fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
|
||||
- [Layouts](layout.md) - describes layouts of matrices and tensors in memory
|
||||
- [Tile Iterators](tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
|
||||
- [CUTLASS Profiler](profiler.md) - command-line driven profiling application
|
||||
- [CUTLASS Utilities](utilities.md) - additional templates used to facilitate rapid development
|
||||
- [Dependent kernel launch](dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
|
||||
kernels in the same stream, and how it is used in CUTLASS.
|
||||
|
||||
# Resources
|
||||
We have also described the structure of an efficient GEMM in our talk at the
|
||||
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
|
||||
|
||||
- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
|
||||
- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
|
||||
- [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
|
||||
- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
|
||||
- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)
|
||||
|
||||
# Building CUTLASS
|
||||
|
||||
CUTLASS is a header-only template library and does not need to be built to be used by other
|
||||
projects. Client applications should target CUTLASS's `include/` directory in their include
|
||||
paths.
|
||||
|
||||
CUTLASS unit tests, examples, and utilities can be built with CMake.
|
||||
The minimum version of CMake is given in the [Quickstart guide](quickstart.md).
|
||||
Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
|
||||
on your system.
|
||||
|
||||
```bash
|
||||
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
|
||||
```
|
||||
|
||||
Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
|
||||
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
|
||||
To reduce compile time you can specify
|
||||
the architectures to build CUTLASS for by changing the CMake configuration setting
|
||||
`CUTLASS_NVCC_ARCHS`.
|
||||
|
||||
```bash
|
||||
$ mkdir build && cd build
|
||||
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 # compiles for NVIDIA's Ampere Architecture
|
||||
```
|
||||
|
||||
From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.
|
||||
|
||||
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
|
||||
and they may be executed in parallel via make's `-j` command line argument.
|
||||
|
||||
```bash
|
||||
$ make test_unit -j
|
||||
...
|
||||
...
|
||||
...
|
||||
[----------] Global test environment tear-down
|
||||
[==========] 946 tests from 57 test cases ran. (10812 ms total)
|
||||
[ PASSED ] 946 tests.
|
||||
```
|
||||
|
||||
All tests should pass on supported platforms, though the exact number of tests may vary over time.
|
||||
|
||||
|
||||
# Project Structure
|
||||
|
||||
CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
|
||||
[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes,
|
||||
and template concepts defined in the CUTLASS project.
|
||||
|
||||
A detailed explanation of the source code organization may be found in the
|
||||
[CUTLASS documentation](code_organization.md), but several main components are summarized below.
|
||||
|
||||
## CUTLASS Template Library
|
||||
|
||||
```
|
||||
include/ # client applications should target this directory in their build's include paths
|
||||
|
||||
cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
|
||||
|
||||
arch/ # direct exposure of architecture features (including instruction-level GEMMs)
|
||||
|
||||
conv/ # code specialized for convolution
|
||||
|
||||
epilogue/ # code specialized for the epilogue of gemm/convolution
|
||||
|
||||
gemm/ # code specialized for general matrix product computations
|
||||
|
||||
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
|
||||
|
||||
platform/ # CUDA-capable Standard Library components
|
||||
|
||||
reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" model
|
||||
|
||||
thread/ # simt code that can be performed within a CUDA thread
|
||||
|
||||
transform/ # code specialized for layout, type, and domain transformations
|
||||
|
||||
* # core vocabulary types, containers, and basic numeric operations
|
||||
|
||||
cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
|
||||
|
||||
algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples
|
||||
|
||||
arch/ # Bare bones PTX wrapper structs for copy and math instructions
|
||||
|
||||
atom/ # Meta-information either link to or built from arch/ operators
|
||||
|
||||
mma_atom.hpp # cute::Mma_Atom and cute::TiledMma
|
||||
|
||||
copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy
|
||||
|
||||
*sm*.hpp # Arch specific meta-information for copy and math operations
|
||||
|
||||
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
|
||||
|
||||
```
|
||||
|
||||
### CUTLASS SDK Examples
|
||||
|
||||
[CUTLASS SDK examples](https://github.com/NVIDIA/cutlass/tree/main/examples) apply CUTLASS templates to implement basic computations.
|
||||
|
||||
### Tools
|
||||
|
||||
```
|
||||
tools/
|
||||
library/ # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
|
||||
include/
|
||||
cutlass/
|
||||
library/
|
||||
|
||||
profiler/ # CUTLASS Profiler - command-line utility for executing operations in the
|
||||
# CUTLASS Library
|
||||
|
||||
util/ # CUTLASS Utilities - contains numerous helper classes for
|
||||
include/ # managing tensors in device memory, reference
|
||||
cutlass/ # implementations for GEMM, random initialization
|
||||
util/ # of tensors, and I/O.
|
||||
```
|
||||
|
||||
### Test
|
||||
|
||||
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
|
||||
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
|
||||
|
||||
Instructions for building and running the Unit tests are described in the [Quickstart guide](quickstart.md).
|
||||
|
||||
# Performance Profiling
|
||||
|
||||
The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
|
||||
It can be built as follows:
|
||||
|
||||
```bash
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
## Building all GEMM and Convolution kernels (_long_ build times)
|
||||
|
||||
By default, only one tile size is instantiated for each data type, math instruction, and layout.
|
||||
To instantiate all, set the following environment variable when running CMake from an empty `build/` directory.
|
||||
Beware, this results in *tens of thousands* of kernels and long build times.
|
||||
This would also result in a large binary size and, on some platforms, cause the linker to fail when building the library.
|
||||
Therefore, it's highly recommended to generate only a subset of kernels as demonstrated in the sub-section below.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
|
||||
...
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
|
||||
## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
|
||||
|
||||
To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with
|
||||
wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
|
||||
or a subset of kernels for NVIDIA Ampere and Turing architecture:
|
||||
|
||||
### Building a subset Tensor Core GEMM kernels
|
||||
|
||||
To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architecture,
|
||||
use the below cmake command line:
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
|
||||
...
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
|
||||
Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
|
||||
```bash
|
||||
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
|
||||
|
||||
...
|
||||
=============================
|
||||
Problem ID: 1
|
||||
|
||||
Provider: CUTLASS
|
||||
OperationKind: gemm
|
||||
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
|
||||
|
||||
Status: Success
|
||||
Verification: ON
|
||||
Disposition: Passed
|
||||
|
||||
reference_device: Passed
|
||||
cuBLAS: Passed
|
||||
|
||||
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
|
||||
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
|
||||
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
|
||||
--max_cc=1024
|
||||
|
||||
Bytes: 118489088 bytes
|
||||
FLOPs: 115992428544 flops
|
||||
|
||||
Runtime: 1.55948 ms
|
||||
Memory: 70.7616 GiB/s
|
||||
|
||||
Math: 74378.8 GFLOP/s
|
||||
|
||||
|
||||
|
||||
=============================
|
||||
...
|
||||
```
|
||||
|
||||
### Building one CUDA Core GEMM kernel
|
||||
|
||||
To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
|
||||
...
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
|
||||
Example command line for profiling single SGEMM CUDA kernel is as follows:
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
|
||||
|
||||
=============================
|
||||
Problem ID: 1
|
||||
|
||||
Provider: CUTLASS
|
||||
OperationKind: gemm
|
||||
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
|
||||
|
||||
Status: Success
|
||||
Verification: ON
|
||||
Disposition: Passed
|
||||
|
||||
cuBLAS: Passed
|
||||
|
||||
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
|
||||
--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
|
||||
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
|
||||
|
||||
Bytes: 180355072 bytes
|
||||
FLOPs: 115992428544 flops
|
||||
|
||||
Runtime: 6.73655 ms
|
||||
Memory: 24.934 GiB/s
|
||||
|
||||
Math: 17218.4 GFLOP/s
|
||||
|
||||
=============================
|
||||
```
|
||||
|
||||
### Building a subset of Tensor Core Convolution kernels
|
||||
|
||||
To compile a subset of Tensor core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
|
||||
and FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
|
||||
...
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
|
||||
Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
|
||||
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
|
||||
|
||||
...
|
||||
=============================
|
||||
Problem ID: 1
|
||||
|
||||
Provider: CUTLASS
|
||||
OperationKind: conv2d
|
||||
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
|
||||
|
||||
Status: Success
|
||||
Verification: ON
|
||||
Disposition: Passed
|
||||
|
||||
reference_device: Passed
|
||||
|
||||
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
|
||||
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
|
||||
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
|
||||
--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
|
||||
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
|
||||
|
||||
Bytes: 1130659840 bytes
|
||||
FLOPs: 118482796544 flops
|
||||
|
||||
Runtime: 0.711496 ms
|
||||
Memory: 1479.99 GiB/s
|
||||
|
||||
Math: 166526 GFLOP/s
|
||||
|
||||
=============================
|
||||
...
|
||||
```
|
||||
|
||||
|
||||
### Building one Convolution CUDA kernel
|
||||
|
||||
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with F32 accumulation
|
||||
and FP32 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
|
||||
...
|
||||
$ make cutlass_profiler -j16
|
||||
```
|
||||
|
||||
Example command line for profiling one CUDA Core convolution kernel:
|
||||
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
|
||||
|
||||
|
||||
=============================
|
||||
Problem ID: 1
|
||||
|
||||
Provider: CUTLASS
|
||||
OperationKind: conv2d
|
||||
Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
|
||||
|
||||
Status: Success
|
||||
Verification: ON
|
||||
Disposition: Passed
|
||||
|
||||
reference_device: Passed
|
||||
|
||||
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
|
||||
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
|
||||
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
|
||||
--eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
|
||||
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
|
||||
|
||||
Bytes: 2055798784 bytes
|
||||
FLOPs: 118482796544 flops
|
||||
|
||||
Runtime: 7.34266 ms
|
||||
Memory: 260.752 GiB/s
|
||||
|
||||
Math: 16136.2 GFLOP/s
|
||||
|
||||
|
||||
=============================
|
||||
|
||||
```

## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
  - [GEMM CMake Examples](quickstart.md#gemm-cmake-examples)
  - [Implicit GEMM convolution CMake Examples](quickstart.md#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](profiler.md)

# About

CUTLASS is released by NVIDIA Corporation as Open Source software under the
[3-clause "New" BSD license](LICENSE.txt).

# Contributors

The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).

# Copyright

Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

@ -42,10 +42,10 @@ CUTLASS now includes abstractions
for the following features introduced in Hopper.

1. Thread block cluster - level synchronization and query
[APIs](/include/cute/arch/cluster_sm90.hpp)
[APIs](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/cluster_sm90.hpp)

2. Abstractions for new
[barrier instructions](/include/cutlass/arch/barrier.h)
[barrier instructions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/barrier.h)
which help with efficient synchronization
of threads within a thread block cluster.

@ -54,7 +54,7 @@ for the following features introduced in Hopper.
In order to write a performant GEMM Kernel,
software pipelining is critical to hide the latency of global memory loads.
(Please refer to the
[Efficient GEMM](/media/docs/efficient_gemm.md#pipelining) document.)
[Efficient GEMM](efficient_gemm.md#pipelining) document.)
Different threads or groups of threads
may have different roles in the pipeline.
Some are "producers" that load data or perform computations
@ -73,7 +73,7 @@ dozens of different kinds of asynchronously executing operations
that synchronize using multiple barriers organized as a circular list.
This complexity is too much for human programmers to manage by hand.
As a result, we have developed
[asynchronous Pipeline classes](/include/cutlass/pipeline/).
[asynchronous Pipeline classes](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/pipeline/).
These classes help developers orchestrate a pipeline
of asynchronous producer and consumer threads,
without needing to worry about lower-level hardware details.
@ -173,8 +173,8 @@ and then synchronize among 3 asynchronously executing threads:
Please note that this is a basic example.
There are different versions possible,
depending on what the producer and consumer threads are doing.
Please refer to our [unit tests](/test/unit/pipeline)
and the other [pipeline classes](/include/cutlass/pipeline/pipeline.hpp)
Please refer to our [unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/pipeline)
and the other [pipeline classes](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/pipeline/pipeline.hpp)
for more details.

# Copyright
@ -1,6 +1,4 @@

[README](../../README.md#documentation) > **CUTLASS Profiler**


# CUTLASS Profiler

@ -33,9 +31,9 @@ tools/

# Emitting kernels via `emit_kernel_listing.py`

We provide a Python script `emit_kernel_listing.py` that allows a user to selectively test a subset of the profiler-based kernels stamped out in `generator.py`. A unique benefit of generating and testing kernels via this script is that it can feed a series of runtime arguments, such as different `M`/`N`/`K` and `alpha`/`beta`, to each kernel instead of relying on a single default value. It also properly generates runtime data types and cluster shapes for certain kernels to help reduce the generated kernel count and, accordingly, the total compilation time. An interested user may refer to [emit_kernel_listing.py](../../python/cutlass_library/emit_kernel_listing.py) for details. To enable this feature, a user should add `-DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON` when building the CUTLASS profiler.
We provide a Python script `emit_kernel_listing.py` that allows a user to selectively test a subset of the profiler-based kernels stamped out in `generator.py`. A unique benefit of generating and testing kernels via this script is that it can feed a series of runtime arguments, such as different `M`/`N`/`K` and `alpha`/`beta`, to each kernel instead of relying on a single default value. It also properly generates runtime data types and cluster shapes for certain kernels to help reduce the generated kernel count and, accordingly, the total compilation time. An interested user may refer to [emit_kernel_listing.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/emit_kernel_listing.py) for details. To enable this feature, a user should add `-DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON` when building the CUTLASS profiler.
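
For instance, a minimal configure-and-build sequence with this flag enabled might look like the following sketch (the architecture value and job count are placeholders; adjust them for your target GPU and machine):

```bash
# Hypothetical example: build the profiler with regression-oriented kernel emission enabled.
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON
$ make cutlass_profiler -j16
```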

### Instantiating more kernels with Hopper
## Instantiating more kernels with Hopper
With Hopper (SM90), you will need to use an additional flag,
`CUTLASS_LIBRARY_INSTANTIATION_LEVEL`, in order to instantiate all possible combinations,
which, unlike previous architectures, will be on the order of millions of kernels.
@ -81,12 +79,12 @@ Instruction shape levels control the selection of WGMMA shapes used in kernel ge
- **Level 2**: Includes shapes that are powers of 2.
- **Level 3**: Includes all other shapes.

The detailed definition of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](../../python/cutlass_library/sm90_shapes.py).
The detailed definition of the three instantiation levels controlling cluster shape, MMA shape multiplier, and instruction shape can be found in [sm90_shapes.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py).

Schedule pruning levels decide the epilogue schedule and mainloop schedule to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](../../python/cutlass_library/sm90_utils.py),
Schedule pruning levels decide the epilogue schedule and mainloop schedule to stamp out a kernel instance. As defined in `get_valid_schedules` in [sm90_utils.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_utils.py),

- **Level >= 1**: Indicates that no pruning is being applied.
- **Level 0**: Indicates pruning according to existing [generator.py](../../python/cutlass_library/generator.py) behavior.
- **Level 0**: Indicates pruning according to existing [generator.py](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/generator.py) behavior.

An instantiation level `500`, which is padded to `0500`, thus indicates:

@ -95,7 +93,7 @@ An instantiation level `500`, which is padded to `0500`, thus indicates:
- **Cluster Sizes**: At level 5, allowing for clusters with 1, 2, 4, 8, or 16 CTAs.
- **Schedule Pruning**: At level 0, where pruning is applied according to the existing `generator.py` behavior.
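
As a hedged illustration of passing such a level at configure time (the kernel filter below is a placeholder; any filter accepted by `CUTLASS_LIBRARY_KERNELS` works):

```bash
# Hypothetical example: expand SM90 kernel instantiation at level 500 (decoded as described above).
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a \
    -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm90_tensorop*gemm*" \
    -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=500
```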

### Mixed input data type kernels for Hopper
## Mixed input data type kernels for Hopper

With Hopper (SM90), the kernel generator will generate the following combinations of mixed input data types ("mixed dtype"):

@ -118,7 +116,7 @@ For each mixed dtype kernel, the kernel generator will generate combinations of

For {4-bit dtype, 8-bit dtype} x 16-bit dtype combinations, the kernel generator will further generate kernels using shuffled layouts for the narrow data type matrix, which may have better performance compared to their non-shuffled counterparts.

### CUTLASS Profiler usage
## CUTLASS Profiler usage

The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
```bash
@ -364,11 +362,11 @@ Profile when execution is performed on device 0 and the C tensor is located on a
$ cutlass_profiler --device=0 --allocations=C:1,D:2 --operation=Gemm --m=1024 --n=1024 --k=128
```

Each tensor argument takes the form `<type>:<layout>`. The type could be `f32` for 32-bit floating point, `s8` for 8-bit signed integer, etc.; the available types are listed in `NumericTypeID_enumerants` in [util.cu](tools/library/src/util.cu). The layout could be `row` or `column`. If `--enable_sm90_mixed_dtype_shuffle_test=true` is used, the actual layout of the narrow data type matrix is a shuffled layout, neither `row` nor `column`.
Each tensor argument takes the form `<type>:<layout>`. The type could be `f32` for 32-bit floating point, `s8` for 8-bit signed integer, etc.; the available types are listed in `NumericTypeID_enumerants` in [util.cu](https://github.com/NVIDIA/cutlass/tree/main/tools/library/src/util.cu). The layout could be `row` or `column`. If `--enable_sm90_mixed_dtype_shuffle_test=true` is used, the actual layout of the narrow data type matrix is a shuffled layout, neither `row` nor `column`.
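
For example (a sketch; the data types, layouts, and problem sizes below are purely illustrative):

```bash
# Hypothetical invocation: f16 operands with explicit layouts.
$ cutlass_profiler --operation=Gemm --A=f16:row --B=f16:column --m=2048 --n=2048 --k=1024
```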

In addition to encoded data types, the CUTLASS profiler allows non-encoded generic data types, namely `f8`, `f6`, and `f4`, with the corresponding encoding specified through the GEMM input arguments `--runtime_input_datatype_a` and `--runtime_input_datatype_b`. Currently, the following encoding schemes are supported: `e4m3`, `e5m2`, `e3m2`, `e2m3`, and `e2m1`.
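
A sketch of such an invocation (assuming a kernel that accepts runtime data types; the generic types, encodings, and sizes are placeholders):

```bash
# Hypothetical invocation: generic f8 operands with runtime-selected encodings.
$ cutlass_profiler --operation=Gemm --A=f8:row --B=f8:column \
    --runtime_input_datatype_a=e4m3 --runtime_input_datatype_b=e5m2 \
    --m=4096 --n=4096 --k=4096
```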

Cluster shapes can be statically set to `Shape<int,int,_1>;` and specified via the runtime arguments `cluster_m`, `cluster_n`, and `cluster_k` in the CUTLASS profiler. In addition to preferred cluster shapes, a user can also specify fallback cluster shapes via the runtime arguments `cluster_m_fallback`, `cluster_n_fallback`, and `cluster_k_fallback`. Fallback cluster shapes are smaller than the preferred ones and are assigned by the hardware when a larger preferred CGA cluster cannot be issued to the GPU. There are several rules for using a flexible CGA: 1) the preferred CGA size should be divisible by the fallback CGA size; 2) the grid dimension should be divisible by the preferred CGA size; 3) the preferred CGA and fallback CGA must have the same depth (`cluster_dim.z` must be equal). One may refer to our CUTLASS example [73_blackwell_gemm_flexible_cluster](../../examples/73_blackwell_gemm_preferred_cluster/blackwell_gemm_preferred_cluster.cu) for more details of this feature.
Cluster shapes can be statically set to `Shape<int,int,_1>;` and specified via the runtime arguments `cluster_m`, `cluster_n`, and `cluster_k` in the CUTLASS profiler. In addition to preferred cluster shapes, a user can also specify fallback cluster shapes via the runtime arguments `cluster_m_fallback`, `cluster_n_fallback`, and `cluster_k_fallback`. Fallback cluster shapes are smaller than the preferred ones and are assigned by the hardware when a larger preferred CGA cluster cannot be issued to the GPU. There are several rules for using a flexible CGA: 1) the preferred CGA size should be divisible by the fallback CGA size; 2) the grid dimension should be divisible by the preferred CGA size; 3) the preferred CGA and fallback CGA must have the same depth (`cluster_dim.z` must be equal). One may refer to our CUTLASS example [73_blackwell_gemm_flexible_cluster](https://github.com/NVIDIA/cutlass/tree/main/examples/73_blackwell_gemm_preferred_cluster/blackwell_gemm_preferred_cluster.cu) for more details of this feature.
Please note that this feature (flexible cluster shapes within a single grid) is only applicable to `sm100a` kernels. The hardware will rasterize into a single cluster shape for those kernels that do not support this feature, even if preferred or fallback cluster shapes are assigned.
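
A hedged example of exercising preferred and fallback clusters together (the kernel filter and problem shape are placeholders):

```bash
# Hypothetical invocation: preferred 4x4x1 cluster with a 2x2x1 fallback.
# Note: 4x4x1 = 16 CTAs is divisible by 2x2x1 = 4 CTAs, and both have the same depth.
$ cutlass_profiler --operation=Gemm --kernels="cutlass3x_sm100a*" \
    --m=8192 --n=8192 --k=8192 \
    --cluster_m=4 --cluster_n=4 --cluster_k=1 \
    --cluster_m_fallback=2 --cluster_n_fallback=2 --cluster_k_fallback=1
```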

CUTLASS 3.x kernels for Hopper and Blackwell also support a feature called programmatic dependent launch (PDL). This can be enabled with `--use-pdl` and can overlap the epilogue of the prior kernel with the prologue of the next kernel, effectively hiding kernel prologues. Using PDL can improve performance for back-to-back GEMMs. See [dependent kernel launch](dependent_kernel_launch.md) for more information. CUDA graphs can also be used (`--use-cuda-graphs`) with PDL to ensure that smaller kernels are enqueued back-to-back on a stream.
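
For instance (a sketch; the problem sizes are arbitrary), PDL and CUDA graphs can be combined in a single profiling run:

```bash
# Hypothetical invocation: back-to-back GEMMs with programmatic dependent launch and CUDA graphs.
$ cutlass_profiler --operation=Gemm --m=4096 --n=4096 --k=1024 --use-pdl --use-cuda-graphs
```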

@ -1,6 +1,4 @@

[README](../../README.md#documentation) > **Programming Guidelines**


# Programming Guidelines

@ -954,9 +952,9 @@ For example:
```

Header files such as
[cutlass/cutlass.h](../../include/cutlass/cutlass.h)
[cutlass/cutlass.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/cutlass.h)
and
[cute/config.hpp](../../include/cute/config.hpp)
[cute/config.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cute/config.hpp)
offer macros for expressing compiler-dependent behavior.
These include

@ -1,6 +1,4 @@

[README](../../README.md#documentation) > **Quick Start**


# Quickstart

@ -217,7 +215,7 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS="50;53" # compiles for NVIDIA Maxwell G

## Using CUTLASS within other applications

Applications should list [`/include`](/include) within their include paths. They must be
Applications should list [`/include`](https://github.com/NVIDIA/cutlass/tree/main/include) within their include paths. They must be
compiled as C++17 or greater.

**Example:** print the contents of a variable storing half-precision data.
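
A minimal sketch of such a program (assuming only `cutlass::half_t` from `cutlass/numeric_types.h`, converting to `float` for printing):

```c++
#include <iostream>
#include <cutlass/numeric_types.h>

int main() {
  cutlass::half_t x = static_cast<cutlass::half_t>(2.25f);  // half-precision value
  std::cout << float(x) << std::endl;                       // prints 2.25
  return 0;
}
```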

@ -466,7 +464,7 @@ int main(int argc, char const **args) {

# CUTLASS Library

The [CUTLASS Library](/tools/library) defines an API for managing and executing collections of compiled
The [CUTLASS Library](https://github.com/NVIDIA/cutlass/tree/main/tools/library) defines an API for managing and executing collections of compiled
kernel instances and launching them from host code without template instantiations in client code.

The host-side launch API is designed to be analogous to BLAS implementations for convenience, though its

@ -482,16 +480,16 @@ for dense matrix computations on NVIDIA GPUs.
The CUTLASS Library is used by the CUTLASS Profiler to manage kernel instances, and it is also used
by several SDK examples.

* [10_planar_complex](/examples/10_planar_complex/planar_complex.cu)
* [11_planar_complex_array](/examples/11_planar_complex_array/planar_complex_array.cu)
* [10_planar_complex](https://github.com/NVIDIA/cutlass/tree/main/examples/10_planar_complex/planar_complex.cu)
* [11_planar_complex_array](https://github.com/NVIDIA/cutlass/tree/main/examples/11_planar_complex_array/planar_complex_array.cu)

The CUTLASS Library defines enumerated types describing numeric data types, matrix and tensor
layouts, math operation classes, complex transformations, and more.

Client applications should specify [`tools/library/include`](/tools/library/include) in their
Client applications should specify [`tools/library/include`](https://github.com/NVIDIA/cutlass/tree/main/tools/library/include) in their
include paths and link against libcutlass_lib.so.

The CUTLASS SDK example [10_planar_complex](/examples/10_planar_complex/CMakeLists.txt) specifies
The CUTLASS SDK example [10_planar_complex](https://github.com/NVIDIA/cutlass/tree/main/examples/10_planar_complex/CMakeLists.txt) specifies
its dependency on the CUTLASS Library with the following CMake command.
```
target_link_libraries(

@ -662,7 +660,7 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*
## Instantiating a Blackwell SM100 GEMM kernel

Blackwell SM100 kernels are instantiated very similarly to Hopper kernels. Let us start with an
[FP8 GEMM without blockscaling](../../test/unit/gemm/device/sm100_gemm_f8_f8_f8_tensor_op_s32_batch_alpha_beta.cu)
[FP8 GEMM without blockscaling](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_gemm_f8_f8_f8_tensor_op_s32_batch_alpha_beta.cu)
as an example.

The kernel starts with setting up datatypes and cluster shapes.
@ -706,7 +704,7 @@ for Blackwell, so the epilogue fusion is built in a same way as an SM90 epilogue
```

One can refer to our SM100 unit tests as examples of how to correctly
choose mainloop schedules. All of our dispatch policies can be found in [dispatch_policy.hpp](../../include/cutlass/gemm/dispatch_policy.hpp)
choose mainloop schedules. All of our dispatch policies can be found in [dispatch_policy.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp)
and more comprehensive Blackwell-specific documentation for valid
dispatch policies can be found in [blackwell_functionality.md](./blackwell_functionality.md).

@ -729,7 +727,7 @@ dispatch policies can be in [blackwell_functionali
>;
```

Instantiating a blockscaled GEMM kernel is slightly different. Referring to an [MXFP8 GEMM](./../../test/unit/gemm/device/sm100_gemm_mxf8_mxf8_mxf8_tensor_op_f32_auto.cu) sample unit test, it takes a different tensor operation class:
Instantiating a blockscaled GEMM kernel is slightly different. Referring to an [MXFP8 GEMM](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_gemm_mxf8_mxf8_mxf8_tensor_op_f32_auto.cu) sample unit test, it takes a different tensor operation class:

```c++
using ElementA = cutlass::mx_float8_t<cutlass::float_e4m3_t>;
@ -1,13 +1,11 @@

[README](../../README.md#documentation) > **Terminology**


# CUTLASS Terminology

**cute::Layout**: A `cute::Layout` vocabulary type composed of the hierarchical `cute::Shape` and `cute::Stride`
tuples that is used throughout CUTLASS 3.0 to represent and manipulate thread and data layouts. More details are included in the [CuTe specific tensor type documentation](/media/docs/cute/03_tensor.md).
tuples that is used throughout CUTLASS 3.0 to represent and manipulate thread and data layouts. More details are included in the [CuTe specific tensor type documentation](cute/03_tensor.md).

**cute::Tensor**: A pointer backed by a `cute::Layout` used to represent a tensor. More details are included in the [CuTe specific tensor type documentation](/media/docs/cute/03_tensor.md).
**cute::Tensor**: A pointer backed by a `cute::Layout` used to represent a tensor. More details are included in the [CuTe specific tensor type documentation](cute/03_tensor.md).
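
As a brief illustration of these two vocabulary types (a sketch only; the shape, strides, and buffer are arbitrary):

```c++
#include <cute/tensor.hpp>

void example() {
  float buf[32];
  // A 4x8 column-major layout: shape (4,8), strides (1,4).
  auto layout = cute::make_layout(cute::make_shape(4, 8), cute::make_stride(1, 4));
  // A cute::Tensor pairs a pointer with that layout.
  auto tensor = cute::make_tensor(&buf[0], layout);
  tensor(2, 3) = 1.0f;  // element access resolved through the layout
}
```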

**Capacity**: (scalar) physical number of elements in memory required to store a multidimensional object; expressed as the type's LongIndex type
- example: the capacity of a column-major matrix is `lda * N`
@ -71,7 +69,7 @@ contiguous and strided dimensions of a tile.
`sizeof(Array<T, N>)` - gives expected value in units of bytes with minimum storage of `1 B`: (sizeof_bits<T>::value * N) / 8

**Operator**: an object performing a computation on matrix or tensor objects. May be further refined by scope within the execution model hierarchy. Deprecated starting CUTLASS 3.0,
replaced by [MMA and Copy atoms from CuTe](/media/docs/cute/0t_mma_atom.md).
replaced by [MMA and Copy atoms from CuTe](cute/0t_mma_atom.md).

**Tile Iterator**: abstraction for accessing and traversing a sequence of tiles in a tensor; CUTLASS specifies
[formal concepts for tile iterators](tile_iterator_concept.md). Deprecated starting CUTLASS 3.0.
@ -1,6 +1,4 @@

[README](../../README.md#documentation) > **Tile Iterator Concepts**


# Tile Iterator Concepts

@ -1,13 +1,12 @@



[README](../../README.md#documentation) > **CUTLASS Utilities**

Note: This document discusses utilities commonly used with code that targets CUTLASS 2.x.
Although CUTLASS 3.0's primary entry point APIs do not transact in these `cutlass::*` tensor types anymore,
users can still find them convenient for managing allocations with trivial affine layouts.
For more advanced host side tensor management, [`cute::Tensor`](/media/docs/cute/03_tensor.md)s
For more advanced host side tensor management, [`cute::Tensor`](cute/03_tensor.md)s
can be used on either host or device for any memory space and full expressive power of
[`cute::Layout`](/media/docs/cute/01_layout.md)s.
[`cute::Layout`](cute/01_layout.md)s.

# CUTLASS Utilities

@ -17,12 +16,12 @@ flexible implementations of needed functionality, but they are not expected to b
Applications should configure their builds to list `/tools/util/include` in their include
paths.

Source code is in [`/tools/util/include/cutlass/util/`](/tools/util/include/cutlass/util).
Source code is in [`/tools/util/include/cutlass/util/`](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util).

## Tensor Allocation and I/O

To allocate a tensor with storage in both host and device memory, use `HostTensor` in
[`cutlass/util/host_tensor.h`](/tools/util/include/cutlass/util/host_tensor.h)
[`cutlass/util/host_tensor.h`](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/host_tensor.h)

```c++
template <typename Element, typename Layout>
@ -61,7 +60,7 @@ cutlass::TensorView<float, cutlass::layout::ColumnMajor> device_view = tensor.de
```
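
A minimal usage sketch of `HostTensor` (assuming the extent-taking constructor, `host_view()`, and `sync_device()`; the element type, layout, and sizes are arbitrary):

```c++
#include <cutlass/util/host_tensor.h>

void allocate_and_copy() {
  // Allocate a 128x64 column-major tensor in both host and device memory.
  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({128, 64});

  tensor.host_view().at({0, 0}) = 1.0f;  // write through the host-side view
  tensor.sync_device();                  // copy host contents to device memory
}
```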

Printing to human-readable CSV output is accomplished with `std::ostream::operator<<()` defined in
[`cutlass/util/tensor_view_io.h`](/tools/util/include/cutlass/util/tensor_view_io.h).
[`cutlass/util/tensor_view_io.h`](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/tensor_view_io.h).
Note, this assumes all views refer to host memory.
```c++
#include <cutlass/util/tensor_view_io.h>
@ -428,7 +427,7 @@ synclog at [synclog_at]: [header] line=[line] thread=[threadIdx.xyz] block=[bloc
* `header`: Name of the synchronization event.
* `line`: Code line number of the synchronization operation calling into `synclog`.

Additional information may appear at the end of each line, such as shared memory address, phase bit, and arrive count. For more detailed information on `synclog` output, refer to [synclog.hpp](../../include/cutlass/arch/synclog.hpp) in the CUTLASS source code.
Additional information may appear at the end of each line, such as shared memory address, phase bit, and arrive count. For more detailed information on `synclog` output, refer to [synclog.hpp](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/synclog.hpp) in the CUTLASS source code.

Please note that `synclog` is an experimental feature, and its functionality is not always guaranteed. We encourage its use in custom kernels and CUTLASS examples, though it is known to be incompatible with profiler kernels.