CUTLASS 2.2 (#96)

Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
Andrew Kerr committed by GitHub on 2020-06-08 16:17:35 -07:00
parent e33d90b361
commit 86931fef85
584 changed files with 51080 additions and 3373 deletions


@@ -27,7 +27,16 @@ Hyperlinks to relevant unit tests demonstrate how specific template instances ma
| **Opcode Class** | **Compute Capability** | **CUDA Toolkit** | **Data Type** | **Layouts** | **Unit Test** |
|------------------|------------------------|------------------|--------------------------------|------------------------|---------------|
| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f16_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f32 => {f16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `bf16 * bf16 + f32 => {bf16, f32}`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_bf16n_bf16t_bf16t_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32`| {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s8t_s8n_s32n_tensor_op_s32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f64 * f64 + f64 => f64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf32n_cf32t_cf32t_tensor_op_tf32_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `cf64 * cf64 + cf64 => cf64` | {N,T} x {N,T} => {N,T} | [example](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_sm80.cu), [Gaussian 3m](/test/unit/gemm/device/gemm_cf64n_cf64t_cf64t_tensor_op_f64_gaussian_sm80.cu) |
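As a rough sketch of how one row above maps to code (not part of this commit; pointer and leading-dimension names are illustrative), the SM80 `f16 * f16 + f32 => f32` entry could be instantiated through the device-level `cutlass::gemm::device::Gemm` template, leaving tile shapes, epilogue, and pipeline stages to the library's defaults for `OpClassTensorOp` / `Sm80`:

```
#include "cutlass/gemm/device/gemm.h"

// Sketch only: device-wide GEMM matching the table row
//   f16 * f16 + f32 => f32 with layouts {N} x {T} => {T}.
using GemmF16TensorOp = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,  // ElementA, LayoutA (N)
    cutlass::half_t, cutlass::layout::RowMajor,     // ElementB, LayoutB (T)
    float,           cutlass::layout::RowMajor,     // ElementC, LayoutC (T)
    float,                                          // accumulator type
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
    cutlass::arch::Sm80>;                           // target NVIDIA Ampere

// Illustrative usage: construct the operator and call it like a function.
//   GemmF16TensorOp gemm_op;
//   cutlass::Status status = gemm_op({{M, N, K},       // problem size
//                                     {ptr_A, lda},    // TensorRef to A
//                                     {ptr_B, ldb},    // TensorRef to B
//                                     {ptr_C, ldc},    // TensorRef to C (source)
//                                     {ptr_D, ldd},    // TensorRef to D (output)
//                                     {alpha, beta}}); // epilogue scalars
```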
## Warp-level Matrix Multiply with Tensor Cores
@@ -37,9 +46,13 @@ The following table summarizes supported warp level shapes for each TensorOp ins
| **Opcode Class** | **Instruction Shape** | **Warp Shapes** |
|-----------------|-----------------------|--------------------------------------------|
| **TensorOp** | 8-by-8-by-4 | 32x32x4, 32x64x4, 64x32x4, 64x64x4 |
| **TensorOp** | 16-by-8-by-8 | 32x32x8, 32x64x8, 64x32x8, 64x64x8 |
| **TensorOp** | 16-by-8-by-16 | 32x32x16, 32x64x16, 64x32x16, 64x64x16 |
| **TensorOp** | 8-by-8-by-16 | 32x32x16, 32x64x16, 64x32x16, 64x64x16 |
| **TensorOp** | 8-by-8-by-32 | 32x32x32, 32x64x32, 64x32x32, 64x64x32 |
| **TensorOp** | 16-by-8-by-32 | 32x32x32, 32x64x32, 64x32x32, 64x64x32 |
| **TensorOp** | 16-by-8-by-64 | 32x32x64, 32x64x64, 64x32x64, 64x64x64 |
| **TensorOp** | 8-by-8-by-128 | 32x32x128, 32x64x128, 64x32x128, 64x64x128 |
| **TensorOp** | 16-by-8-by-256 | 32x32x256, 32x64x256, 64x32x256, 64x64x256 |
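Each instruction shape in this table names a Tensor Core `mma.sync` operation. As a sketch (not part of this commit), the 16-by-8-by-16 shape with `half_t` operands and `float` accumulators is exposed at the architecture level as a `cutlass::arch::Mma` specialization:

```
#include "cutlass/arch/mma_sm80.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/numeric_types.h"

// Sketch only: the m16n8k16 Tensor Core instruction computed collectively
// by the 32 threads of one warp; warp-level tiles such as 64x64x16 are
// built by iterating this operation over the warp tile.
using InstructionMma = cutlass::arch::Mma<
    cutlass::gemm::GemmShape<16, 8, 16>,            // instruction shape MxNxK
    32,                                             // participating threads
    cutlass::half_t, cutlass::layout::RowMajor,     // A fragment
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B fragment
    float,           cutlass::layout::RowMajor,     // C/D fragment
    cutlass::arch::OpMultiplyAdd>;                  // multiply-accumulate
```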
TensorOp instructions depend on a permuted shared memory layout that can be efficiently
loaded from. The following tables summarize the destination shared memory layout that
@@ -68,6 +81,38 @@ from global memory with layout specified in the column "GMEM Layout."
| **C** | `half_t` | `RowMajor` | `RowMajor` |
| **C** | `float` | `RowMajor` | `RowMajor` |
**TensorOp 16-by-8-by-8.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
|-----------|--------------|-----------------|------------------------------------|
| **A** | `tfloat32_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<32>` |
| **A** | `tfloat32_t` | `RowMajor` | `RowMajorTensorOpCrosswise<32>` |
| **B** | `tfloat32_t` | `ColumnMajor` | `ColumnMajorTensorOpCrosswise<32>` |
| **B** | `tfloat32_t` | `RowMajor` | `RowMajorTensorOpCongruous<32>` |
| **C** | `float` | `RowMajor` | `RowMajor` |
**TensorOp 16-by-8-by-16.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
|-----------|--------------|-----------------|------------------------------------|
| **A** | `half_t`, `bfloat16_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<16>` |
| **A** | `half_t`, `bfloat16_t` | `RowMajor` | `RowMajorTensorOpCrosswise<16>` |
| **B** | `half_t`, `bfloat16_t` | `ColumnMajor` | `ColumnMajorTensorOpCrosswise<16>` |
| **B** | `half_t`, `bfloat16_t` | `RowMajor` | `RowMajorTensorOpCongruous<16>` |
| **C** | `half_t` | `RowMajor` | `RowMajor` |
| **C** | `float` | `RowMajor` | `RowMajor` |
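Tying the instruction shapes to these shared memory layouts, the sketch below (not part of this commit) composes a warp-level operator for `half_t`. In code, the shorthand `ColumnMajorTensorOpCongruous<16>` used in these tables corresponds to `cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous`; the `<16, 64>` parameters (element size in bits, crosswise elements) are an assumption here, not taken from this diff:

```
#include "cutlass/gemm/warp/default_mma_tensor_op.h"
#include "cutlass/layout/tensor_op_multiplicand_sm75.h"

// Sketch only: a 64x64x16 warp-level MMA built on the 16-by-8-by-16
// instruction, reading A and B from the congruous SMEM layouts above.
using WarpMma = typename cutlass::gemm::warp::DefaultMmaTensorOp<
    cutlass::gemm::GemmShape<64, 64, 16>,  // warp-level GEMM shape
    cutlass::gemm::GemmShape<16, 8, 16>,   // underlying instruction shape
    cutlass::half_t,                       // ElementA
    cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous<16, 64>,
    cutlass::half_t,                       // ElementB
    cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>,
    float,                                 // ElementC
    cutlass::layout::RowMajor>::Type;
```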
**TensorOp 8-by-8-by-4.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
|-----------|--------------|-----------------|------------------------------------|
| **A** | `double` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<64>` |
| **A** | `double` | `RowMajor` | `RowMajorTensorOpCrosswise<64>` |
| **B** | `double` | `ColumnMajor` | `ColumnMajorTensorOpCrosswise<64>` |
| **B** | `double` | `RowMajor` | `RowMajorTensorOpCongruous<64>` |
| **C** | `double` | `RowMajor` | `RowMajor` |
**TensorOp 8-by-8-by-16.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
@@ -76,6 +121,14 @@ from global memory with layout specified in the column "GMEM Layout."
| **B** | `int8_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<8>` |
| **C** | `int32_t` | `RowMajor` | `RowMajor` |
**TensorOp 16-by-8-by-32.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
|-----------|--------------|-----------------|------------------------------------|
| **A** | `int8_t` | `RowMajor` | `RowMajorTensorOpCrosswise<8>` |
| **B** | `int8_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<8>` |
| **C** | `int32_t` | `RowMajor` | `RowMajor` |
**TensorOp 8-by-8-by-32.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
@@ -84,6 +137,14 @@ from global memory with layout specified in the column "GMEM Layout."
| **B** | `int4b_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<4>` |
| **C** | `int32_t` | `RowMajor` | `RowMajor` |
**TensorOp 16-by-8-by-64.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
|-----------|--------------|-----------------|------------------------------------|
| **A** | `int4b_t` | `RowMajor` | `RowMajorTensorOpCrosswise<4>` |
| **B** | `int4b_t` | `ColumnMajor` | `ColumnMajorTensorOpCongruous<4>` |
| **C** | `int32_t` | `RowMajor` | `RowMajor` |
**TensorOp 8-by-8-by-128.**
|**Operand**|**Element** | **GMEM Layout** | **SMEM Layout** |
@@ -119,7 +180,7 @@ CUDA exposes warp-level matrix operations in the CUDA C++ WMMA API. The CUDA C++
# Copyright
-Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved.
+Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
```
Redistribution and use in source and binary forms, with or without modification, are permitted