CUTLASS 2.1 (#83)

CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
2020-04-07 13:51:25 -07:00
parent 7c0cd26d13
commit 96dab34ad9
196 changed files with 20653 additions and 1995 deletions
--- a/media/docs/functionality.md
+++ b/media/docs/functionality.md
@ -28,6 +28,7 @@ Hyperlinks to relevant unit tests demonstrate how specific template instances ma
 | **TensorOp**        | 75                 |  10.2+           | `s4 * s4 + s32 => {s32, s4}`   | { T } x { N } => {N,T} |  [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
 | **TensorOp**        | 75                 |  10.2+           | `b1 ^ b1 + s32 => {s32, b1}`   | { T } x { N } => {N,T} |  [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |

+
 ## Warp-level Matrix Multiply with Tensor Cores

 The following table summarizes supported warp level shapes for each TensorOp instruction.
--- a/media/docs/quickstart.md
+++ b/media/docs/quickstart.md
@ -141,7 +141,7 @@ int main() {
 }
 ```

-## Launching a GEMM kernel
+## Launching a GEMM kernel in CUDA

 **Example:** launch a mixed-precision GEMM targeting Volta Tensor Cores.
 ```c++
@ -235,9 +235,172 @@ Note, the above could be simplified as follows using helper methods defined in `
  });
 ```

+# CUTLASS Library
+
+The [CUTLASS Library](./tools/library) defines an API for managing and executing collections of compiled
+kernel instances and launching them from host code without template instantiations in client code.
+
+The host-side launch API is designed to be analogous to BLAS implementations for convenience, though its 
+kernel selection procedure is intended only to be functionally sufficient. It may not launch the 
+optimal tile size for a given problem. It chooses the first available kernel whose data types, 
+layouts, and alignment constraints satisfy the given problem. The kernel instances, and a data
+structure describing them, are fully exposed to client applications, which may implement their
+own selection logic.
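
The "first available kernel" rule described above can be pictured with a minimal sketch. It is not the CUTLASS Library's implementation: `KernelDescription`, `Problem`, `select_first`, and both enums are hypothetical stand-ins invented for illustration, and the sketch assumes alignment is expressed in elements and checked against the leading dimensions only.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the library's enumerated types.
enum class NumericType { kF16, kF32, kS32 };
enum class Layout { kColumnMajor, kRowMajor };

// Hypothetical description of one compiled kernel instance.
struct KernelDescription {
  std::string name;
  NumericType element;  // data type of the A and B operands
  Layout layout_a;
  Layout layout_b;
  int alignment;        // required multiple for lda/ldb, in elements (must be > 0)
};

// Hypothetical description of the problem to be solved.
struct Problem {
  NumericType element;
  Layout layout_a;
  Layout layout_b;
  int lda;
  int ldb;
};

// Returns the first kernel whose data type, layouts, and alignment
// constraint satisfy the problem, or nullptr if none does.
// No attempt is made to rank candidates by expected performance.
inline KernelDescription const *select_first(
    std::vector<KernelDescription> const &kernels, Problem const &p) {
  for (auto const &k : kernels) {
    bool types_ok = (k.element == p.element);
    bool layouts_ok = (k.layout_a == p.layout_a) && (k.layout_b == p.layout_b);
    bool aligned = (p.lda % k.alignment == 0) && (p.ldb % k.alignment == 0);
    if (types_ok && layouts_ok && aligned) {
      return &k;
    }
  }
  return nullptr;
}
```

Because the scan stops at the first match, the order of the kernel list decides the result. A client that wants a different policy, for example preferring a particular tile size, would replace the early return with its own ranking.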
+
+[cuBLAS](https://developer.nvidia.com/cublas) offers the best performance and functional coverage
+for dense matrix computations on NVIDIA GPUs.
+
+The CUTLASS Library is used by the CUTLASS Profiler to manage kernel instances, and it is also used
+by several SDK examples.
+
+* [10_planar_complex](/examples/10_planar_complex/planar_complex.cu)
+* [11_planar_complex_array](/examples/11_planar_complex_array/planar_complex_array.cu)
+
+The CUTLASS Library defines enumerated types describing numeric data types, matrix and tensor
+layouts, math operation classes, complex transformations, and more. 
+
+Client applications should specify [`tools/library/include`](/tools/library/include) in their
+include paths and link against `libcutlass_lib.so`.
+
+The CUTLASS SDK example [10_planar_complex](/examples/10_planar_complex/CMakeLists.txt) specifies 
+its dependency on the CUTLASS Library with the following CMake command.
+```
+target_link_libraries(
+  10_planar_complex
+  PRIVATE
+  cutlass_lib
+  cutlass_tools_util_includes
+)
+```
+
+A sample kernel launch from host-side C++ is shown as follows.
+
+```c++
+#include "cutlass/library/library.h"
+#include "cutlass/library/handle.h"
+#include "cutlass/util/host_tensor.h"   // cutlass::HostTensor, used below
+
+int main() {
+
+  //
+  // Define the problem size
+  //
+  int M = 512;
+  int N = 256;
+  int K = 128;
+
+  float alpha = 1.25f;
+  float beta = -1.25f;
+
+  //
+  // Allocate device memory
+  //
+
+  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> A({M, K});
+  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> B({K, N});
+  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> C({M, N});
+
+  float const *ptrA = A.device_data();
+  float const *ptrB = B.device_data();
+  float const *ptrC = C.device_data();
+  float       *ptrD = C.device_data();   // D aliases C: the result is written in place
+
+  int lda = A.device_ref().stride(0);
+  int ldb = B.device_ref().stride(0);
+  int ldc = C.device_ref().stride(0);
+  int ldd = ldc;                         // D shares C's storage, so it shares C's leading dimension
+
+  //
+  // CUTLASS Library call to execute device GEMM
+  //
+  
+  cutlass::library::Handle handle;
+
+  //
+  // Launch GEMM on CUDA device.
+  //
+
+  cutlass::Status status = handle.gemm(
+    M,
+    N,
+    K,
+
+    cutlass::library::NumericTypeID::kF32,          // data type of internal accumulation
+    cutlass::library::NumericTypeID::kF32,          // data type of alpha/beta scalars
+
+    &alpha,                                         // pointer to alpha scalar
+
+    cutlass::library::NumericTypeID::kF32,          // data type of A matrix
+    cutlass::library::LayoutTypeID::kColumnMajor,   // layout of A matrix
+    ptrA,                                           // pointer to A matrix in device memory
+    lda,                                            // leading dimension of A matrix
+
+    cutlass::library::NumericTypeID::kF32,          // data type of B matrix
+    cutlass::library::LayoutTypeID::kColumnMajor,   // layout of B matrix
+    ptrB,                                           // pointer to B matrix in device memory
+    ldb,                                            // leading dimension of B matrix
+
+    &beta,                                          // pointer to beta scalar
+
+    cutlass::library::NumericTypeID::kF32,          // data type of C and D matrices
+
+    ptrC,                                           // pointer to C matrix in device memory
+    ldc,                                            // leading dimension of C matrix
+
+    ptrD,                                           // pointer to D matrix in device memory
+    ldd                                             // leading dimension of D matrix
+  );
+  
+  if (status != cutlass::Status::kSuccess) {
+    return -1;
+  }
+
+  return 0;
+}
+```
+
+Kernels can be selectively included in the CUTLASS Library by specifying filter strings when
+executing CMake. For example, only single-precision GEMM kernels can be instantiated as follows.
+
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
+```
+
+Compiling only the desired kernels reduces compilation time.
+
+To instantiate kernels of all tile sizes, data types, and alignment constraints, specify 
+`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
+
+Several recipes are defined below for convenience. They may be combined as a comma-delimited list.
+
+**Example.** All kernels for Volta and Turing architectures.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=all
+```
+
+**Example.** All GEMM kernels targeting Turing Tensor Cores.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
+```
+
+**Example.** All GEMM kernels with single-precision accumulation.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=s*gemm
+```
+
+**Example.** All kernels which expect A and B to be column-major.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=gemm*nn
+```
+
+**Example.** All planar complex GEMM variants.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=planar_complex
+```
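
To make the recipes above concrete, here is a minimal sketch of how a comma-delimited wildcard filter could be applied to kernel names. It is an illustration only: the source does not state how the build system interprets filter strings, so the sketch assumes that `*` matches any run of characters and that a pattern may match anywhere within a name. `matches_pattern` and `matches_filter` are hypothetical helpers, the kernel names in the usage note are made up, and the special value `all` is not handled.

```cpp
#include <cstddef>
#include <string>

// Returns true if the pieces of `pattern` separated by '*' occur in `name`
// in order, with anything (or nothing) between them. Assumed semantics only.
inline bool matches_pattern(std::string const &name, std::string const &pattern) {
  std::size_t pos = 0;    // where to resume searching within name
  std::size_t start = 0;  // start of the current piece within pattern
  while (start <= pattern.size()) {
    std::size_t star = pattern.find('*', start);
    if (star == std::string::npos) {
      star = pattern.size();
    }
    std::string piece = pattern.substr(start, star - start);
    if (!piece.empty()) {
      std::size_t found = name.find(piece, pos);
      if (found == std::string::npos) {
        return false;
      }
      pos = found + piece.size();
    }
    start = star + 1;
  }
  return true;
}

// Returns true if `name` matches any entry of the comma-delimited `filter`.
inline bool matches_filter(std::string const &name, std::string const &filter) {
  std::size_t start = 0;
  while (start <= filter.size()) {
    std::size_t comma = filter.find(',', start);
    if (comma == std::string::npos) {
      comma = filter.size();
    }
    if (matches_pattern(name, filter.substr(start, comma - start))) {
      return true;
    }
    start = comma + 1;
  }
  return false;
}
```

Under these assumptions, `matches_filter("cutlass_simt_sgemm_128x128_nn", "gemm*nn")` is true, while the same filter rejects a hypothetical `cutlass_simt_sgemm_128x128_nt`, and `"planar_complex,sgemm"` accepts a name that contains either substring.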
+
+
 # Copyright

-Copyright (c) 2017-2019, NVIDIA CORPORATION.  All rights reserved.
+Copyright (c) 2017-2020, NVIDIA CORPORATION.  All rights reserved.

 ```
  Redistribution and use in source and binary forms, with or without modification, are permitted