CUTLASS 2.1 (#83)
CUTLASS 2.1 contributes: - BLAS-style host-side API added to CUTLASS Library - Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores - Minor enhancements and bug fixes
This commit is contained in:
@ -28,6 +28,7 @@ Hyperlinks to relevant unit tests demonstrate how specific template instances ma
|
||||
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_s4t_s4n_s32n_tensor_op_s32_sm75.cu) |
|
||||
| **TensorOp** | 75 | 10.2+ | `b1 ^ b1 + s32 => {s32, b1}` | { T } x { N } => {N,T} | [example](/test/unit/gemm/device/gemm_b1t_b1n_s32n_tensor_op_s32_sm75.cu) |
|
||||
|
||||
|
||||
## Warp-level Matrix Multiply with Tensor Cores
|
||||
|
||||
The following table summarizes supported warp level shapes for each TensorOp instruction.
|
||||
|
||||
@ -141,7 +141,7 @@ int main() {
|
||||
}
|
||||
```
|
||||
|
||||
## Launching a GEMM kernel
|
||||
## Launching a GEMM kernel in CUDA
|
||||
|
||||
**Example:** launch a mixed-precision GEMM targeting Volta Tensor Cores.
|
||||
```c++
|
||||
@ -235,9 +235,172 @@ Note, the above could be simplified as follows using helper methods defined in `
|
||||
});
|
||||
```
|
||||
|
||||
# CUTLASS Library
|
||||
|
||||
The [CUTLASS Library](./tools/library) defines an API for managing and executing collections of compiled
|
||||
kernel instances and launching them from host code without template instantiations in client code.
|
||||
|
||||
The host-side launch API is designed to be analogous to BLAS implementations for convenience, though its
|
||||
kernel selection procedure is intended only to be functionally sufficient. It may not launch the
|
||||
optimal tile size for a given problem. It chooses the first available kernel whose data types,
|
||||
layouts, and alignment constraints satisfy the given problem. Kernel instances and a data structure
|
||||
describing them are completely available to client applications which may choose to implement their
|
||||
own selection logic.
|
||||
|
||||
[cuBLAS](https://developer.nvidia.com/cublas) offers the best performance and functional coverage
|
||||
for dense matrix computations on NVIDIA GPUs.
|
||||
|
||||
The CUTLASS Library is used by the CUTLASS Profiler to manage kernel instances, and it is also used
|
||||
by several SDK examples.
|
||||
|
||||
* [10_planar_complex](/examples/10_planar_complex/planar_complex.cu)
|
||||
* [11_planar_complex_array](/examples/11_planar_complex_array/planar_complex_array.cu)
|
||||
|
||||
The CUTLASS Library defines enumerated types describing numeric data types, matrix and tensor
|
||||
layouts, math operation classes, complex transformations, and more.
|
||||
|
||||
Client applications should specify [`tools/library/include`](/tools/library/include) in their
|
||||
include paths and link against libcutlas_lib.so.
|
||||
|
||||
The CUTLASS SDK example [10_planar_complex](/examples/10_planar_complex/CMakeLists.txt) specifies
|
||||
its dependency on the CUTLASS Library with the following CMake command.
|
||||
```
|
||||
target_link_libraries(
|
||||
10_planar_complex
|
||||
PRIVATE
|
||||
cutlass_lib
|
||||
cutlass_tools_util_includes
|
||||
)
|
||||
```
|
||||
|
||||
A sample kernel launch from host-side C++ is shown as follows.
|
||||
|
||||
```c++
|
||||
#include "cutlass/library/library.h"
|
||||
#include "cutlass/library/handle.h"
|
||||
|
||||
int main() {
|
||||
|
||||
//
|
||||
// Define the problem size
|
||||
//
|
||||
int M = 512;
|
||||
int N = 256;
|
||||
int K = 128;
|
||||
|
||||
float alpha = 1.25f;
|
||||
float beta = -1.25f;
|
||||
|
||||
//
|
||||
// Allocate device memory
|
||||
//
|
||||
|
||||
cutlass::HostTensor<float, cutlass::layout::ColumnMajor> A({M, K});
|
||||
cutlass::HostTensor<float, cutlass::layout::ColumnMajor> B({K, N});
|
||||
cutlass::HostTensor<float, cutlass::layout::ColumnMajor> C({M, N});
|
||||
|
||||
float const *ptrA = A.device_data();
|
||||
float const *ptrB = B.device_data();
|
||||
float const *ptrC = C.device_data();
|
||||
float *ptrD = C.device_data();
|
||||
|
||||
int lda = A.device_ref().stride(0);
|
||||
int ldb = B.device_ref().stride(0);
|
||||
int ldc = C.device_ref().stride(0);
|
||||
int ldd = D.device_ref().stride(0);
|
||||
|
||||
//
|
||||
// CUTLASS Library call to execute device GEMM
|
||||
//
|
||||
|
||||
cutlass::library::Handle handle;
|
||||
|
||||
//
|
||||
// Launch GEMM on CUDA device.
|
||||
//
|
||||
|
||||
cutlass::Status status = handle.gemm(
|
||||
M,
|
||||
N,
|
||||
K,
|
||||
|
||||
cutlass::library::NumericTypeID::kF32, // data type of internal accumulation
|
||||
cutlass::library::NumericTypeID::kF32, // data type of alpha/beta scalars
|
||||
|
||||
&alpha, // pointer to alpha scalar
|
||||
|
||||
cutlass::library::NumericTypeID::kF32, // data type of A matrix
|
||||
cutlass::library::LayoutTypeID::kColumnMajor, // layout of A matrix
|
||||
ptrA, // pointer to A matrix in device memory
|
||||
lda, // leading dimension of A matrix
|
||||
|
||||
cutlass::library::NumericTypeID::kF32, // data type of B matrix
|
||||
cutlass::library::LayoutTypeID::kColumnMajor, // layout of B matrix
|
||||
ptrB, // pointer to B matrix in device memory
|
||||
ldb, // leading dimension of B matrix
|
||||
|
||||
&beta, // pointer to beta scalar
|
||||
|
||||
cutlass::library::NumericTypeID::kF32, // data type of C and D matrix
|
||||
|
||||
ptrC, // pointer to C matrix in device memory
|
||||
ldc, // leading dimension fo C matrix
|
||||
|
||||
ptrD, // pointer to D matrix in device memory
|
||||
ldd // leading dimension of D matrix
|
||||
);
|
||||
|
||||
if (status != cutlass::Status::kSuccess) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
Kernels can be selectively included in the CUTLASS Library by specifying filter strings when
|
||||
executing CMake. For example, only single-precision GEMM kernels can be instantiated as follows.
|
||||
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
|
||||
```
|
||||
|
||||
Compling only the kernels desired reduces compilation time.
|
||||
|
||||
To instantiate kernels of all tile sizes, data types, and alignment constraints, specify
|
||||
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
|
||||
|
||||
Several recipes are defined below for convenience. They may be combined as a comma-delimited list.
|
||||
|
||||
**Example.** All kernels for Volta and Turing architectures.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=all
|
||||
```
|
||||
|
||||
**Example.** All GEMM kernels targeting Turing Tensor Cores.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
|
||||
```
|
||||
|
||||
**Example.** All GEMM kernels with single-precision accumulation.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=s*gemm
|
||||
```
|
||||
|
||||
**Example.** All kernels which expect A and B to be column-major.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=gemm*nn
|
||||
```
|
||||
|
||||
**Example.** All planar complex GEMM variants.
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75" -DCUTLASS_LIBRARY_KERNELS=planar_complex
|
||||
```
|
||||
|
||||
|
||||
# Copyright
|
||||
|
||||
Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved.
|
||||
Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
|
||||
|
||||
```
|
||||
Redistribution and use in source and binary forms, with or without modification, are permitted
|
||||
|
||||
Reference in New Issue
Block a user