CUTLASS 2.0 (#62)

CUTLASS 2.0 is substantially refactored for:
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores:
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through the WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
media/docs/programming_guidelines.md (new file)
@@ -0,0 +1,315 @@
[README](/README.md#documentation) > **Programming Guidelines**

# Programming Guidelines

## Hierarchical Organization
CUTLASS embodies a design paradigm exemplified by the [CUB library](https://nvlabs.github.io/cub/)
for expressing collective operations. Objects expose an interface for a problem that is then decomposed
into concurrent subtasks executed by cooperating threadblocks, warps, and threads. For example, a grid-level
object may be constructed with base pointers to the start of a GEMM operation, add a threadblock-dependent
offset to partition the problem, and then compute a per-threadblock GEMM. This in turn performs some
operations as a collection of cooperating threads, while it may partition other parts of the task into
warp-level subtasks.

Consequently, CUTLASS components are organized first by the computation they perform and then by their
layer in the following hierarchy:

* *device*: an operation is _device-wide_ and may launch one or more kernels on the GPU
* *kernel*: an operation is implemented by a CUDA kernel with definitions for `__shared__` memory and constant memory allocations
* *threadblock*: an operation is collectively executed by a threadblock; any component calling `__syncthreads()` is likely to be threadblock-scope
* *warp*: an operation is collectively executed by a warp; threads within the context of a warp are referred to as _lanes_
* *thread*: an operation is performed by an individual thread with no data sharing or interaction with other threads
* *instruction*: an operation corresponds to an individual hardware or PTX instruction
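As a small illustration of how a grid-level object partitions a problem across threadblocks, the following host-only sketch (with illustrative names; this is not actual CUTLASS code) computes the threadblock-dependent tile offset described above:

```cpp
#include <cassert>

struct Coord { int row, col; };

// Threadblock level of the hierarchy: given a linear block index and the
// number of tiles along the N dimension, compute the base offset of the
// output tile this threadblock owns. A device-wide GEMM would launch one
// threadblock per tile and pass each its index.
Coord threadblock_tile_offset(int block_idx, int tiles_n,
                              int tile_m, int tile_n) {
  return { (block_idx / tiles_n) * tile_m,    // row of this block's tile
           (block_idx % tiles_n) * tile_n };  // column of this block's tile
}
```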

## Design Patterns

CUTLASS strives to achieve the highest performance possible on NVIDIA GPUs while also offering a
flexible composition that can be easily applied to solve new problems related to deep learning and
linear algebra. Though we intend to make CUTLASS as simple and straightforward as possible, given
a tradeoff between simplicity and performance, CUTLASS chooses performance. Consequently, several
design patterns are necessary to yield a composable structure while also satisfying these performance
objectives. This section describes these patterns in more detail.

### Templates
CUDA C++ templates and modern generic programming techniques enable CUTLASS device code to span a large design space.

This design space includes:
* Mixed precision arithmetic and data storage
* Kernels specialized for layout and problem size
* Support for kernel fusion

Moreover, templates provide a structured approach to collecting compile-time constants such as tile dimensions. These
must be template arguments to target static array allocation and take advantage of loop unrolling, constant folding,
and function inlining.
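For instance, a hypothetical sketch (illustrative names, not a real CUTLASS class) of tile dimensions as template arguments: because `kM` and `kN` are compile-time constants, the tile can be a static array, and loops over it have constant trip counts the compiler can fully unroll and constant-fold.

```cpp
#include <cassert>

template <int kM, int kN>
struct ThreadTile {
  float storage[kM * kN];  // static array allocation sized at compile time

  void fill(float value) {
    // Trip count is known at compile time; in device code this loop would
    // be annotated with CUTLASS_PRAGMA_UNROLL.
    for (int i = 0; i < kM * kN; ++i) {
      storage[i] = value;
    }
  }
};
```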

### Constant Memory

Several CUTLASS template classes exhibit a pattern in which problem-specific internal state is known at kernel
launch time and remains invariant throughout the execution of the kernel. For example, tile iterators compute several
offsets based on the strides of the input tensor that are added to an internal pointer when loading the elements
of a tile. These offsets are computed from the tensor stride and never updated; the per-thread internal state consists
only of the internal global memory pointer.

CUTLASS takes advantage of this grid-invariant property by constructing the object in host code and passing
a composed parameters structure to the kernel. This confers two benefits: (1.) invariant state is held in constant
memory, and (2.) each thread avoids the overhead of computing the initial state itself.

The design pattern in CUTLASS is for classes with nontrivial constructors to define `struct Params` as an inner class
which contains grid-invariant state. These should define a constructor and an `initialize()` method. The `Params`
structure should also include a data member corresponding to each data member in the parent class, so these too can
be properly constructed in host code. The parent class should define a constructor which accepts `Params const &` as
its first argument.
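A hypothetical sketch of this pattern (illustrative names and offsets, not a real CUTLASS iterator); on the device, the `Params` object would be constructed on the host and passed to the kernel so its contents can reside in constant memory, while this stand-in is host-compilable plain C++:

```cpp
#include <cassert>
#include <cstdint>

class TileIterator {
 public:
  // Inner Params class: grid-invariant state computed once on the host.
  struct Params {
    int64_t stride;     // tensor stride in elements
    int64_t increment;  // precomputed pointer advance per tile

    Params() : stride(0), increment(0) {}
    Params(int64_t stride_) { initialize(stride_); }

    int initialize(int64_t stride_) {
      stride = stride_;
      increment = stride_ * 8;  // assumed tile height of 8 rows
      return 0;
    }
  };

  // The parent class accepts `Params const &` as its first argument.
  TileIterator(Params const &params, float const *pointer)
      : params_(params), pointer_(pointer) {}

  // Per-thread mutable state is only the pointer; offsets come from Params.
  void operator++() { pointer_ += params_.increment; }

  float const *get() const { return pointer_; }

 private:
  Params params_;
  float const *pointer_;
};
```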

### Composable Shared Memory

Shared memory requires explicit effort by the programmer to allocate and de-allocate. CUTLASS follows the paradigm
introduced by [CUB](https://nvlabs.github.io/cub/) to define composed structures for storing data intended to be held
in shared memory. Any object requiring shared memory storage for itself or its data members should define a child
structure called `SharedStorage`. This holds data needed by the class and also instantiates `SharedStorage`
objects for each data member.

To be consistent, this pattern defines a convention in which classes define their internal shared memory storage requirements.
Classes should consider all `SharedStorage` structures other than their own to be opaque. When the lifetimes
of child objects are known to be non-overlapping, unions may be used to alias multiple `SharedStorage` objects to the same
shared memory region and reduce overall SMEM usage.
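A hypothetical sketch of composed `SharedStorage` structures (illustrative names and tile sizes, not real CUTLASS classes); on the device these would be instantiated in `__shared__` memory by the kernel, while here they are plain host structs:

```cpp
#include <cstddef>

struct MmaPipelined {
  struct SharedStorage {
    float operand_A[128 * 8];  // threadblock tile of A
    float operand_B[8 * 128];  // threadblock tile of B
  };
};

struct Epilogue {
  struct SharedStorage {
    float accumulators[128 * 128 / 4];  // staging for the epilogue
  };
};

// The kernel-level class composes its members' SharedStorage. Because the
// mainloop and epilogue lifetimes do not overlap, a union aliases both to
// the same shared memory region.
struct GemmKernel {
  union SharedStorage {
    MmaPipelined::SharedStorage main_loop;
    Epilogue::SharedStorage epilogue;
  };
};
```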

### Loop Unrolling

CUTLASS requires tiles of data to be stored in registers for high-bandwidth access. Simultaneously, high-throughput math instructions
must be issued concurrently with memory instructions to hide latency with relatively few concurrent threads. These objectives are
achieved by unrolling loops whose iteration counts are known at compile time.

Consequently, most loops within the CUTLASS GEMM implementation are specified by constant values and template arguments. The CUDA compiler
is able to unroll the loop bodies, map array elements to registers, and construct an efficient instruction schedule.

All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
to unroll them.

```c++
int const kN = 8;
Array<float, kN> x;                   // Array we would like to store in registers

CUTLASS_PRAGMA_UNROLL                 // Directs the CUDA compiler to unroll this loop.
for (int idx = 0; idx < kN; ++idx) {  // Loop has constant number of iterations

  x[idx] = float(idx);                // Indexing by the induction variable results in direct register access
}
```
## Style

### CUDA Built-in Variables

Avoid direct access to the CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within
CUTLASS components except in special circumstances.

Using built-in 'global' variables directly within reusable components requires that all components
use them consistently, which may not be possible if CUTLASS components are used in other contexts.

Instead, components should accept a linear ID identifying threads, warps, and threadblocks from calling
code. The top-level kernel may then decide how to map threads, warps, and blocks to the problem it is
solving.
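A hypothetical sketch of this convention (illustrative names, not real CUTLASS components): the reusable component derives warp and lane indices from a linear thread ID supplied by calling code instead of reading `threadIdx` directly, and only the top-level kernel decides the mapping, e.g. `int linear_tid = threadIdx.y * blockDim.x + threadIdx.x;`.

```cpp
struct ThreadMap {
  static int const kWarpSize = 32;

  int warp_idx;  // index of this thread's warp within the threadblock
  int lane_idx;  // index of this thread within its warp

  // The caller (ultimately the top-level kernel) supplies linear_tid;
  // this component never touches CUDA built-in variables itself.
  explicit ThreadMap(int linear_tid)
      : warp_idx(linear_tid / kWarpSize), lane_idx(linear_tid % kWarpSize) {}
};
```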

### Use CUTLASS Fundamental Types

Use the [fundamental types](fundamental_types.md) defined in CUTLASS consistently. Doing so contributes
to a framework of interoperable, consistent components.

In particular, be sure to use:

* [Numeric types](fundamental_types.md#numeric-types) to represent numeric data in host and device code
* [Containers](fundamental_types.md#containers) to store data in register-backed arrays
* [functional.h](fundamental_types.md#functional) to perform numeric operations in generic code
* [Layouts](layout.md) to store strides and to partially specialize template classes
* [`TensorRef` and `TensorView`](layout.md#tensorref) to pass pointers and layout objects

Avoid defining alternative implementations of the same functionality. Instead, prefer to enhance
or extend existing components where it makes sense.

### C++ Style

CUTLASS source code follows the
[Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html) with exceptions and extensions.

Design choices should be consistent with the
[CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md) recommendations by Stroustrup and Sutter.

### Classes and Structs

Type names use `CapitalLetters` except when implementations are a _perfect_ drop-in replacement for
Standard Library components.

Follow the [CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-struct)
to decide whether to use `class` or `struct`. Namely,

* use `class` when the object must maintain an invariant. Data members related to the invariant should be private.
* use `struct` when the class has no invariant to maintain, and data members may vary arbitrarily.
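A hypothetical illustration of this rule (names here are illustrative, not CUTLASS's own definitions): a coordinate pair has no invariant, so it is a `struct` with public members, while a `TensorRef`-style wrapper must keep its pointer and stride consistent, so it is a `class` with private members.

```cpp
struct Coord {          // no invariant: plain aggregate with public members
  int row;
  int col;
};

class RowMajorRef {     // invariant: at(r, c) must honor the stored mapping
 public:
  RowMajorRef(float *ptr, int stride) : ptr_(ptr), stride_(stride) {}

  float &at(int row, int col) const { return ptr_[row * stride_ + col]; }

 private:
  float *ptr_;          // private members carry the invariant, suffix `_`
  int stride_;
};
```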

### Class Members

Methods and members are written using `snake_case`.

Private data and function members have suffix `_`.

### Constant names

CUTLASS makes extensive use of constants and compile-time evaluation. Constant variable names should have
the prefix `k` and use mixed case. True compile-time constants should be defined as `constexpr` to enable
dependent `constexpr` functions.

CUTLASS uses the ["East const"](http://slashslash.info/2018/02/a-foolish-consistency/) style, placing the `constexpr` keyword
after the type name.

```c++
float constexpr kPi = 3.14159f;
```
### Class Member Order

Members within classes and structures should be organized as follows:

1. Type and constant definitions
2. Data members
3. Constructors
4. Other methods

This convention follows the [CUB library](https://nvlabs.github.io/cub/),
and it also approximates the usual order of systems and controls textbooks. That is, they start by
(1.) identifying relevant constants, then (2.) define a state-space representation of the dynamical system
under study (i.e. the data members), and (3.) devote subsequent chapters to defining the dynamical behavior
of the system (i.e. the methods).

_Example_:
```c++
class A {
public:
  // Type definitions
protected:
  // protected type definitions
private:
  // private type definitions

public:
  // Data members
protected:
  // protected data members
private:
  // private data members

public:
  // Methods
protected:
  // protected methods
private:
  // private methods

};
```

### File Names

Files should be named using `snake_case` with extension `.h` for header files, `.cu` for CUDA sources,
and `.cpp` for C++ host-only source files.
### Use scoped enums

Use scoped enums, added in C++11, for enumerated types. Use capital letters for the enumerated type name
and the prefix `k` for enumerators, as with other constants.

```c++
enum class MatrixOperation {
  kNone,
  kTranspose,
  kConjugate,
  kHermitian
};
```
### Namespaces

Namespaces are all lower case. The top-level namespace is `cutlass::`. The second nested namespace refers
to the general category of operation performed by its members, and the third nested namespace refers to
the CUDA execution model scope (if applicable).

The bodies of namespace definitions should not be indented, and comments on the closing brace are welcome.

```c++
namespace cutlass {
namespace gemm {
namespace warp {

struct MmaTensorCore {

};

} // namespace warp
} // namespace gemm
} // namespace cutlass
```

### Macros

Avoid defining macros except where preprocessing is obligatory. In particular,
avoid using macros for constants.

Several existing macros defined in `cutlass/cutlass.h` are useful for working around compiler-dependent
behavior.

Annotations for device code:
* `CUTLASS_HOST_DEVICE` for functions running on the host and the device
* `CUTLASS_DEVICE` for functions running on the device only

Loop unrolling:
* `CUTLASS_PRAGMA_UNROLL` for full unrolling of loops with constant trip counts
* `CUTLASS_PRAGMA_NO_UNROLL` to prevent unrolling
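As a rough illustration of how such annotation macros are typically structured (this is an approximation with placeholder `MY_`-prefixed names, not the verbatim contents of `cutlass/cutlass.h`): under nvcc they expand to CUDA qualifiers or pragmas, and they collapse to portable fallbacks when compiling host-only code.

```cpp
#if defined(__CUDACC__)
#define MY_HOST_DEVICE __forceinline__ __host__ __device__
#define MY_PRAGMA_UNROLL _Pragma("unroll")
#else
#define MY_HOST_DEVICE inline       // host-only fallback
#define MY_PRAGMA_UNROLL            // expands to nothing on the host
#endif

// The same source now compiles under nvcc and a host-only compiler.
MY_HOST_DEVICE
float scale(float x) { return 2.0f * x; }
```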

### #pragma once

Use `#pragma once` to guard all headers.

```c++
/*!

*/

#pragma once

...
```
### Source Line Length

Avoid lines longer than 100 characters. These typically wrap unfavorably when viewed in
GitHub's pretty printer.
# Copyright

Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved.

```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright notice, this list of
      conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.
    * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
      to endorse or promote products derived from this software without specific prior written
      permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```