CUTLASS 2.0 (#62)

CUTLASS 2.0 is substantially refactored for:
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores:
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through the WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
media/docs/programming_guidelines.md (new file)
@@ -0,0 +1,315 @@
[README](/README.md#documentation) > **Programming Guidelines**

# Programming Guidelines

## Hierarchical Organization
CUTLASS embodies a design paradigm exemplified by the [CUB library](https://nvlabs.github.io/cub/)
for expressing collective operations. Objects expose an interface for a problem that is then decomposed
into concurrent subtasks executed by cooperating threadblocks, warps, and threads. For example, a grid-level
object may be constructed with base pointers to the start of a GEMM operation, add a threadblock-dependent
offset to partition the problem, and then compute a per-threadblock GEMM. This in turn performs some
operations as a collection of cooperating threads, while it may partition other parts of the task into
warp-level subtasks.

Consequently, CUTLASS components are organized first by the computation they perform and then by their
layer in the following hierarchy:

* *device*: an operation is _device-wide_ and may launch one or more kernels on the GPU
* *kernel*: an operation is implemented by a CUDA kernel with definitions for `__shared__` memory and constant memory allocations
* *threadblock*: an operation is collectively executed by a threadblock; any component calling `__syncthreads()` is likely to be threadblock-scope
* *warp*: an operation is collectively executed by a warp; threads within the context of a warp are referred to as _lanes_
* *thread*: an operation is performed by an individual thread with no data sharing or interaction with other threads
* *instruction*: an operation corresponds to an individual hardware or PTX instruction
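As a small illustration of how a grid-level object partitions a problem across threadblocks, the following host-only sketch (with illustrative names; this is not actual CUTLASS code) computes the threadblock-dependent tile offset described above:

```cpp
#include <cassert>

struct Coord { int row, col; };

// Threadblock level of the hierarchy: given a linear block index and the
// number of tiles along the N dimension, compute the base offset of the
// output tile this threadblock owns. A device-wide GEMM would launch one
// threadblock per tile and pass each its index.
Coord threadblock_tile_offset(int block_idx, int tiles_n,
                              int tile_m, int tile_n) {
  return { (block_idx / tiles_n) * tile_m,    // row of this block's tile
           (block_idx % tiles_n) * tile_n };  // column of this block's tile
}
```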

## Design Patterns

CUTLASS strives to achieve the highest performance possible on NVIDIA GPUs while also offering a
flexible composition that can be easily applied to solve new problems related to deep learning and
linear algebra. Though we intend to make CUTLASS as simple and straightforward as possible, given
a tradeoff between simplicity and performance, CUTLASS chooses performance. Consequently, several
design patterns are necessary to yield a composable structure while also satisfying these performance
objectives. This section describes these patterns in more detail.

### Templates
CUDA C++ templates and modern generic programming techniques enable CUTLASS device code to span a large design space.

This design space includes:
* Mixed precision arithmetic and data storage
* Kernels specialized for layout and problem size
* Support for kernel fusion

Moreover, templates provide a structured approach to collecting compile-time constants such as tile dimensions. These
must be template arguments to target static array allocation and take advantage of loop unrolling, constant folding,
and function inlining.
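For instance, a hypothetical sketch (illustrative names, not a real CUTLASS class) of tile dimensions as template arguments: because `kM` and `kN` are compile-time constants, the tile can be a static array, and loops over it have constant trip counts the compiler can fully unroll and constant-fold.

```cpp
#include <cassert>

template <int kM, int kN>
struct ThreadTile {
  float storage[kM * kN];  // static array allocation sized at compile time

  void fill(float value) {
    // Trip count is known at compile time; in device code this loop would
    // be annotated with CUTLASS_PRAGMA_UNROLL.
    for (int i = 0; i < kM * kN; ++i) {
      storage[i] = value;
    }
  }
};
```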

### Constant Memory

Several CUTLASS template classes exhibit a pattern in which problem-specific internal state is known at kernel
launch time and remains invariant throughout the execution of the kernel. For example, tile iterators compute several
offsets based on the strides of the input tensor that are added to an internal pointer when loading the elements
of a tile. These offsets are computed from the tensor stride and never updated; the per-thread internal state consists
only of the internal global memory pointer.

CUTLASS takes advantage of this grid-invariant property by constructing the object in host code and passing
a composed parameters structure to the kernel. This confers two benefits: (1.) invariant state is held in constant
memory, and (2.) each thread avoids the overhead of computing the initial state itself.

The design pattern in CUTLASS is for classes with nontrivial constructors to define `struct Params` as an inner class
which contains grid-invariant state. These should define a constructor and an `initialize()` method. The `Params`
structure should also include a data member corresponding to each data member in the parent class, so these too can
be properly constructed in host code. The parent class should define a constructor which accepts `Params const &` as
its first argument.
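A hypothetical sketch of this pattern (illustrative names and offsets, not a real CUTLASS iterator); on the device, the `Params` object would be constructed on the host and passed to the kernel so its contents can reside in constant memory, while this stand-in is host-compilable plain C++:

```cpp
#include <cassert>
#include <cstdint>

class TileIterator {
 public:
  // Inner Params class: grid-invariant state computed once on the host.
  struct Params {
    int64_t stride;     // tensor stride in elements
    int64_t increment;  // precomputed pointer advance per tile

    Params() : stride(0), increment(0) {}
    Params(int64_t stride_) { initialize(stride_); }

    int initialize(int64_t stride_) {
      stride = stride_;
      increment = stride_ * 8;  // assumed tile height of 8 rows
      return 0;
    }
  };

  // The parent class accepts `Params const &` as its first argument.
  TileIterator(Params const &params, float const *pointer)
      : params_(params), pointer_(pointer) {}

  // Per-thread mutable state is only the pointer; offsets come from Params.
  void operator++() { pointer_ += params_.increment; }

  float const *get() const { return pointer_; }

 private:
  Params params_;
  float const *pointer_;
};
```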

### Composable Shared Memory

Shared memory requires explicit effort by the programmer to allocate and de-allocate. CUTLASS follows the paradigm
introduced by [CUB](https://nvlabs.github.io/cub/) to define composed structures for storing data intended to be held
in shared memory. Any object requiring shared memory storage for itself or its data members should define a child
structure called `SharedStorage`. This holds data needed by the class and also instantiates `SharedStorage`
objects for each data member.

To be consistent, this pattern defines a convention in which classes define their internal shared memory storage requirements.
Classes should consider all `SharedStorage` structures other than their own to be opaque. When the lifetimes
of child objects are known to be non-overlapping, unions may be used to alias multiple `SharedStorage` objects to the same
shared memory region and reduce overall SMEM usage.
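A hypothetical sketch of composed `SharedStorage` structures (illustrative names and tile sizes, not real CUTLASS classes); on the device these would be instantiated in `__shared__` memory by the kernel, while here they are plain host structs:

```cpp
#include <cstddef>

struct MmaPipelined {
  struct SharedStorage {
    float operand_A[128 * 8];  // threadblock tile of A
    float operand_B[8 * 128];  // threadblock tile of B
  };
};

struct Epilogue {
  struct SharedStorage {
    float accumulators[128 * 128 / 4];  // staging for the epilogue
  };
};

// The kernel-level class composes its members' SharedStorage. Because the
// mainloop and epilogue lifetimes do not overlap, a union aliases both to
// the same shared memory region.
struct GemmKernel {
  union SharedStorage {
    MmaPipelined::SharedStorage main_loop;
    Epilogue::SharedStorage epilogue;
  };
};
```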

### Loop Unrolling

CUTLASS requires tiles of data to be stored in registers for high-bandwidth access. Simultaneously, high-throughput math instructions
must be issued concurrently with memory instructions to hide latency with relatively few concurrent threads. These objectives are
achieved by unrolling loops whose iteration counts are known at compile time.

Consequently, most loops within the CUTLASS GEMM implementation are specified by constant values and template arguments. The CUDA compiler
is able to unroll the loop bodies, map array elements to registers, and construct an efficient instruction schedule.

All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
to unroll them.

```c++
int const kN = 8;
Array<float, kN> x;                   // Array we would like to store in registers

CUTLASS_PRAGMA_UNROLL                 // Directs the CUDA compiler to unroll this loop.
for (int idx = 0; idx < kN; ++idx) {  // Loop has constant number of iterations

  x[idx] = float(idx);                // Indexing by the induction variable results in direct register access
}
```
## Style

### CUDA Built-in Variables

Avoid direct access to the CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within
CUTLASS components except in special circumstances.

Using built-in 'global' variables directly within reusable components requires that all components
use them consistently, which may not be possible if CUTLASS components are used in other contexts.

Instead, components should accept a linear ID identifying threads, warps, and threadblocks from calling
code. The top-level kernel may then decide how to map threads, warps, and blocks to the problem it is
solving.
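A hypothetical sketch of this convention (illustrative names, not real CUTLASS components): the reusable component derives warp and lane indices from a linear thread ID supplied by calling code instead of reading `threadIdx` directly, and only the top-level kernel decides the mapping, e.g. `int linear_tid = threadIdx.y * blockDim.x + threadIdx.x;`.

```cpp
struct ThreadMap {
  static int const kWarpSize = 32;

  int warp_idx;  // index of this thread's warp within the threadblock
  int lane_idx;  // index of this thread within its warp

  // The caller (ultimately the top-level kernel) supplies linear_tid;
  // this component never touches CUDA built-in variables itself.
  explicit ThreadMap(int linear_tid)
      : warp_idx(linear_tid / kWarpSize), lane_idx(linear_tid % kWarpSize) {}
};
```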

### Use CUTLASS Fundamental Types

Use the [fundamental types](fundamental_types.md) defined in CUTLASS consistently. Doing so contributes
to a framework of interoperable, consistent components.

In particular, be sure to use:

* [Numeric types](fundamental_types.md#numeric-types) to represent numeric data in host and device code
* [Containers](fundamental_types.md#containers) to store data in register-backed arrays
* [functional.h](fundamental_types.md#functional) to perform numeric operations in generic code
* [Layouts](layout.md) to store strides and to partially specialize template classes
* [`TensorRef` and `TensorView`](layout.md#tensorref) to pass pointers and layout objects

Avoid defining alternative implementations of the same functionality. Instead, prefer to enhance
or extend existing components where it makes sense.

### C++ Style

CUTLASS source code follows the
[Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html) with exceptions and extensions.

Design choices should be consistent with the
[CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md) recommendations by Stroustrup and Sutter.

### Classes and Structs

Type names use `CapitalLetters` except when implementations are a _perfect_ drop-in replacement for
Standard Library components.

Follow the [CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-struct)
to decide whether to use `class` or `struct`. Namely,

* use `class` when the object must maintain an invariant. Data members related to the invariant should be private.
* use `struct` when the class has no invariant to maintain, and data members may vary arbitrarily.
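A hypothetical illustration of this rule (names here are illustrative, not CUTLASS's own definitions): a coordinate pair has no invariant, so it is a `struct` with public members, while a `TensorRef`-style wrapper must keep its pointer and stride consistent, so it is a `class` with private members.

```cpp
struct Coord {          // no invariant: plain aggregate with public members
  int row;
  int col;
};

class RowMajorRef {     // invariant: at(r, c) must honor the stored mapping
 public:
  RowMajorRef(float *ptr, int stride) : ptr_(ptr), stride_(stride) {}

  float &at(int row, int col) const { return ptr_[row * stride_ + col]; }

 private:
  float *ptr_;          // private members carry the invariant, suffix `_`
  int stride_;
};
```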

### Class Members

Methods and members are written using `snake_case`.

Private data and function members have suffix `_`.

### Constant names

CUTLASS makes extensive use of constants and compile-time evaluation. Constant variable names should have
the prefix `k` and use mixed case. True compile-time constants should be defined as `constexpr` to enable
dependent `constexpr` functions.

CUTLASS uses the ["East const"](http://slashslash.info/2018/02/a-foolish-consistency/) style, placing the `constexpr` keyword
after the type name.

```c++
float constexpr kPi = 3.14159f;
```
### Class Member Order

Members within classes and structures should be organized as follows:

1. Type and constant definitions
2. Data members
3. Constructors
4. Other methods

This convention follows the [CUB library](https://nvlabs.github.io/cub/),
and it also approximates the usual order of systems and controls textbooks. That is, they start by
(1.) identifying relevant constants, then (2.) define a state-space representation of the dynamical system
under study (i.e. the data members), and (3.) devote subsequent chapters to defining the dynamical behavior
of the system (i.e. the methods).

_Example_:
```c++
class A {
public:
  // Type definitions
protected:
  // protected type definitions
private:
  // private type definitions

public:
  // Data members
protected:
  // protected data members
private:
  // private data members

public:
  // Methods
protected:
  // protected methods
private:
  // private methods

};
```

### File Names

Files should be named using `snake_case` with extension `.h` for header files, `.cu` for CUDA sources,
and `.cpp` for C++ host-only source files.
### Use scoped enums

Use scoped enums, added in C++11, for enumerated types. Use capital letters for the enumerated type name
and the prefix `k` for enumerators, as with other constants.

```c++
enum class MatrixOperation {
  kNone,
  kTranspose,
  kConjugate,
  kHermitian
};
```
### Namespaces

Namespaces are all lower case. The top-level namespace is `cutlass::`. The second nested namespace refers
to the general category of operation performed by its members, and the third nested namespace refers to
the CUDA execution model scope (if applicable).

The bodies of namespace definitions should not be indented, and comments on the closing brace are welcome.

```c++
namespace cutlass {
namespace gemm {
namespace warp {

struct MmaTensorCore {

};

} // namespace warp
} // namespace gemm
} // namespace cutlass
```

### Macros

Avoid defining macros except where preprocessing is obligatory. In particular,
avoid using macros for constants.

Several existing macros defined in `cutlass/cutlass.h` are useful for working around compiler-dependent
behavior.

Annotations for device code:
* `CUTLASS_HOST_DEVICE` for functions running on the host and the device
* `CUTLASS_DEVICE` for functions running on the device only

Loop unrolling:
* `CUTLASS_PRAGMA_UNROLL` for full unrolling of loops with constant trip counts
* `CUTLASS_PRAGMA_NO_UNROLL` to prevent unrolling
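As a rough illustration of how such annotation macros are typically structured (this is an approximation with placeholder `MY_`-prefixed names, not the verbatim contents of `cutlass/cutlass.h`): under nvcc they expand to CUDA qualifiers or pragmas, and they collapse to portable fallbacks when compiling host-only code.

```cpp
#if defined(__CUDACC__)
#define MY_HOST_DEVICE __forceinline__ __host__ __device__
#define MY_PRAGMA_UNROLL _Pragma("unroll")
#else
#define MY_HOST_DEVICE inline       // host-only fallback
#define MY_PRAGMA_UNROLL            // expands to nothing on the host
#endif

// The same source now compiles under nvcc and a host-only compiler.
MY_HOST_DEVICE
float scale(float x) { return 2.0f * x; }
```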

### #pragma once

Use `#pragma once` to guard all headers.

```c++
/*!

*/

#pragma once

...
```
### Source Line Length

Avoid lines longer than 100 characters. These typically wrap unfavorably when viewed in
GitHub's pretty printer.
# Copyright

Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved.

```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright notice, this list of
      conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.
    * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
      to endorse or promote products derived from this software without specific prior written
      permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```