Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Committed by GitHub · parent c4f6b8c6bc · commit 7e370c9637
@@ -347,7 +347,7 @@ creating GEMM-B tile in shared memory.

 The improvements covered by optimized iterators are:

 - (a) Precomputing kernel-invariant pointer deltas on the host
 - (b) Computing cta-invariant mask predicates on device-side iterator ctors
-- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimenstions to convolution tensors.
+- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.

 For example, _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ
 for activation iterator
||||
@@ -587,7 +587,8 @@ To instantiate all operations supporting all tile sizes, data types, and alignme

 ```bash
 $ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
 ```

-The above command line generates about twenty thousand kernels targetting NVIDIA Ampere, Turing, and Volta architectures.
+The above command line generates about twenty thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
 Compiling thousands of kernels for three different architectures is time consuming. Additionally, this would also result
 in a large binary size and, on some platforms, cause the linker to fail on building the library.
||||
@@ -641,13 +642,13 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop,s1681

 $ cmake .. -DCUTLASS_NVCC_ARCHS='50;60;61;70;75;80' -DCUTLASS_LIBRARY_KERNELS=sfprop
 ```

-**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targetting NVIDIA Ampere's 16816 Tensor Core operation
+**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere's 16816 Tensor Core operation

 ```bash
 $ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
 ```

 **Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
-targetting NVIDIA Ampere, Turing, and Volta Tensor Core operations
+targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations

 ```bash
 $ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
 ```