examples/40_cutlass_py/README.md
# CUTLASS Python Interface Example

## Using Docker

You can run PyCUTLASS in the NGC PyTorch container:

```shell
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.08-py3
```

PyCUTLASS also depends on the Boost C++ library, which can be installed with

```bash
apt-get update
apt-get -y install libboost-all-dev
```
## Install the Python Interface

The source code of the Python interface is located at `tools/library/scripts/pycutlass`. It requires two environment variables:

* `CUTLASS_PATH`: the root directory of CUTLASS
* `CUDA_INSTALL_PATH`: the directory where the CUDA Toolkit is installed

After setting these two environment variables, PyCUTLASS can be installed with

```shell
cd $CUTLASS_PATH/tools/library/scripts/pycutlass && bash build.sh
```
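For example, assuming CUTLASS is checked out under `$HOME/cutlass` and the CUDA Toolkit lives in the default location (both paths are illustrative; adjust them for your system):

```shell
export CUTLASS_PATH=$HOME/cutlass        # root of the CUTLASS checkout (assumed path)
export CUDA_INSTALL_PATH=/usr/local/cuda # CUDA Toolkit install directory (assumed path)
```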
***

## Troubleshooting

### Issue 1: permission denied

Building PyCUTLASS installs dependencies into your Python environment, so a conda environment could be an option if you do not have write permission to the system Python.

### Issue 2: rmm: module not found

PyCUTLASS manages device memory with [RMM](https://github.com/rapidsai/rmm). Our `build.sh` automatically pulls [rmm branch-22.08](https://github.com/rapidsai/rmm/tree/branch-22.08) from GitHub and builds it from source. RMM is located at `$CUTLASS_PATH/tools/library/scripts/pycutlass/rmm` and requires `cmake > 3.20.1`. If the build fails, it can be fixed manually with the following steps:

```shell
cd $CUTLASS_PATH/tools/library/scripts/pycutlass/rmm && ./build.sh librmm rmm

cd $CUTLASS_PATH/tools/library/scripts/pycutlass/rmm/python
python setup.py build_ext --inplace
python setup.py install
```
To test whether rmm is successfully installed, try `import rmm`. For other issues related to rmm, please check https://github.com/rapidsai/rmm/issues.
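A minimal way to script this check (pure standard library; it works for `rmm` or any other module name, without actually importing the module):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if a top-level module `name` is importable in this environment."""
    return importlib.util.find_spec(name) is not None

# Report whether rmm is importable
print("rmm installed:", module_available("rmm"))
```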
***

For all the tests, add `--print_cuda` to print the underlying CUDA kernel. Use `-h` or `--help` to display the help message.
## GEMM Examples

The GEMM examples use numpy to create input tensors and verify the results.
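The verification these examples perform reduces to a numpy reference of `D = alpha * A @ B + beta * C` (a sketch of the idea, not the exact code in `gemm.py`):

```python
import numpy as np

def gemm_reference(A, B, C, alpha=1.0, beta=0.0):
    """Numpy reference for D = alpha * A @ B + beta * C."""
    return alpha * (A @ B) + beta * C

# A small float64 problem; the layouts in the CLI flags only change how the
# same mathematical operands are stored, not this reference result.
rng = np.random.default_rng(0)
A = rng.standard_normal((512, 128))
B = rng.standard_normal((128, 256))
C = rng.standard_normal((512, 256))
D = gemm_reference(A, B, C, alpha=1.0, beta=0.5)
```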
### GEMM F64 Example

Example 1: SM80_Device_Gemm_f64t_f64n_f64n_tensor_op_f64_32x32x16_16x16x16

```shell
python gemm.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 32 32 16 -s 4 -w 2 2 1 -cc 80 -la ColumnMajor -aa 1 -lb RowMajor -ab 1 -lc RowMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1
```

Example 2: SM80_Device_Gemm_f64n_f64t_f64n_tensor_op_f64_64x64x16_32x32x16, split_k(2)_serial

```shell
python gemm.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 64 64 16 -s 4 -w 2 2 1 -cc 80 -la RowMajor -aa 1 -lb ColumnMajor -ab 1 -lc RowMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 2
```
### GEMM F32 Example

Example 1: SM80_Device_Gemm_f32n_f32t_f32n_tensor_op_bf16_f32_128x128x32_64x64x32

```shell
python gemm.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add_fast_bf16 -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la RowMajor -aa 4 -lb ColumnMajor -ab 4 -lc RowMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1
```

Example 2: SM80_Device_Gemm_f32t_f32t_f32n_tensor_op_f32_128x128x32_64x64x32, split_k(2)_parallel

```shell
python gemm.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 4 -lb ColumnMajor -ab 4 -lc RowMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm GemmSplitKParallel -k 2
```

Example 3: SM80_Device_Gemm_f32t_f32t_f32n_tensor_op_fast_accurate_f32_64x64x32_32x32x32, split_k(4)_serial

```shell
python gemm.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add_fast_f32 -op TensorOp -b 64 64 32 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 4 -lb ColumnMajor -ab 4 -lc RowMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 4
```
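Several examples above use split-K (`-gm GemmSplitKParallel` / `-k`): the K dimension is partitioned into slices, each slice computes a partial product, and the partials are reduced. A numpy sketch of the idea:

```python
import numpy as np

def split_k_gemm(A, B, slices):
    """Compute A @ B by splitting the K dimension into `slices` partial GEMMs."""
    k = A.shape[1]
    bounds = np.linspace(0, k, slices + 1, dtype=int)
    partials = [A[:, s:e] @ B[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    # Serial split-K accumulates in place; parallel split-K runs the partial
    # GEMMs independently and performs this reduction in a second kernel.
    return np.sum(partials, axis=0)
```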
### GEMM F16 Example

Example 1: SM80_Device_Gemm_f32t_f32n_f32t_tensor_op_bf16_f32_128x128x32_64x64x32

```shell
python gemm.py -i 16 8 16 -ta float16 -tb float16 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 8 -lb RowMajor -ab 8 -lc ColumnMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle4 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1
```

Example 2: SM80_Device_Gemm_f16t_f16t_f16n_tensor_op_f32_128x128x64_64x64x64, split_k(2)_serial

```shell
python gemm.py -i 16 8 16 -ta float16 -tb float16 -tc float16 -tacc float32 -m multiply_add -op TensorOp -b 128 128 64 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 8 -lb ColumnMajor -ab 8 -lc RowMajor -ac 8 -te float32 -ep LinearCombination -sw IdentitySwizzle2 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 2
```

Example 3: SM80_Device_Gemm_f16t_f16t_f32n_tensor_op_f32_256x128x64_64x64x64, split_k(3)_parallel

```shell
python gemm.py -i 16 8 16 -ta float16 -tb float16 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 256 128 64 -s 3 -w 4 2 1 -cc 80 -la ColumnMajor -aa 8 -lb ColumnMajor -ab 8 -lc RowMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm GemmSplitKParallel -k 3
```
### GEMM BF16 Example

Example 1: Device_Gemm_bf16t_bf16t_f32n_tensor_op_f32_64x128x64_32x64x64, split_k(5)_parallel

```shell
python gemm.py -i 16 8 16 -ta bfloat16 -tb bfloat16 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 64 128 64 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 8 -lb ColumnMajor -ab 8 -lc RowMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle2 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm GemmSplitKParallel -k 5
```
### GEMM Int8 Example

Example 1: SM80_Device_Gemm_s8n_s8t_s8n_tensor_op_s32_256x128x128_64x64x128

```shell
python gemm.py -i 16 8 32 -ta int8 -tb int8 -tc int8 -tacc int32 -m multiply_add -op TensorOp -b 128 128 128 -s 3 -w 2 2 1 -cc 80 -la RowMajor -aa 16 -lb ColumnMajor -ab 16 -lc RowMajor -ac 16 -te float32 -ep FastLinearCombinationClamp -sw IdentitySwizzle2 -p 512 512 512 -alpha 1.0 -beta 0.0 -gm Gemm -k 1
```
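The `FastLinearCombinationClamp` epilogue used above combines the int32 accumulator with C and clamps the result into the int8 output range. A numpy sketch of that behavior (illustrative only; the exact rounding details of the CUTLASS epilogue are an assumption here):

```python
import numpy as np

def linear_combination_clamp(acc, c, alpha=1.0, beta=0.0, lo=-128, hi=127):
    """Epilogue sketch: clamp alpha * acc + beta * c into the int8 range."""
    d = alpha * acc.astype(np.float32) + beta * c.astype(np.float32)
    return np.clip(np.rint(d), lo, hi).astype(np.int8)
```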
***

## GEMM Grouped Examples

The GEMM Grouped examples use numpy to create input tensors and verify the results.

Example 1: SM80_Device_GemmGrouped_f16t_f16t_f32t_tensor_op_f32_128x128x32_64x64x32, device schedule

```shell
python gemm_grouped.py -i 16 8 16 -ta float16 -tb float16 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 8 -lb ColumnMajor -ab 8 -lc ColumnMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -p ./grouped_gemm_problem_size.csv -alpha 1.0 -beta 0.0 -pm Device
```

Example 2: SM80_Device_GemmGrouped_f64n_f64n_f64t_tensor_op_f64_64x64x16_32x32x16, host schedule

```shell
python gemm_grouped.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 64 64 16 -s 4 -w 2 2 1 -cc 80 -la RowMajor -aa 1 -lb RowMajor -ab 1 -lc ColumnMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle2 -p ./grouped_gemm_problem_size.csv -alpha 1.0 -beta 1.0 -pm Host
```

Example 3: SM80_Device_GemmGrouped_f32n_f32n_f32n_simt_f32_128x64x8_64x32x1, device schedule

```shell
python gemm_grouped.py -i 1 1 1 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op Simt -b 128 64 8 -s 4 -w 2 2 1 -cc 80 -la RowMajor -aa 1 -lb RowMajor -ab 1 -lc RowMajor -ac 1 -te float32 -ep LinearCombination -sw IdentitySwizzle4 -p ./grouped_gemm_problem_size.csv -alpha 2.0 -beta 1.0 -pm Device
```

Example 4: SM80_Device_GemmGrouped_f16t_f16t_f32t_tensor_op_f32_128x128x32_64x64x32, device schedule

```shell
python gemm_grouped.py -i 16 8 16 -ta float16 -tb float16 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la ColumnMajor -aa 8 -lb ColumnMajor -ab 8 -lc ColumnMajor -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle8 -p ./grouped_gemm_problem_size.csv -alpha 2.0 -beta 1.0 -pm Device
```

***
## Conv2d Example

The Conv2d examples use PyTorch to create input tensors and verify the results. PyTorch can be installed following the [official website](https://pytorch.org/).
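Conceptually, the fprop verification computes a direct convolution reference over NHWC activations and KRSC filters. A minimal numpy sketch of that reference (illustrative only; the examples themselves use a PyTorch-based reference model):

```python
import numpy as np

def conv2d_fprop_nhwc(x, w, stride=(1, 1), pad=(0, 0)):
    """Naive direct conv (cross-correlation) for NHWC input and KRSC filter."""
    n, h, w_in, c = x.shape
    k, r, s, _ = w.shape
    sh, sw = stride
    ph, pw = pad
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    p = (h + 2 * ph - r) // sh + 1
    q = (w_in + 2 * pw - s) // sw + 1
    y = np.zeros((n, p, q, k), dtype=x.dtype)
    for i in range(p):
        for j in range(q):
            patch = xp[:, i * sh:i * sh + r, j * sw:j * sw + s, :]  # (n, r, s, c)
            # contract the (r, s, c) window against every filter at once
            y[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return y
```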
### Conv2d F32 Fprop

Example 1: SM80_Device_Conv2d_Fprop_Analytic_ImplicitGemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32

```shell
python conv2d.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 16 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 4 -lb TensorNHWC -ab 4 -lc TensorNHWC -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -co fprop -st Strided -ia optimized -sm Serial -k 1 -nhwc 1 13 17 8 -krsc 24 3 3 8 -pad 0 0 0 0 -stride 2 2 -dilation 1 1 -alpha 1.0 -beta 0.0
```

Example 2: SM80_Device_Conv2d_Fprop_Optimized_ImplicitGemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_align2

```shell
python conv2d.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 16 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 2 -lb TensorNHWC -ab 2 -lc TensorNHWC -ac 2 -te float32 -ep LinearCombination -sw IdentitySwizzle2 -co fprop -st Strided -ia optimized -sm Serial -k 2 -nhwc 1 4 4 12 -krsc 8 3 3 12 -pad 0 0 0 0 -stride 3 3 -dilation 1 1 -alpha 1.0 -beta 1.0
```

Example 3: SM80_Device_Conv2d_Fprop_Analytic_ImplicitGemm_f32nhwc_f32nhwc_f32nhwc_simt_f32

```shell
python conv2d.py -i 1 1 1 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op Simt -b 128 128 8 -s 4 -w 4 2 1 -cc 80 -la TensorNHWC -aa 4 -lb TensorNHWC -ab 4 -lc TensorNHWC -ac 1 -te float32 -ep LinearCombination -sw IdentitySwizzle4 -co fprop -st Strided -ia analytic -sm Parallel -k 3 -nhwc 1 71 80 32 -krsc 64 5 5 32 -pad 2 2 2 2 -stride 2 2 -dilation 1 1 -alpha 1.0 -beta 1.0
```

### Conv2d F32 Wgrad

Example 1: Device_Conv2d_Wgrad_Optimized_ImplicitGemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_align1

```shell
python conv2d.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 1 -lb TensorNHWC -ab 1 -lc TensorNHWC -ac 4 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -co wgrad -st Strided -ia optimized -sm Serial -k 1 -nhwc 1 8 8 1 -krsc 1 3 3 1 -pad 1 1 1 1 -stride 1 1 -dilation 1 1 -alpha 1.0 -beta 0.0
```

Example 2: Device_Conv2d_Wgrad_Analytic_ImplicitGemm_f32nhwc_f32nhwc_f32nhwc_simt_f32

```shell
python conv2d.py -i 1 1 1 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op Simt -b 128 128 8 -s 4 -w 2 4 1 -cc 80 -la TensorNHWC -aa 4 -lb TensorNHWC -ab 4 -lc TensorNHWC -ac 1 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -co wgrad -st Strided -ia optimized -sm Serial -k 2 -nhwc 1 27 27 256 -krsc 512 3 3 256 -pad 1 1 1 1 -stride 2 1 -dilation 1 1 -alpha 1.0 -beta 0.0
```

### Conv2d F32 Dgrad

Example 1: Device_Conv2d_Dgrad_Analytic_ImplicitGemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32

```shell
python conv2d.py -i 16 8 8 -ta float32 -tb float32 -tc float32 -tacc float32 -m multiply_add -op TensorOp -b 128 128 16 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 4 -lb TensorNHWC -ab 4 -lc TensorNHWC -ac 4 -te float32 -ep LinearCombination -sw StridedDgradIdentitySwizzle1 -co dgrad -st Strided -ia optimized -sm Serial -k 2 -nhwc 1 27 27 256 -krsc 512 3 3 256 -pad 1 1 1 1 -stride 2 1 -dilation 1 1 -alpha 1.0 -beta 0.0
```

### Conv2d F16 Fprop

Example 1: SM80_Device_Conv2d_Fprop_Analytic_ImplicitGemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32

```shell
python conv2d.py -i 16 8 16 -ta float16 -tb float16 -tc float16 -tacc float32 -m multiply_add -op TensorOp -b 128 128 64 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 8 -lb TensorNHWC -ab 8 -lc TensorNHWC -ac 8 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -co fprop -st Strided -ia optimized -sm Serial -k 1 -nhwc 1 27 27 256 -krsc 512 3 3 256 -pad 1 1 1 1 -stride 2 1 -dilation 1 1 -alpha 1.0 -beta 0.0
```

Example 2: SM80_Device_Conv2d_Fprop_Few_Channels_ImplicitGemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_channels_2

```shell
python conv2d.py -i 16 8 16 -ta float16 -tb float16 -tc float16 -tacc float32 -m multiply_add -op TensorOp -b 128 128 64 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 2 -lb TensorNHWC -ab 2 -lc TensorNHWC -ac 8 -te float32 -ep LinearCombination -sw IdentitySwizzle1 -co fprop -st Strided -ia few_channels -sm Serial -k 1 -nhwc 1 16 16 2 -krsc 16 3 3 2 -pad 1 1 1 1 -stride 2 2 -dilation 1 1 -alpha 1.0 -beta 0.0
```

Example 3: SM80_Device_Conv2d_Fprop_Fixed_Channels_ImplicitGemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_channels_8

```shell
python conv2d.py -i 16 8 16 -ta float16 -tb float16 -tc float16 -tacc float32 -m multiply_add -op TensorOp -b 128 128 64 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 8 -lb TensorNHWC -ab 8 -lc TensorNHWC -ac 8 -te float32 -ep LinearCombination -sw IdentitySwizzle2 -co fprop -st Strided -ia fixed_channels -sm Serial -k 1 -nhwc 1 8 8 8 -krsc 16 3 3 8 -pad 1 1 1 1 -stride 2 2 -dilation 1 1 -alpha 1.0 -beta 0.0
```

Example 4: SM80_Device_Conv2d_Strided_Dgrad_Optimized_ImplicitGemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_128x128_32x3_64x64x32_align4

```shell
python conv2d.py -i 16 8 16 -ta float16 -tb float16 -tc float16 -tacc float32 -m multiply_add -op TensorOp -b 128 128 32 -s 3 -w 2 2 1 -cc 80 -la TensorNHWC -aa 4 -lb TensorNHWC -ab 4 -lc TensorNHWC -ac 4 -te float32 -ep LinearCombination -sw StridedDgradIdentitySwizzle1 -co dgrad -st Strided -ia optimized -sm Serial -k 1 -nhwc 1 56 56 12 -krsc 8 1 1 12 -pad 0 0 0 0 -stride 2 2 -dilation 1 1 -alpha 1.0 -beta 0.0
```
examples/40_cutlass_py/conv2d.py
################################################################################
#
# Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
#    list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
#    this list of conditions and the following disclaimer in the documentation
#    and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
#    contributors may be used to endorse or promote products derived from
#    this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
################################################################################
import pycutlass
from pycutlass import *
from pycutlass.conv2d_operation import *
from pycutlass.utils import reference_model

import argparse
import sys    # used when argument parsing fails
import torch  # used below to create and verify the input tensors

# parse the arguments
parser = argparse.ArgumentParser(description="Launch CUTLASS convolution 2d kernels from python")

# Operation description
# math instruction description
parser.add_argument("-i", "--instruction_shape",
    default=[1, 1, 1], nargs=3, type=int,
    help="This option describes the size of the MMA op")
parser.add_argument("-ta", "--element_a", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor A')
parser.add_argument("-tb", "--element_b", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor B')
parser.add_argument("-tc", "--element_c", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor C and output tensor D')
parser.add_argument("-tacc", "--element_acc", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of accumulator')
parser.add_argument('-m', "--math", default="multiply_add", type=str,
    choices=["multiply_add", "multiply_add_fast_bf16", "multiply_add_fast_f32"],
    help="math instruction")
parser.add_argument('-op', "--opcode", default="Simt", type=str,
    choices=["Simt", 'TensorOp'],
    help='This option describes whether you want to use tensor cores (TensorOp) or regular SIMT cores (Simt) on GPU SM')
# tile description
parser.add_argument("-b", "--threadblock_shape",
    default=[128, 128, 8], nargs=3, type=int,
    help="This option describes the tile size a thread block computes")
parser.add_argument("-s", "--stages", default=4, type=int,
    help="Number of pipeline stages you want to use")
parser.add_argument("-w", "--warp_count", default=[4, 2, 1], nargs=3, type=int,
    help="This option describes the number of warps along M, N, and K of the threadblock")
parser.add_argument("-cc", "--compute_capability", default=80, type=int,
    help="This option describes the CUDA SM architecture number")
# A
parser.add_argument('-la', "--layout_a", default="TensorNHWC", type=str,
    choices=["TensorNHWC", "TensorNC32HW32"],
    help="Memory layout of input tensor A")
parser.add_argument('-aa', '--alignment_a', default=1, type=int,
    help="Memory alignment of input tensor A")
# B
parser.add_argument('-lb', "--layout_b", default="TensorNHWC", type=str,
    choices=["TensorNHWC", "TensorC32RSK32"],
    help="Memory layout of input tensor B")
parser.add_argument('-ab', '--alignment_b', default=1, type=int,
    help="Memory alignment of input tensor B")
# C
parser.add_argument('-lc', "--layout_c", default="TensorNHWC", type=str,
    choices=["TensorNHWC", "TensorNC32HW32"],
    help="Memory layout of input tensor C and output tensor D")
parser.add_argument('-ac', '--alignment_c', default=1, type=int,
    help="Memory alignment of input tensor C and output tensor D")
# epilogue
parser.add_argument("-te", "--element_epilogue", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16'],
    help='Data type of computation in the epilogue')
parser.add_argument("-ep", "--epilogue_functor", default="LinearCombination", type=str,
    choices=['LinearCombination', 'FastLinearCombinationClamp', 'LinearCombinationClamp'],
    help="This option describes the epilogue part of the kernel")
# swizzling
parser.add_argument("-sw", "--swizzling_functor", default="IdentitySwizzle1", type=str,
    choices=["IdentitySwizzle1", "IdentitySwizzle2", "IdentitySwizzle4", "IdentitySwizzle8",
             "HorizontalSwizzle", "StridedDgradIdentitySwizzle1", "StridedDgradIdentitySwizzle4",
             "StridedDgradHorizontalSwizzle"],
    help="This option describes how thread blocks are scheduled on the GPU")
# conv related
parser.add_argument("-co", "--conv_kind", default="fprop", type=str,
    choices=['fprop', 'dgrad', 'wgrad'],
    help="The type of convolution: forward propagation (fprop), gradient of activation (dgrad), gradient of weight (wgrad)")
parser.add_argument("-st", "--stride_support", default="Strided", type=str,
    choices=["Strided", "Unity"])
parser.add_argument("-ia", "--iterator_algorithm", default="analytic", type=str,
    choices=["analytic", "optimized", "fixed_channels", "few_channels"],
    help="This option describes the iterator algorithm")

# arguments
parser.add_argument("-sm", "--split_k_mode", default="Serial", type=str,
    choices=["Serial", "Parallel"],
    help="Split-K mode: Serial is used for non-split-K or serial split-K; Parallel is used for parallel split-K.")
parser.add_argument('-k', '--split_k_slices', default=1, type=int,
    help="Number of split-k partitions. (default 1)")
parser.add_argument("-nhwc", "--nhwc", nargs=4, type=int, help="input size (NHWC)")
parser.add_argument("-krsc", "--krsc", nargs=4, type=int, help="filter size (KRSC)")
parser.add_argument("-pad", "--pad", nargs=4, type=int, help="padding (pad_h, _, pad_w, _)")
parser.add_argument("-stride", "--stride", nargs=2, type=int, help="stride (stride_h, stride_w)")
parser.add_argument("-dilation", "--dilation", nargs=2, type=int, help="dilation (dilation_h, dilation_w)")
parser.add_argument("-alpha", "--alpha", default=1.0, type=float, help="alpha")
parser.add_argument("-beta", "--beta", default=0.0, type=float, help="beta")

parser.add_argument('--print_cuda', action="store_true",
    help="print the underlying CUDA kernel")

try:
    args = parser.parse_args()
except:
    sys.exit(0)

pycutlass.get_memory_pool(init_pool_size=2**30, max_pool_size=2**32)

element_a = getattr(cutlass, args.element_a)
element_b = getattr(cutlass, args.element_b)
element_c = getattr(cutlass, args.element_c)
element_acc = getattr(cutlass, args.element_acc)
math_operation = getattr(MathOperation, args.math)
opclass = getattr(cutlass.OpClass, args.opcode)

math_inst = MathInstruction(
    args.instruction_shape, element_a, element_b,
    element_acc, opclass, math_operation
)

tile_description = TileDescription(
    args.threadblock_shape, args.stages, args.warp_count,
    math_inst, args.compute_capability, args.compute_capability
)

layout_a = getattr(cutlass, args.layout_a)
layout_b = getattr(cutlass, args.layout_b)
layout_c = getattr(cutlass, args.layout_c)

A = TensorDescription(
    element_a, layout_a, args.alignment_a
)

B = TensorDescription(
    element_b, layout_b, args.alignment_b
)

C = TensorDescription(
    element_c, layout_c, args.alignment_c
)

element_epilogue = getattr(cutlass, args.element_epilogue)
epilogue_functor = getattr(EpilogueFunctor, args.epilogue_functor)
iterator_algorithm = getattr(cutlass.conv.IteratorAlgorithm, args.iterator_algorithm)
swizzling_functor = getattr(cutlass, args.swizzling_functor)
stride_support = getattr(StrideSupport, args.stride_support)
conv_kind = getattr(cutlass.conv.Operator, args.conv_kind)

operation = Conv2dOperation(
    conv_kind=conv_kind, iterator_algorithm=iterator_algorithm,
    arch=args.compute_capability, tile_description=tile_description,
    A=A, B=B, C=C, element_epilogue=element_epilogue, stride_support=stride_support,
    epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor
)

if args.print_cuda:
    print(operation.rt_module.emit())

operations = [operation,]

if args.split_k_mode == "Parallel" and args.split_k_slices > 1:
    reduction_operation = ReductionOperation(
        shape=cutlass.MatrixCoord(4, 32 * C.alignment),
        C=C, element_accumulator=element_acc,
        element_compute=element_epilogue,
        count=C.alignment
    )
    operations.append(reduction_operation)

pycutlass.compiler.add_module(operations)

problem_size = cutlass.conv.Conv2dProblemSize(
    cutlass.Tensor4DCoord(args.nhwc[0], args.nhwc[1], args.nhwc[2], args.nhwc[3]),
    cutlass.Tensor4DCoord(args.krsc[0], args.krsc[1], args.krsc[2], args.krsc[3]),
    cutlass.Tensor4DCoord(args.pad[0], args.pad[1], args.pad[2], args.pad[3]),
    cutlass.MatrixCoord(args.stride[0], args.stride[1]),
    cutlass.MatrixCoord(args.dilation[0], args.dilation[1]),
    cutlass.conv.Mode.cross_correlation,
    args.split_k_slices, 1
)

# User-provided inputs
tensor_A_size = cutlass.conv.implicit_gemm_tensor_a_size(
    conv_kind, problem_size
)
tensor_B_size = cutlass.conv.implicit_gemm_tensor_b_size(
    conv_kind, problem_size
)
tensor_C_size = cutlass.conv.implicit_gemm_tensor_c_size(
    conv_kind, problem_size
)

if args.element_a != "int8":
    tensor_A = torch.ceil(torch.empty(size=(tensor_A_size,), dtype=getattr(torch, args.element_a), device="cuda").uniform_(-8.5, 7.5))
else:
    tensor_A = torch.empty(size=(tensor_A_size,), dtype=getattr(torch, args.element_a), device="cuda").uniform_(-2, 2)

if args.element_b != "int8":
    tensor_B = torch.ceil(torch.empty(size=(tensor_B_size,), dtype=getattr(torch, args.element_b), device="cuda").uniform_(-8.5, 7.5))
else:
    tensor_B = torch.empty(size=(tensor_B_size,), dtype=getattr(torch, args.element_b), device="cuda").uniform_(-2, 2)

if args.element_c != "int8":
    tensor_C = torch.ceil(torch.empty(size=(tensor_C_size,), dtype=getattr(torch, args.element_c), device="cuda").uniform_(-8.5, 7.5))
else:
    tensor_C = torch.empty(size=(tensor_C_size,), dtype=getattr(torch, args.element_c), device="cuda").uniform_(-2, 2)

tensor_D = torch.ones_like(tensor_C)

arguments = Conv2dArguments(
    operation=operation, problem_size=problem_size, A=tensor_A,
    B=tensor_B, C=tensor_C, D=tensor_D,
    output_op=LinearCombinationFunctorArguments(args.alpha, args.beta),
    split_k_mode=getattr(cutlass.conv.SplitKMode, args.split_k_mode),
    split_k_slices=problem_size.split_k_slices
)

if args.split_k_mode == "Parallel" and args.split_k_slices > 1:
    implicit_gemm_size = cutlass.conv.implicit_gemm_problem_size(conv_kind, arguments.problem_size)
    reduction_arguments = ReductionArguments(
        reduction_operation,
        problem_size=[implicit_gemm_size.m(), implicit_gemm_size.n()],
        partitions=problem_size.split_k_slices,
        workspace=arguments.ptr_D,
        destination=tensor_D,
        source=tensor_C,
        output_op=LinearCombinationFunctorArguments(args.alpha, args.beta)
    )

operation.run(arguments)

if args.split_k_mode == "Parallel" and args.split_k_slices > 1:
    reduction_operation.run(reduction_arguments)
    reduction_arguments.sync()
else:
    arguments.sync()

reference_model = Conv2dReferenceModule(A, B, C, conv_kind)

tensor_D_ref = reference_model.run(tensor_A, tensor_B, tensor_C, arguments.problem_size, args.alpha, args.beta)

assert torch.equal(tensor_D, tensor_D_ref)

print("Passed.")
examples/40_cutlass_py/gemm.py
################################################################################
#
# Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
#    list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
#    this list of conditions and the following disclaimer in the documentation
#    and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
#    contributors may be used to endorse or promote products derived from
#    this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
################################################################################
import numpy as np
import pycutlass
from pycutlass import *
import cutlass
from bfloat16 import bfloat16

import argparse

# parse the arguments
parser = argparse.ArgumentParser(
    description="Launch CUTLASS GEMM kernels from python: 'D = alpha * A * B + beta * C'")

# Operation description
# math instruction description
parser.add_argument("-i", "--instruction_shape",
    default=[1, 1, 1], nargs=3, type=int,
    help="This option describes the size of the MMA op")
parser.add_argument("-ta", "--element_a", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor A')
parser.add_argument("-tb", "--element_b", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor B')
parser.add_argument("-tc", "--element_c", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of elements in input tensor C and output tensor D')
parser.add_argument("-tacc", "--element_acc", default="float32", type=str,
    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
    help='Data type of accumulator')
parser.add_argument('-m', "--math", default="multiply_add", type=str,
    choices=["multiply_add", "multiply_add_fast_bf16", "multiply_add_fast_f32"],
    help="math instruction")
parser.add_argument('-op', "--opcode", default="Simt", type=str,
    choices=["Simt", 'TensorOp'],
    help="This option describes whether you want to use tensor cores (TensorOp) or regular SIMT cores (Simt) on GPU SM")
# tile description
parser.add_argument("-b", "--threadblock_shape",
    default=[128, 128, 8], nargs=3, type=int,
    help="This option describes the tile size a thread block computes")
parser.add_argument("-s", "--stages", default=4, type=int,
    help="Number of pipeline stages you want to use")
parser.add_argument("-w", "--warp_count", default=[4, 2, 1], nargs=3, type=int,
    help="This option describes the number of warps along M, N, and K of the threadblock")
parser.add_argument("-cc", "--compute_capability", default=80, type=int,
    help="This option describes the CUDA SM architecture number")
# A
parser.add_argument('-la', "--layout_a", default="RowMajor", type=str,
    choices=["RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
    help="Memory layout of input tensor A")
parser.add_argument('-aa', '--alignment_a', default=1, type=int,
    help="Memory alignment of input tensor A")
|
||||
# B
|
||||
parser.add_argument('-lb', "--layout_b", default="RowMajor", type=str, choices=[
|
||||
"RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
|
||||
help="Memory layout of input tensor B")
|
||||
parser.add_argument('-ab', '--alignment_b', default=1,
|
||||
type=int, help="Memory alignment of input tensor B")
|
||||
# C
|
||||
parser.add_argument('-lc', "--layout_c", default="RowMajor", type=str, choices=[
|
||||
"RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
|
||||
help="Memory layout of input tensor C and output tensor D")
|
||||
parser.add_argument('-ac', '--alignment_c', default=1,
|
||||
type=int, help="Memory alignment of input tensor C and output tensor D")
|
||||
# epilogue
|
||||
parser.add_argument("-te", "--element_epilogue", default="float32", type=str,
|
||||
choices=['float64', 'float32', 'float16', 'bfloat16'], help='Epilogue datatype')
|
||||
parser.add_argument("-ep", "--epilogue_functor", default="LinearCombination",
|
||||
type=str, choices=['LinearCombination', 'FastLinearCombinationClamp', 'LinearCombinationClamp'],
|
||||
help="This option describes the epilogue part of the kernel")
|
||||
# swizzling
|
||||
parser.add_argument("-sw", "--swizzling_functor", default="IdentitySwizzle1", type=str, choices=[
|
||||
"IdentitySwizzle1", "IdentitySwizzle2", "IdentitySwizzle4", "IdentitySwizzle8", "HorizontalSwizzle"],
|
||||
help="This option describes how thread blocks are scheduled on GPU")
|
||||
|
||||
# Argument
|
||||
parser.add_argument("-p", "--problem_size",
|
||||
default=[128, 128, 128], nargs=3, type=int,
|
||||
help="GEMM problem size M, N, K")
|
||||
parser.add_argument("-alpha", "--alpha", default=1.0, type=float,
|
||||
help="Scaling factor of A * B")
|
||||
parser.add_argument("-beta", "--beta", default=0.0, type=float,
|
||||
help="Scaling factor of C")
|
||||
parser.add_argument("-gm", "--gemm_mode", default="Gemm", type=str,
|
||||
choices=["Gemm", "GemmSplitKParallel"],
|
||||
help="GEMM mode. Gemm is used for non-splitK or serial-splitK. \
|
||||
GemmSplitKParallel is used for parallel splitK")
|
||||
parser.add_argument('-k', '--split_k_slices', default=1,
|
||||
type=int, help="Number of split-k partitions. (default 1)")
|
||||
|
||||
parser.add_argument('--print_cuda', action="store_true",
|
||||
help="print the underlying CUDA kernel")
|
||||
|
||||
# parser.add_argument('-h', '--help', action="store_true",
|
||||
# help="print help information")
|
||||
|
||||
try:
|
||||
args = parser.parse_args()
|
||||
except:
|
||||
sys.exit(0)
|
||||
|
||||
pycutlass.get_memory_pool(init_pool_size=2**30, max_pool_size=2**32)
|
||||
|
||||
element_a = getattr(cutlass, args.element_a)
|
||||
element_b = getattr(cutlass, args.element_b)
|
||||
element_c = getattr(cutlass, args.element_c)
|
||||
element_acc = getattr(cutlass, args.element_acc)
|
||||
math_operation = getattr(MathOperation, args.math)
|
||||
opclass = getattr(cutlass.OpClass, args.opcode)
|
||||
|
||||
math_inst = MathInstruction(
|
||||
args.instruction_shape, element_a, element_b,
|
||||
element_acc, opclass, math_operation
|
||||
)
|
||||
|
||||
tile_description = TileDescription(
|
||||
args.threadblock_shape, args.stages, args.warp_count,
|
||||
math_inst, args.compute_capability, args.compute_capability
|
||||
)
|
||||
|
||||
layout_a = getattr(cutlass, args.layout_a)
|
||||
layout_b = getattr(cutlass, args.layout_b)
|
||||
layout_c = getattr(cutlass, args.layout_c)
|
||||
|
||||
A = TensorDescription(
|
||||
element_a, layout_a, args.alignment_a
|
||||
)
|
||||
|
||||
B = TensorDescription(
|
||||
element_b, layout_b, args.alignment_b
|
||||
)
|
||||
|
||||
C = TensorDescription(
|
||||
element_c, layout_c, args.alignment_c
|
||||
)
|
||||
|
||||
element_epilogue = getattr(cutlass, args.element_epilogue)
|
||||
epilogue_functor = getattr(EpilogueFunctor, args.epilogue_functor)
|
||||
swizzling_functor = getattr(cutlass, args.swizzling_functor)
|
||||
|
||||
operation = GemmOperationUniversal(
|
||||
arch=args.compute_capability, tile_description=tile_description,
|
||||
A=A, B=B, C=C, element_epilogue=element_epilogue,
|
||||
epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor
|
||||
)
|
||||
|
||||
if args.print_cuda:
|
||||
print(operation.rt_module.emit())
|
||||
|
||||
operations = [operation, ]
|
||||
|
||||
if args.gemm_mode == "GemmSplitKParallel":
|
||||
reduction_operation = ReductionOperation(
|
||||
shape=cutlass.MatrixCoord(4, 32 * C.alignment),
|
||||
C=C, element_accumulator=element_acc,
|
||||
element_compute=element_epilogue,
|
||||
count=C.alignment
|
||||
)
|
||||
operations.append(reduction_operation)
|
||||
|
||||
pycutlass.compiler.add_module(operations)
|
||||
|
||||
# User-provide inputs
|
||||
|
||||
problem_size = cutlass.gemm.GemmCoord(
|
||||
args.problem_size[0], args.problem_size[1], args.problem_size[2])
|
||||
|
||||
if args.element_a != "int8":
|
||||
if args.element_a == "bfloat16":
|
||||
tensor_A = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.m()
|
||||
* problem_size.k(),))).astype(bfloat16)
|
||||
else:
|
||||
tensor_A = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.m()
|
||||
* problem_size.k(),))).astype(getattr(np, args.element_a))
|
||||
else:
|
||||
tensor_A = np.random.uniform(low=-2, high=2, size=(problem_size.m()
|
||||
* problem_size.k(),)).astype(getattr(np, args.element_a))
|
||||
|
||||
if args.element_b != "int8":
|
||||
if args.element_b == "bfloat16":
|
||||
tensor_B = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.k()
|
||||
* problem_size.n(),))).astype(bfloat16)
|
||||
else:
|
||||
tensor_B = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.k()
|
||||
* problem_size.n(),))).astype(getattr(np, args.element_b))
|
||||
else:
|
||||
tensor_B = np.random.uniform(low=-2, high=2, size=(problem_size.k()
|
||||
* problem_size.n(),)).astype(getattr(np, args.element_b))
|
||||
|
||||
if args.element_c != "int8":
|
||||
if args.element_c == "bfloat16":
|
||||
tensor_C = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.m()
|
||||
* problem_size.n(),))).astype(bfloat16)
|
||||
else:
|
||||
tensor_C = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(problem_size.m()
|
||||
* problem_size.n(),))).astype(getattr(np, args.element_c))
|
||||
else:
|
||||
tensor_C = np.random.uniform(low=-2, high=2, size=(problem_size.m()
|
||||
* problem_size.n(),)).astype(getattr(np, args.element_c))
|
||||
|
||||
tensor_D = np.ones_like(tensor_C)
|
||||
|
||||
arguments = GemmArguments(
|
||||
operation=operation, problem_size=problem_size,
|
||||
A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
|
||||
output_op=LinearCombinationFunctorArguments(args.alpha, args.beta),
|
||||
gemm_mode=getattr(cutlass.gemm.Mode, args.gemm_mode),
|
||||
split_k_slices=args.split_k_slices
|
||||
)
|
||||
|
||||
if args.gemm_mode == "GemmSplitKParallel":
|
||||
reduction_arguments = ReductionArguments(
|
||||
operation=reduction_operation,
|
||||
problem_size=[problem_size.m(), problem_size.n()],
|
||||
partitions=args.split_k_slices, workspace=arguments.ptr_D,
|
||||
destination=tensor_D, source=tensor_C,
|
||||
output_op=LinearCombinationFunctorArguments(args.alpha, args.beta)
|
||||
)
|
||||
|
||||
operation.run(arguments)
|
||||
|
||||
if args.gemm_mode == "GemmSplitKParallel":
|
||||
reduction_operation.run(reduction_arguments)
|
||||
reduction_arguments.sync()
|
||||
else:
|
||||
arguments.sync()
|
||||
|
||||
# run the host reference module
|
||||
reference = ReferenceModule(A, B, C)
|
||||
tensor_D_ref = reference.run(
|
||||
tensor_A, tensor_B, tensor_C, problem_size, args.alpha, args.beta)
|
||||
|
||||
assert np.array_equal(tensor_D, tensor_D_ref)
|
||||
|
||||
print("Passed.")
|
||||
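The script above fills its operands with integer-valued random data (the ceil of a uniform draw over [-8.5, 7.5]) so that the device result can be compared against the NumPy reference with exact equality, even for low-precision element types. A minimal standalone sketch of that initialization idiom (plain NumPy, no CUTLASS required):

```python
import numpy as np

# Ceil of uniform(-8.5, 7.5) yields whole numbers in [-8, 8]. Small integers
# are exactly representable in float16/bfloat16, so accumulation incurs no
# rounding and np.array_equal against a host reference is a valid check.
tensor = np.ceil(np.random.uniform(low=-8.5, high=7.5, size=(64,))).astype(np.float32)

assert np.array_equal(tensor, np.round(tensor))   # every entry is integer-valued
assert tensor.min() >= -8 and tensor.max() <= 8   # bounded range
```

This is why the int8 path uses a narrower `uniform(-2, 2)` draw instead: it keeps products and sums small enough to avoid overflowing the 8-bit operands.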
248
examples/40_cutlass_py/gemm_grouped.py
Normal file
@ -0,0 +1,248 @@
################################################################################
#
# Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
################################################################################
import csv
import sys
import argparse

import numpy as np

import pycutlass
from pycutlass import *
import cutlass
from bfloat16 import bfloat16

# parse the arguments
parser = argparse.ArgumentParser(
    description="Launch CUTLASS grouped GEMM kernels from python")

# Operation description
# math instruction description
parser.add_argument("-i", "--instruction_shape",
                    default=[1, 1, 1], nargs=3, type=int,
                    help="This option describes the size of the MMA op")
parser.add_argument("-ta", "--element_a", default="float32", type=str,
                    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
                    help="Data type of elements in input tensor A")
parser.add_argument("-tb", "--element_b", default="float32", type=str,
                    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
                    help="Data type of elements in input tensor B")
parser.add_argument("-tc", "--element_c", default="float32", type=str,
                    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
                    help="Data type of elements in input tensor C and output tensor D")
parser.add_argument("-tacc", "--element_acc", default="float32", type=str,
                    choices=['float64', 'float32', 'float16', 'bfloat16', 'int32', 'int8'],
                    help="Data type of accumulator")
parser.add_argument('-m', "--math", default="multiply_add", type=str,
                    choices=["multiply_add", "multiply_add_fast_bf16", "multiply_add_fast_f32"],
                    help="math instruction")
parser.add_argument('-op', "--opcode", default="Simt", type=str,
                    choices=["Simt", "TensorOp"],
                    help="This option describes whether you want to use tensor "
                         "cores (TensorOp) or regular SIMT cores (Simt) on the GPU SM")
# tile description
parser.add_argument("-b", "--threadblock_shape",
                    default=[128, 128, 8], nargs=3, type=int,
                    help="This option describes the tile size a thread block computes")
parser.add_argument("-s", "--stages", default=4, type=int,
                    help="Number of pipeline stages you want to use")
parser.add_argument("-w", "--warp_count", default=[4, 2, 1], nargs=3, type=int,
                    help="This option describes the number of warps along M, N, and K of the threadblock")
parser.add_argument("-cc", "--compute_capability", default=80, type=int,
                    help="This option describes the CUDA SM architecture number")
# A
parser.add_argument('-la', "--layout_a", default="RowMajor", type=str,
                    choices=["RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
                    help="Memory layout of input tensor A")
parser.add_argument('-aa', '--alignment_a', default=1, type=int,
                    help="Memory alignment of input tensor A")
# B
parser.add_argument('-lb', "--layout_b", default="RowMajor", type=str,
                    choices=["RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
                    help="Memory layout of input tensor B")
parser.add_argument('-ab', '--alignment_b', default=1, type=int,
                    help="Memory alignment of input tensor B")
# C
parser.add_argument('-lc', "--layout_c", default="RowMajor", type=str,
                    choices=["RowMajor", "ColumnMajor", "RowMajorInterleaved32", "ColumnMajorInterleaved32"],
                    help="Memory layout of input tensor C and output tensor D")
parser.add_argument('-ac', '--alignment_c', default=1, type=int,
                    help="Memory alignment of input tensor C and output tensor D")
# epilogue
parser.add_argument("-te", "--element_epilogue", default="float32", type=str,
                    choices=['float64', 'float32', 'float16', 'bfloat16'],
                    help="Epilogue datatype")
parser.add_argument("-ep", "--epilogue_functor", default="LinearCombination", type=str,
                    choices=['LinearCombination', 'FastLinearCombinationClamp', 'LinearCombinationClamp'],
                    help="This option describes the epilogue part of the kernel")
# swizzling
parser.add_argument("-sw", "--swizzling_functor", default="IdentitySwizzle1", type=str,
                    choices=["IdentitySwizzle1", "IdentitySwizzle2", "IdentitySwizzle4",
                             "IdentitySwizzle8", "HorizontalSwizzle"],
                    help="This option describes how thread blocks are scheduled on the GPU")
# precompute mode
parser.add_argument("-pm", "--precompute_mode",
                    default="Device", type=str, choices=["Host", "Device"],
                    help="Grouped GEMM scheduling on device only (Device) or using host precompute (Host)")
# arguments
parser.add_argument("-p", "--problem_size_dir", type=str,
                    help="path to the csv file that contains the problem sizes")
parser.add_argument("-alpha", "--alpha", default=1.0, type=float, help="alpha")
parser.add_argument("-beta", "--beta", default=0.0, type=float, help="beta")

parser.add_argument('--print_cuda', action="store_true",
                    help="print the underlying CUDA kernel")

try:
    args = parser.parse_args()
except:
    sys.exit(0)

pycutlass.get_memory_pool(init_pool_size=2**30, max_pool_size=2**32)

element_a = getattr(cutlass, args.element_a)
element_b = getattr(cutlass, args.element_b)
element_c = getattr(cutlass, args.element_c)
element_acc = getattr(cutlass, args.element_acc)
math_operation = getattr(MathOperation, args.math)
opclass = getattr(cutlass.OpClass, args.opcode)

math_inst = MathInstruction(
    args.instruction_shape, element_a, element_b,
    element_acc, opclass, math_operation
)

tile_description = TileDescription(
    args.threadblock_shape, args.stages, args.warp_count,
    math_inst, args.compute_capability, args.compute_capability
)

layout_a = getattr(cutlass, args.layout_a)
layout_b = getattr(cutlass, args.layout_b)
layout_c = getattr(cutlass, args.layout_c)

A = TensorDescription(
    element_a, layout_a, args.alignment_a
)

B = TensorDescription(
    element_b, layout_b, args.alignment_b
)

C = TensorDescription(
    element_c, layout_c, args.alignment_c
)

element_epilogue = getattr(cutlass, args.element_epilogue)
epilogue_functor = getattr(EpilogueFunctor, args.epilogue_functor)
swizzling_functor = getattr(cutlass, args.swizzling_functor)
precompute_mode = getattr(SchedulerMode, args.precompute_mode)

operation = GemmOperationGrouped(
    arch=args.compute_capability, tile_description=tile_description,
    A=A, B=B, C=C, element_epilogue=element_epilogue,
    epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor,
    precompute_mode=precompute_mode
)

if args.print_cuda:
    print(operation.rt_module.emit())

pycutlass.compiler.add_module([operation, ])

reference_module = ReferenceModule(A, B, C)

# get problems
problem_sizes = []
with open(args.problem_size_dir) as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        problem_sizes.append(
            cutlass.gemm.GemmCoord(int(row[0]), int(row[1]), int(row[2]))
        )

problem_count = len(problem_sizes)

tensor_As = []
tensor_Bs = []
tensor_Cs = []
tensor_Ds = []
problem_sizes_coord = []
tensor_D_refs = []

for problem_size in problem_sizes:
    if args.element_a != "int8":
        if args.element_a == "bfloat16":
            tensor_A = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.m() * problem_size.k(),))).astype(bfloat16)
        else:
            tensor_A = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.m() * problem_size.k(),))).astype(getattr(np, args.element_a))
    else:
        tensor_A = np.random.uniform(
            low=-2, high=2, size=(problem_size.m() * problem_size.k(),)).astype(getattr(np, args.element_a))

    if args.element_b != "int8":
        if args.element_b == "bfloat16":
            tensor_B = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.k() * problem_size.n(),))).astype(bfloat16)
        else:
            tensor_B = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.k() * problem_size.n(),))).astype(getattr(np, args.element_b))
    else:
        tensor_B = np.random.uniform(
            low=-2, high=2, size=(problem_size.k() * problem_size.n(),)).astype(getattr(np, args.element_b))

    if args.element_c != "int8":
        if args.element_c == "bfloat16":
            tensor_C = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.m() * problem_size.n(),))).astype(bfloat16)
        else:
            tensor_C = np.ceil(np.random.uniform(
                low=-8.5, high=7.5, size=(problem_size.m() * problem_size.n(),))).astype(getattr(np, args.element_c))
    else:
        tensor_C = np.random.uniform(
            low=-2, high=2, size=(problem_size.m() * problem_size.n(),)).astype(getattr(np, args.element_c))
    tensor_D = np.zeros_like(tensor_C)

    tensor_As.append(tensor_A)
    tensor_Bs.append(tensor_B)
    tensor_Cs.append(tensor_C)
    tensor_Ds.append(tensor_D)
    tensor_D_refs.append(reference_module.run(
        tensor_A, tensor_B, tensor_C, problem_size, args.alpha, args.beta))
    problem_sizes_coord.append(problem_size)

arguments = GemmGroupedArguments(
    operation, problem_sizes_coord, tensor_As, tensor_Bs, tensor_Cs, tensor_Ds,
    output_op=LinearCombinationFunctorArguments(args.alpha, args.beta)
)

operation.run(arguments)

arguments.sync()

for tensor_d, tensor_d_ref in zip(tensor_Ds, tensor_D_refs):
    assert np.array_equal(tensor_d, tensor_d_ref)

print("Passed.")
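The grouped script reads one `M,N,K` triple per CSV row (the sample `grouped_gemm_problem_size.csv` below has three such rows). A standalone sketch of that parsing step, substituting `io.StringIO` for the file so it runs without the CSV on disk:

```python
import csv
import io

# Each row of the problem-size CSV is one M,N,K triple describing a GEMM in
# the group; the same content as grouped_gemm_problem_size.csv below.
csv_text = "128,128,128\n128,128,256\n512,128,384\n"

problem_sizes = [tuple(int(x) for x in row)
                 for row in csv.reader(io.StringIO(csv_text))]

assert problem_sizes == [(128, 128, 128), (128, 128, 256), (512, 128, 384)]
```

In the script itself each triple becomes a `cutlass.gemm.GemmCoord`, so the kernels in the group can each have a different shape.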
3
examples/40_cutlass_py/grouped_gemm_problem_size.csv
Normal file
@ -0,0 +1,3 @@
128,128,128
128,128,256
512,128,384
@ -1,169 +0,0 @@

# System modules
import numpy as np
import os.path
import sys
import ctypes

# CUDA Python modules
from cuda import cuda
from cuda import nvrtc

# CUTLASS modules
import library
import manifest as cutlass_manifest
import generator
import rt


#
# Construct an SGEMM
#

manifest = cutlass_manifest.Manifest()

generator.GenerateSM50_Simt(manifest, "11.5.0")

#
# Construct a GEMM operation
#

operation = manifest.operations_by_name['cutlass_simt_sgemm_128x128_8x2_nt_align1']

#
# Construct a runtime GEMM operation
#
gemm = rt.Gemm(operation)

#
# Initialize context
#
err, = cuda.cuInit(0)

if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, device = cuda.cuDeviceGet(0)

if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, context = cuda.cuCtxCreate(0, device)

if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

#
# Construct a module
#

architectures = [80,]
include_paths = [
    '../../include',
    '../../tools/util/include',
]

compilation_options = rt.CompilationOptions(architectures, include_paths)

module = rt.Module('module.cu', [gemm], compilation_options)

#
# Setup a workspace
#

M, N, K = (128, 128, 128)

tensor_A = np.ndarray(M * K, dtype=np.float32)
tensor_B = np.ndarray(N * K, dtype=np.float32)
tensor_C = np.ndarray(M * N, dtype=np.float32)
tensor_D = np.ndarray(M * N, dtype=np.float32)

err, tensor_A_d = cuda.cuMemAlloc(tensor_A.size * tensor_A.itemsize)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, tensor_B_d = cuda.cuMemAlloc(tensor_B.size * tensor_B.itemsize)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, tensor_C_d = cuda.cuMemAlloc(tensor_C.size * tensor_C.itemsize)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, tensor_D_d = cuda.cuMemAlloc(tensor_D.size * tensor_D.itemsize)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

err, stream = cuda.cuStreamCreate(0)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))

tensors = [
    (tensor_A_d, tensor_A),
    (tensor_B_d, tensor_B),
    (tensor_C_d, tensor_C),
    (tensor_D_d, tensor_D)
]

for tensor_device, tensor_host in tensors:
    bytes = tensor_host.size * tensor_host.itemsize
    print("Tensor has dimensions: %s (%d bytes)" % (str(tensor_host.size), tensor_host.itemsize))
    err, = cuda.cuMemcpyHtoDAsync(tensor_device, tensor_host, bytes, stream)
    print("updating tensor in device memory ", hex(int(tensor_device)))
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError('CUDA Error %s' % str(err))

#
# Initialize a host buffer
#

arguments = rt.GemmArguments()

arguments.problem_size = rt.GemmCoord(M, N, K)

arguments.A = rt.TensorRef(tensor_A_d, M)
arguments.B = rt.TensorRef(tensor_B_d, N)
arguments.C = rt.TensorRef(tensor_C_d, M)
arguments.D = rt.TensorRef(tensor_D_d, M)

host_workspace = bytearray(gemm.get_host_workspace_size(arguments))
device_workspace = None

launch_config = gemm.plan(arguments)

byte_count = gemm.initialize(host_workspace, device_workspace, launch_config, arguments)

#
# Launch the kernel
#

err = gemm.run(host_workspace, device_workspace, launch_config)

if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError('CUDA Error %s' % str(err))

#
# Verify results
#
err, = cuda.cuStreamSynchronize(stream)

if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("CUDA Error %s" % str(err))


#
# Debug reporting of byte array contents
#

def PrintBytearray(host_workspace):
    uint_str = None
    prefix = None
    print("uint32_t host_workspace[] = {")
    for idx, byte in enumerate(host_workspace):
        if not (idx % 4):
            if uint_str is not None:
                print(prefix, uint_str, ",")
            prefix = "/* offset: %d B */ 0x" % idx
            uint_str = ""
        uint_str = "{:02x}".format(byte) + uint_str
    print("};")