@@ -81,13 +81,24 @@ The tiling size of the above operations can also be customized.

## Installation

### Using Docker

-You can run the PyCUTLASS on NGC PyTorch container.
+We recommend using one of our provided Docker images for using PyCUTLASS.
+
+**To run CUTLASS 3 GEMM kernels targeting the NVIDIA Hopper architecture via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda12.0) based on the NGC CUDA 12.0 container:

 ```shell
-docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.09-py3
+docker build -t pycutlass-cuda12.0:latest -f docker/Dockerfile-cuda12.0 .
+docker run --gpus all -it --rm pycutlass-cuda12.0:latest
 ```

+Note that this Docker container does not include CuPy or PyTorch, and thus will not be able to run PyCUTLASS examples that leverage these packages.
+
+**To run CUTLASS 2.x kernels targeting pre-SM90 architectures via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda11.8-pytorch) based on an NGC PyTorch container:
+
+```shell
+docker build -t pycutlass-cuda11.8-pytorch:latest -f docker/Dockerfile-cuda11.8-pytorch .
+docker run --gpus all -it --rm pycutlass-cuda11.8-pytorch:latest
+```

### Environment variables

-PyCUTLASSS requires two environment variables:
+PyCUTLASS requires two environment variables:

 * `CUTLASS_PATH`: the root directory of CUTLASS. You can set this from the location at which you cloned CUTLASS via: `export CUTLASS_PATH=$(pwd)`.
 * `CUDA_INSTALL_PATH`: the directory where the CUDA Toolkit is installed. If running in bash with `nvcc` installed under a CUDA toolkit, you can set this to the location of your `nvcc` installation via: `export CUDA_INSTALL_PATH=$(which nvcc | awk -F'/bin/nvcc' '{print $1}')`
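The `CUDA_INSTALL_PATH` derivation above can also be done programmatically. A minimal Python sketch of the same string manipulation the `awk -F'/bin/nvcc'` pipeline performs (the helper name is hypothetical, not part of PyCUTLASS):

```python
import os

def cuda_root_from_nvcc(nvcc_path: str) -> str:
    """Strip the trailing '/bin/nvcc' from an nvcc path, mirroring
    the awk -F'/bin/nvcc' one-liner in the instructions above."""
    return nvcc_path.rsplit("/bin/nvcc", 1)[0]

# e.g. "/usr/local/cuda/bin/nvcc" -> "/usr/local/cuda"
os.environ.setdefault("CUDA_INSTALL_PATH",
                      cuda_root_from_nvcc("/usr/local/cuda/bin/nvcc"))
```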
@@ -1,4 +1,36 @@
-pip install pybind11
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+#    list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+#    this list of conditions and the following disclaimer in the documentation
+#    and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+#    contributors may be used to endorse or promote products derived from
+#    this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+pip install -U pybind11
 git clone https://github.com/google/googletest.git
 python setup.py install
 python setup.py develop --user
 python setup.py rmm
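The `python setup.py rmm` invocation above works because the package's setup.py registers a custom build command. A minimal, hypothetical sketch of that wiring (the real file subclasses `distutils.cmd.Command`; this sketch uses the equivalent `setuptools.Command`, and the class name and body are illustrative only):

```python
from setuptools import Command
from setuptools.dist import Distribution

class BuildRMMSketch(Command):
    """Hypothetical stand-in for the BuildRMM command in setup.py."""
    description = "clone and build RAPIDS RMM if it is not importable"
    user_options = []  # no command-line options

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        # The real command shells out to `git clone` and `./build.sh` here;
        # this sketch only records that it ran.
        self.ran = True

# Registered via: setup(..., cmdclass={"rmm": BuildRMMSketch}),
# which is what makes `python setup.py rmm` a valid invocation.
cmd = BuildRMMSketch(Distribution())
cmd.run()
```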
@@ -1,3 +1,35 @@
+# (BSD-3-Clause license header, identical to the one above)
+
 pip install enum-tools
 pip install sphinx-toolbox
 pip install m2r2
@@ -0,0 +1,40 @@
+# (BSD-3-Clause license header, identical to the one above)
+
+FROM nvcr.io/nvidia/pytorch:22.11-py3
+
+RUN chmod ugo+rwx /home
+RUN pip uninstall -y rmm
+RUN pip install rmm-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
+ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
+ENV CUDA_INSTALL_PATH=/usr/local/cuda
tools/library/scripts/pycutlass/docker/Dockerfile-cuda12.0 (new file, 46 lines)
@@ -0,0 +1,46 @@
+# (BSD-3-Clause license header, identical to the one above)
+
+FROM nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu20.04
+
+RUN apt-get update
+RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
+RUN apt-get install -y git cmake vim python3 python3-pip
+RUN ln -s /usr/bin/python3 /usr/bin/python
+RUN chmod ugo+rwx /home
+RUN pip install numpy==1.23
+RUN pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+RUN pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+RUN pip install cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
+ENV LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LIBRARY_PATH
+ENV CUDA_INSTALL_PATH=/usr/local/cuda
@@ -1,3 +1,35 @@
+# (BSD-3-Clause license header, identical to the one above)
+
 import distutils.cmd
 from setuptools import setup
 import setuptools.command.build_py
@@ -15,7 +47,7 @@ class BuildRMM(distutils.cmd.Command):
             import rmm
         except ImportError:
             print("installing rmm")
-            os.system("git clone -b branch-22.08 --recurse-submodules https://github.com/rapidsai/rmm.git")
+            os.system("git clone -b branch-22.10 --recurse-submodules https://github.com/rapidsai/rmm.git")
             os.chdir("./rmm")
             os.system("./build.sh librmm rmm")
             os.chdir("./python")
@@ -43,7 +75,11 @@ try:
         Pybind11Extension("cutlass",
             ["src/cpp/cutlass.cpp"],
             include_dirs=include_dirs,
-            extra_compile_args=["-fpermissive", "-w"])
+            extra_compile_args=["-fpermissive", "-w", "-std=c++17"]),
+        Pybind11Extension("cute",
+            ["src/cpp/cute.cpp"],
+            include_dirs=include_dirs,
+            extra_compile_args=["-fpermissive", "-w", "-std=c++17"])
     ]
 except ImportError:
     pass
@@ -65,7 +101,7 @@ setup(
     install_requires=[
         "numpy<1.23",
         'pybind11',
-        'cuda-python<11.7.0',
+        'cuda-python>=11.8.0',
         'typeguard',
         'bfloat16',
         'typing',
tools/library/scripts/pycutlass/src/cpp/cute.cpp (new file, 54 lines)
@@ -0,0 +1,54 @@
+/* (BSD-3-Clause license header, identical to the one above) */
+
+/* \file
+   \brief binding CuTe C++ APIs to Python
+*/
+
+#include <pybind11/pybind11.h>
+#include <pybind11/stl_bind.h>
+
+#include "cute/arch/mma_sm90_gmma.hpp"
+
+namespace py = pybind11;
+
+PYBIND11_MODULE(cute, m) {
+
+  // module doc
+  m.doc() = "CuTe C++ bindings";
+
+  py::enum_<cute::GMMA::Major>(m, "GMMAMajor",
+      R"pbdoc(classification of CuTe GMMA tensor major specification)pbdoc")
+    .value("K", cute::GMMA::Major::K,
+      R"pbdoc(Tensor is contiguous in the reduction dimension)pbdoc")
+    .value("MN", cute::GMMA::Major::MN,
+      R"pbdoc(Tensor is contiguous in the non-reduction dimension)pbdoc");
+}
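For readers unfamiliar with the enum being bound above, a pure-Python stand-in (not the compiled `cute` extension itself) that mirrors the two major classifications exposed to Python as `GMMAMajor`:

```python
from enum import Enum

class GMMAMajor(Enum):
    """Pure-Python mirror of the classification bound from cute::GMMA::Major."""
    K = "K"    # tensor is contiguous in the reduction dimension
    MN = "MN"  # tensor is contiguous in the non-reduction dimension

def is_k_major(major: GMMAMajor) -> bool:
    # A K-major operand lets the GMMA instruction read along the reduction axis.
    return major is GMMAMajor.K
```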
@@ -29,8 +29,9 @@
 *
 **************************************************************************************************/
 /* \file
-   \brief binding cutlass C++ APIs to python
+   \brief binding CUTLASS C++ APIs to Python
 */

 #include <pybind11/pybind11.h>
 #include <pybind11/stl_bind.h>
@@ -34,6 +34,7 @@
 \brief A generic wrapper around an epilogue visitor operation
 */

 #pragma once

 #include "cutlass/cutlass.h"
@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the binary ops
+    \brief Binary operations to be used within the epilogue visitor model.
 */

 #pragma once
@@ -44,7 +44,7 @@ namespace cutlass {
 /////////////////////////////////////////////////////////////////////////////////////////////////

-/// Scalar multiplication
+/// Elementwise addition of two arrays
 template <typename T, int N>
 struct VectorAdd {

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the unary ops
+    \brief Unary operations to be used within the epilogue visitor model.
 */

 #pragma once

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with accumulator
+    \brief Epilogue visitor operation that simply returns the accumulator
 */

 #pragma once

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with Binary op
+    \brief Epilogue visitor operator performing a binary operation between two visitor nodes
 */

 #pragma once
@@ -84,7 +84,6 @@ public:
   /// Fragment type of accumulator
   using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-  /// Combination Op TODO: generalize this
   using BinaryOp = BinaryOp_<ElementCompute, kElementsPerAccess>;

   static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with broadcasting vector to all columns
+    \brief Epilogue visitor operation that broadcasts a vector to all columns
 */

 #pragma once

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with reduction over columns in CTA
+    \brief Epilogue visitor operation that performs a column-wise reduction within a threadblock
 */

 #pragma once
@@ -68,7 +68,6 @@ public:

   static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;

-  // TODO: generalize the reduction op
   using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
   using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
   using ElementOutput = typename OutputTileIterator::Element;

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with Linear Combination
+    \brief Epilogue visitor operation that performs a linear combination of two visitor nodes
 */

 #pragma once
@@ -82,7 +82,7 @@ public:
   /// Fragment type of accumulator
   using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-  /// Combination Op TODO: generalize this
+  /// Combination Op
   using CombinationOp = cutlass::plus<VisitAccessType>;

   static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with broadcasting vector to all rows
+    \brief Epilogue visitor operation that broadcasts a vector to all rows
 */

 #pragma once

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with reduction over rows in CTA
+    \brief Epilogue visitor operation that performs a row-wise reduction within a threadblock
 */

 #pragma once
@@ -69,7 +69,6 @@ public:

   static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;

-  // TODO: generalize the reduction op
   using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
   using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
   using ElementOutput = typename OutputTileIterator::Element;

@@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-    \brief A file contains the epilogue visitor Op with Unary operation
+    \brief Epilogue visitor operator performing a unary operation atop a visitor node
 */

 #pragma once
@@ -79,7 +79,7 @@ public:
   /// Fragment type of accumulator
   using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-  /// Combination Op TODO: generalize this
+  /// Combination Op
   using UnaryOp = UnaryOp_<ElementCompute, kElementsPerAccess>;

   static_assert(kElementsPerAccess==VisitAccessTypeVisitor::kElements, "kElementsPerAccess mismatches with Visitor");

@@ -30,7 +30,7 @@
 **************************************************************************************************/

 /*! \file
     \brief
 */

 #pragma once
@@ -139,8 +139,8 @@ public:
   //
   // Methods
   //

   Arguments():
     ptr_A(nullptr), ptr_B(nullptr), ptr_C(nullptr), ptr_D(nullptr),
     ptr_gather_A_indices(nullptr),
     ptr_gather_B_indices(nullptr),
@@ -169,8 +169,8 @@ public:
     int const *ptr_scatter_D_indices = nullptr
   ):
     UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
     epilogue_visitor(epilogue_visitor),
     ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
     batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
     stride_a(stride_a), stride_b(stride_b), stride_c(stride_c), stride_d(stride_d),
     ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@@ -205,8 +205,8 @@ public:
     int const *ptr_scatter_D_indices = nullptr
   ):
     UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
     epilogue_visitor(epilogue_visitor),
     ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
     batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
     lda(lda), ldb(ldb), ldc(ldc), ldd(ldd),
     ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@@ -221,7 +221,7 @@ public:
   /// Returns arguments for the transposed problem
   Arguments transposed_problem() const {
     Arguments args(*this);

     std::swap(args.problem_size.m(), args.problem_size.n());
     std::swap(args.ptr_A, args.ptr_B);
     std::swap(args.lda, args.ldb);
@@ -256,7 +256,7 @@ public:
   typename Mma::IteratorB::Params params_B;
   typename EpilogueVisitor::OutputTileIterator::Params params_C;
   typename EpilogueVisitor::OutputTileIterator::Params params_D;

   typename EpilogueVisitor::Params epilogue_visitor;

   void * ptr_A;
@@ -325,7 +325,7 @@ public:
     batch_stride_C = args.batch_stride_C;

     epilogue_visitor = args.epilogue_visitor;

     semaphore = static_cast<int *>(workspace);
     CUTLASS_TRACE_HOST("GemmUniversal::Params::update()");
   }
@@ -345,7 +345,7 @@ public:
   //

   CUTLASS_DEVICE
   GemmUniversalwithEpilogueVisitor() { }

   /// Determines whether kernel satisfies alignment
   static Status can_implement(
@@ -455,12 +455,12 @@ public:
     //
     // Fetch pointers based on mode.
     //
     if (params.mode == GemmUniversalMode::kGemm ||
         params.mode == GemmUniversalMode::kGemmSplitKParallel) {

       if (threadblock_tile_offset.k() + 1 < params.grid_tiled_shape.k()) {
         problem_size_k = (threadblock_tile_offset.k() + 1) * params.gemm_k_size;
       }

       offset_k = threadblock_tile_offset.k() * params.gemm_k_size;
@@ -529,10 +529,10 @@ public:

     // Compute threadblock-scoped matrix multiply-add
     mma(
       gemm_k_iterations,
       accumulators,
       iterator_A,
       iterator_B,
       accumulators);

     //
@@ -555,30 +555,16 @@ public:

     int block_idx = threadblock_tile_offset.m() + threadblock_tile_offset.n() * params.grid_tiled_shape.m();

     ElementC *ptr_C = static_cast<ElementC *>(params.ptr_C);
     ElementC *ptr_D = static_cast<ElementC *>(params.ptr_D);

-    //
-    // Fetch pointers based on mode.
-    //

     // Construct the semaphore.
     Semaphore semaphore(params.semaphore + block_idx, thread_idx);

-    // if (params.mode == GemmUniversalMode::kGemm) {
-    //   // TODO: fix this order
-    //   // If performing a reduction via split-K, fetch the initial synchronization
-    //   if (params.grid_tiled_shape.k() > 1) {
-    //     // Fetch the synchronization lock initially but do not block.
-    //     semaphore.fetch();
-    //     // Indicate which position in a serial reduction the output operator is currently updating
-    //     output_op.set_k_partition(threadblock_tile_offset.k(), params.grid_tiled_shape.k());
-    //   }
-    // }

     // Tile iterator loading from source tensor.
     EpilogueVisitor epilogue_visitor(
@@ -590,9 +576,6 @@ public:
       params.problem_size.mn()
     );

-    // if (params.mode == GemmUniversalMode::kGemmSplitKParallel) {
-    //   ptr_D += threadblock_tile_offset.k() * params.batch_stride_D;
-    // }
     if (params.mode == GemmUniversalMode::kBatched || params.mode == GemmUniversalMode::kArray) {
       epilogue_visitor.set_batch_index(threadblock_tile_offset.k());
     }
@@ -605,25 +588,20 @@ public:

     // Wait on the semaphore - this latency may have been covered by iterator construction
     if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {

-      // For subsequent threadblocks, the source matrix is held in the 'D' tensor.
-      // TODO: ???
-      // if (threadblock_tile_offset.k()) {
-      //   iterator_C = iterator_D;
-      // }

       // For subsequent threadblocks, the source matrix is held in the 'D' tensor.
       semaphore.wait(threadblock_tile_offset.k());
     }

     // Execute the epilogue operator to update the destination tensor.
     epilogue(epilogue_visitor, accumulators);

     //
     // Release the semaphore
     //

     if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {

       int lock = 0;
       if (params.grid_tiled_shape.k() == threadblock_tile_offset.k() + 1) {
@@ -635,7 +613,7 @@ public:
       // Otherwise, the semaphore is incremented
       lock = threadblock_tile_offset.k() + 1;
     }

     semaphore.release(lock);
   }
 }
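The semaphore-release logic at the end of the kernel hunk above reduces to a small rule: the threadblock that finishes the last k-partition resets the semaphore to 0 for the next output tile, while every other threadblock hands off to the next partition in the serial reduction. Sketched in Python for clarity (function name is illustrative, not from the source):

```python
def semaphore_release_value(k_tile_idx: int, num_k_tiles: int) -> int:
    """Value a threadblock releases after finishing its k-partition.

    Mirrors the kernel's logic: the final k-tile resets the lock to 0;
    earlier tiles release k_tile_idx + 1 so the next partition may proceed.
    """
    if num_k_tiles == k_tile_idx + 1:
        return 0
    return k_tile_idx + 1
```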
@@ -83,7 +83,6 @@ void bind_identity_swizzle(py::module & m, std::string name) {
       :param problem_size: Implicit gemm problem size conv_operator(NZPQK, NDHWC, KTRSC)
       :type problem_size: :class:`cutlass.gemm.GemmCoord`)
     )pbdoc")
-    // TODO: the returned dim3 is not usable in python
     .def("get_grid_shape", &T::get_grid_shape,
       py::arg("tiled_shape"),
       R"pbdoc(Computes CUDA grid dimensions given a size in units of logical tiles)pbdoc")
@ -31,6 +31,7 @@ from pycutlass.utils import *
|
||||
from pycutlass.frontend import *
|
||||
from pycutlass.reduction_operation import *
|
||||
from pycutlass.compiler import *
|
||||
from pycutlass.utils.device import device_cc
|
||||
|
||||
# module-wide variables
|
||||
|
||||
@ -40,6 +41,12 @@ this = sys.modules[__name__]
|
||||
# artifact manager
|
||||
this.compiler = ArtifactManager()
|
||||
|
||||
try:
|
||||
if not hasattr(this, 'DEVICE_CC') or this.DEVICE_CC is None:
|
||||
this.DEVICE_CC = device_cc()
|
||||
except:
|
||||
this.DEVICE_CC = None
|
||||
|
||||
def get_memory_pool(init_pool_size=0, max_pool_size=2**34):
|
||||
this.memory_pool = PoolMemoryManager(
|
||||
init_pool_size=init_pool_size,
|
||||
|
||||
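The `__init__.py` hunk above caches the device's compute capability in a module-level attribute so that it is probed at most once, and falls back to `None` when no device is available. A minimal standalone sketch of that pattern, with `probe_device` as a hypothetical stand-in for `device_cc()`:

```python
import sys

# The module object itself serves as a singleton namespace, mirroring
# `this = sys.modules[__name__]` in the hunk above.
this = sys.modules[__name__]

def probe_device():
    # Hypothetical stand-in for pycutlass.utils.device.device_cc(), which
    # would query the GPU's compute capability (e.g. 80 for SM80).
    return 80

try:
    # Only probe if the attribute has never been set (or was set to None).
    if not hasattr(this, 'DEVICE_CC') or this.DEVICE_CC is None:
        this.DEVICE_CC = probe_device()
except Exception:
    # No usable device: record None so callers can fall back gracefully.
    this.DEVICE_CC = None
```

Because the flag lives on the module object, repeated imports of the module see the cached value rather than probing again.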
@@ -0,0 +1,395 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################

"""
Utilities for stamping out collective mainloops for SM90 kernels
"""

import cute
import cutlass
from pycutlass import SubstituteTemplate
import pycutlass.library as library


tma_alignment_bytes = 16
cp_async_min_alignment_bytes = 4

class RowColMajorToGMMAMajor:
    @staticmethod
    def A(layout, element):
        """
        Converts operand A's layout from row/column major format into CuTe's GMMA major format

        :param layout: layout of the A operand
        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
        :param element: data type of the A operand

        :return: C++ CuTe GMMA major format
        :rtype: cute.GMMAMajor
        """
        type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
        if layout == cutlass.ColumnMajor and not type_requires_k_major:
            return cute.GMMAMajor.MN
        else:
            return cute.GMMAMajor.K

    @staticmethod
    def B(layout, element):
        """
        Converts operand B's layout from row/column major format into CuTe's GMMA major format

        :param layout: layout of the B operand
        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
        :param element: data type of the B operand

        :return: C++ CuTe GMMA major format
        :rtype: cute.GMMAMajor
        """
        type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
        if layout == cutlass.RowMajor and not type_requires_k_major:
            return cute.GMMAMajor.MN
        else:
            return cute.GMMAMajor.K

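The rule encoded by `RowColMajorToGMMAMajor` is: an operand is MN-major only when its layout keeps the MN mode contiguous (column-major A, row-major B) and its element type allows it; tf32 and int8 operands must always be K-major. A standalone sketch of the same decision table, with plain strings standing in for the `cutlass`/`cute` enums (all names here are illustrative, not the library's API):

```python
def gmma_major_A(layout, element):
    # tf32 and int8 operands must be K-major regardless of layout.
    if element in ('tfloat32', 'int8'):
        return 'K'
    # A is M x K: column-major A keeps M (the MN mode) contiguous.
    return 'MN' if layout == 'ColumnMajor' else 'K'

def gmma_major_B(layout, element):
    if element in ('tfloat32', 'int8'):
        return 'K'
    # B is K x N: row-major B keeps N (the MN mode) contiguous.
    return 'MN' if layout == 'RowMajor' else 'K'
```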
def cluster_shape_to_tma(dim):
    """
    Returns the TMA copy type for a given cluster dimension

    :param dim: a given dimension of a cluster
    :type dim: int

    :return: C++ TMA copy type
    :rtype: str
    """
    return 'cute::SM90_TMA_LOAD' if dim == 1 else 'cute::SM90_TMA_LOAD_MULTICAST'

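The selection is a one-line rule: a multicast TMA load only pays off when more than one CTA in the cluster consumes the same tile along the dimension in question. A self-contained restatement (the returned strings match the C++ type names emitted above):

```python
def cluster_dim_to_tma_copy(dim):
    # With a cluster extent of 1, no other CTA shares the tile, so a
    # plain TMA load suffices; otherwise the load is multicast to the
    # CTAs in the cluster that need the same data.
    return 'cute::SM90_TMA_LOAD' if dim == 1 else 'cute::SM90_TMA_LOAD_MULTICAST'
```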
def make_cpasync_gmem_tiled_copy(thread_count, element, alignment, gmma_layout, dim_mn, dim_k):
    """
    Returns a `make_tiled_copy` call for a given configuration

    :param thread_count: number of threads in the threadblock
    :type thread_count: int
    :param element: data type of the operand in question
    :param alignment: byte alignment of the operand in question
    :type alignment: int
    :param gmma_layout: GMMA layout of the operand in question
    :type gmma_layout: cute.GMMAMajor
    :param dim_mn: extent of the M/N dimension of the tile
    :type dim_mn: int
    :param dim_k: extent of the reduction dimension of the tile
    :type dim_k: int

    :return: C++ call to `make_tiled_copy`
    :rtype: str
    """

    emission_str = """decltype(cute::make_tiled_copy(
        cute::Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<cute::uint_byte_t<static_cast<int>(sizeof(${element})) * ${alignment}>>, ${element}>{},
        cute::Layout<cute::Shape<_${shape0_x}, _${shape0_y}>,
                     cute::Stride<_${stride_x}, _${stride_y}>>{},
        cute::Layout<cute::Shape<_${shape1_x}, _${shape1_y}>>{}))"""
    if gmma_layout == cute.GMMAMajor.K:
        threads_major = dim_k // alignment
        threads_minor = thread_count // threads_major
        values = {
            'shape0_x': str(threads_minor),
            'shape0_y': str(threads_major),
            'stride_x': str(threads_major),
            'stride_y': '1',
            'shape1_x': '1',
            'shape1_y': str(alignment)
        }
    elif gmma_layout == cute.GMMAMajor.MN:
        threads_major = dim_mn // alignment
        threads_minor = thread_count // threads_major
        values = {
            'shape0_x': str(threads_major),
            'shape0_y': str(threads_minor),
            'stride_x': '1',
            'stride_y': str(threads_major),
            'shape1_x': str(alignment),
            'shape1_y': '1'
        }
    else:
        raise Exception('Unexpected GMMA layout {}'.format(gmma_layout))

    # Add common values
    values['element'] = library.DataTypeTag[element]
    values['alignment'] = str(alignment)
    return SubstituteTemplate(emission_str, values)

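The thread layout built above partitions the threadblock so that consecutive threads cover the contiguous (major) mode of the tile, each issuing one `alignment`-wide vectorized `cp.async`. A standalone sketch of just that arithmetic, with an illustrative function name and the major mode selected by a string instead of `cute.GMMAMajor`:

```python
def cpasync_thread_layout(thread_count, alignment, gmma_layout, dim_mn, dim_k):
    # Each thread loads `alignment` contiguous elements along the major
    # mode, so the major extent divided by the vector width gives the
    # thread count along that mode; the rest of the threadblock spans
    # the minor mode.
    major_extent = dim_k if gmma_layout == 'K' else dim_mn
    threads_major = major_extent // alignment
    threads_minor = thread_count // threads_major
    return threads_minor, threads_major

# e.g. 128 threads, 8-element vectors, K-major tile with K extent 64:
# 64 // 8 = 8 threads along K, 128 // 8 = 16 threads along MN.
```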
def max_stages(op, arch):
    """
    Returns the maximum number of pipeline stages that can be used for an operation.

    :param op: operation for which the maximum stages should be computed. If stages are
               set via the `op.tile_description.stages` parameter, this setting is ignored
               in the present calculation
    :type op: pycutlass.GemmOperation
    :param arch: compute capability of the device on which the operation will be run
    :type arch: int

    :return: maximum number of pipeline stages that can be used for an operation
    :rtype: int
    """
    smem_per_stage = library.CalculateSmemUsagePerStage(op)
    smem_capacity = library.SharedMemPerCC[arch]
    return int(smem_capacity // smem_per_stage)

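`max_stages` is a straight capacity division: every pipeline stage needs its own A and B tile buffers in shared memory, so the stage count is bounded by capacity over per-stage footprint. With illustrative numbers (a 32 KB stage footprint against a 228 KB budget; the real values come from `CalculateSmemUsagePerStage` and `SharedMemPerCC`):

```python
def max_stages_for(smem_per_stage_bytes, smem_capacity_bytes):
    # Integer division: partial stages cannot be allocated.
    return smem_capacity_bytes // smem_per_stage_bytes

print(max_stages_for(32 * 1024, 228 * 1024))  # 7
```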
class LayoutToStride:
    _variable_first = 'cute::Stride<int64_t, cute::Int<1>, int64_t>'
    _variable_last = 'cute::Stride<cute::Int<1>, int64_t, int64_t>'

    @staticmethod
    def A(layout):
        """
        Returns the CuTe stride type corresponding to the layout of operand A

        :param layout: layout of the A operand
        :type layout: cutlass.RowMajor or cutlass.ColumnMajor

        :return: C++ declaration of CuTe stride
        :rtype: str
        """
        if layout == cutlass.RowMajor:
            return LayoutToStride._variable_first
        elif layout == cutlass.ColumnMajor:
            return LayoutToStride._variable_last
        else:
            raise Exception('Unsupported layout {}'.format(layout))

    @staticmethod
    def B(layout):
        """
        Returns the CuTe stride type corresponding to the layout of operand B

        :param layout: layout of the B operand
        :type layout: cutlass.RowMajor or cutlass.ColumnMajor

        :return: C++ declaration of CuTe stride
        :rtype: str
        """
        if layout == cutlass.RowMajor:
            return LayoutToStride._variable_last
        elif layout == cutlass.ColumnMajor:
            return LayoutToStride._variable_first
        else:
            raise Exception('Unsupported layout {}'.format(layout))

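The mapping above places the static unit stride (`cute::Int<1>`) on whichever mode the layout keeps contiguous, with the remaining mode stride and the batch stride left as runtime `int64_t` values. A standalone restatement using the same stride strings (function names here are illustrative):

```python
VARIABLE_FIRST = 'cute::Stride<int64_t, cute::Int<1>, int64_t>'  # unit stride on mode 1
VARIABLE_LAST = 'cute::Stride<cute::Int<1>, int64_t, int64_t>'   # unit stride on mode 0

def stride_A(layout):
    # A is shaped (M, K): row-major A is contiguous along K, the second mode.
    return VARIABLE_FIRST if layout == 'RowMajor' else VARIABLE_LAST

def stride_B(layout):
    # B is shaped (N, K): row-major B is contiguous along N, the first mode.
    return VARIABLE_LAST if layout == 'RowMajor' else VARIABLE_FIRST
```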
EMISSION_STR = """
using TileShape_MNK = cute::Shape<_${threadblock_shape_m}, _${threadblock_shape_n}, _${threadblock_shape_k}>;
using ClusterShape_MNK = cute::Shape<_${cluster_shape_m}, _${cluster_shape_n}, _${cluster_shape_k}>;
using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector<
    ${internal_element_A}, ${internal_element_B}, ${element_accumulator}, TileShape_MNK, ${gmma_layout_A}, ${gmma_layout_B}>()));

using SmemLayoutAtomA = decltype(cute::GMMA::smem_selector<${gmma_layout_A}, ${internal_element_A}, _${threadblock_shape_m}, _${threadblock_shape_k}>());
using SmemLayoutAtomB = decltype(cute::GMMA::smem_selector<${gmma_layout_B}, ${internal_element_B}, _${threadblock_shape_n}, _${threadblock_shape_k}>());

using CollectiveOp = typename cutlass::gemm::collective::CollectiveMma<
    ${mainloop_type}<${stage_count}, ClusterShape_MNK${kernel_schedule}>,
    TileShape_MNK,
    ${element_A},
    ${stride_A},
    ${element_B},
    ${stride_B},
    TiledMma,
    ${gmem_tiled_copy_A},
    SmemLayoutAtomA,
    void, // GMMA_SS does not need an SmemCopyAtom
    ${transform_A},
    ${gmem_tiled_copy_B},
    SmemLayoutAtomB,
    void, // GMMA_SS does not need an SmemCopyAtom
    ${transform_B}
>;
"""

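`EMISSION_STR` is filled in by `SubstituteTemplate`, which replaces `${name}` placeholders with strings. The same `${...}` syntax is understood by Python's standard `string.Template`, so the mechanism can be sketched as follows (this mirrors the observed behavior of `SubstituteTemplate`, not its actual implementation):

```python
from string import Template

emission = 'using TileShape_MNK = cute::Shape<_${threadblock_shape_m}, _${threadblock_shape_n}, _${threadblock_shape_k}>;'
values = {'threadblock_shape_m': '128', 'threadblock_shape_n': '128', 'threadblock_shape_k': '64'}

# substitute() raises KeyError for any placeholder left unfilled, which is
# desirable when the emitted string must be compilable C++.
decl = Template(emission).substitute(values)
print(decl)  # using TileShape_MNK = cute::Shape<_128, _128, _64>;
```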
def internal_element(element):
    """
    Returns the data type internally used for `element`.

    :param element: data type

    :return: data type used internally
    """
    return cutlass.tfloat32 if element == cutlass.float32 else element


def common_values(op, stage_count, transform_A, transform_B):
    """
    Returns a dictionary containing common values to be substituted in the emission of the
    collective operation declaration. Values specific to a particular collective operation
    should be added to these.

    :param op: GEMM operation for which to build a collective operation
    :type op: pycutlass.GemmOperation
    :param stage_count: number of pipeline stages to use in the operation
    :type stage_count: int
    :param transform_A: transformation to perform on the A operand
    :type transform_A: str
    :param transform_B: transformation to perform on the B operand
    :type transform_B: str

    :return: dictionary containing values to substitute in emission string
    :rtype: dict
    """
    internal_element_a = internal_element(op.A.element)
    internal_element_b = internal_element(op.B.element)

    return {
        'threadblock_shape_m': str(op.tile_description.threadblock_shape[0]),
        'threadblock_shape_n': str(op.tile_description.threadblock_shape[1]),
        'threadblock_shape_k': str(op.tile_description.threadblock_shape[2]),
        'cluster_shape_m': str(op.tile_description.cluster_shape[0]),
        'cluster_shape_n': str(op.tile_description.cluster_shape[1]),
        'cluster_shape_k': str(op.tile_description.cluster_shape[2]),
        'element_A': library.DataTypeTag[op.A.element],
        'element_B': library.DataTypeTag[op.B.element],
        'internal_element_A': library.DataTypeTag[internal_element_a],
        'internal_element_B': library.DataTypeTag[internal_element_b],
        'element_accumulator': library.DataTypeTag[op.accumulator_type()],
        'gmma_layout_A': library.CuTeLayoutTag[RowColMajorToGMMAMajor.A(op.A.layout, internal_element_a)],
        'gmma_layout_B': library.CuTeLayoutTag[RowColMajorToGMMAMajor.B(op.B.layout, internal_element_b)],
        'stride_A': LayoutToStride.A(op.A.layout),
        'stride_B': LayoutToStride.B(op.B.layout),
        'stage_count': str(stage_count),
        'transform_A': transform_A,
        'transform_B': transform_B
    }

def build_gmma_tma(op):
    """
    Builds a collective operation declaration targeting TMA GMMA kernels

    :param op: GEMM operation for which to build a collective operation
    :type op: pycutlass.GemmOperation

    :return: string containing the C++ declaration of the collective operation
    :rtype: str
    """
    A_tma_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % tma_alignment_bytes == 0
    B_tma_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % tma_alignment_bytes == 0
    if not A_tma_aligned or not B_tma_aligned:
        raise Exception('Each of the A and B operands must be aligned to {} bytes to use TMA'.format(tma_alignment_bytes))

    max_stage_count = max_stages(op, arch=90)
    if op.tile_description.stages is None:
        op.tile_description.stages = max_stage_count
    elif op.tile_description.stages > max_stage_count:
        raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')

    kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecialized'
    if op.tile_description.persistent:
        kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecializedPersistent'

    transform_A = 'cute::identity'
    transform_B = 'cute::identity'
    values = common_values(op, op.tile_description.stages, transform_A, transform_B)
    specific_values = {
        'mainloop_type': 'cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized',
        'kernel_schedule': ', ' + kernel_schedule,
        'gmem_tiled_copy_A': cluster_shape_to_tma(op.tile_description.cluster_shape[1]),
        'gmem_tiled_copy_B': cluster_shape_to_tma(op.tile_description.cluster_shape[0])
    }
    values.update(specific_values)

    return SubstituteTemplate(EMISSION_STR, values)

def build_gmma_cpasync(op):
    """
    Builds a collective operation declaration targeting cp.async GMMA kernels

    :param op: GEMM operation for which to build a collective operation
    :type op: pycutlass.GemmOperation

    :return: string containing the C++ declaration of the collective operation
    :rtype: str
    """
    A_cp_async_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % cp_async_min_alignment_bytes == 0
    B_cp_async_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % cp_async_min_alignment_bytes == 0
    if not A_cp_async_aligned or not B_cp_async_aligned:
        raise Exception('Each of the A and B operands must be aligned to {} bytes to use cp.async'.format(cp_async_min_alignment_bytes))

    max_stage_count = max_stages(op, arch=90)
    if op.tile_description.stages is None:
        op.tile_description.stages = max_stage_count
    elif op.tile_description.stages > max_stage_count:
        raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')

    transform_A = 'cute::identity'
    transform_B = 'cute::identity'

    thread_count = 128
    cpasync_copy_A = make_cpasync_gmem_tiled_copy(thread_count, op.A.element, op.A.alignment, RowColMajorToGMMAMajor.A(op.A.layout, op.A.element),
                                                  op.tile_description.threadblock_shape[0], op.tile_description.threadblock_shape[2])
    cpasync_copy_B = make_cpasync_gmem_tiled_copy(thread_count, op.B.element, op.B.alignment, RowColMajorToGMMAMajor.B(op.B.layout, op.B.element),
                                                  op.tile_description.threadblock_shape[1], op.tile_description.threadblock_shape[2])

    values = common_values(op, op.tile_description.stages, transform_A, transform_B)
    specific_values = {
        'mainloop_type': 'cutlass::gemm::MainloopSm90CpAsyncGmma',
        'kernel_schedule': '',
        'gmem_tiled_copy_A': cpasync_copy_A,
        'gmem_tiled_copy_B': cpasync_copy_B
    }
    values.update(specific_values)

    return SubstituteTemplate(EMISSION_STR, values)

def build(operation):
    """
    Builds a collective operation declaration targeting cp.async or TMA for GMMA kernels

    :param operation: GEMM operation for which to build a collective operation
    :type operation: pycutlass.GemmOperation

    :return: string containing the C++ declaration of the collective operation
    :rtype: str
    """
    A_tma_aligned = (library.DataTypeSizeBytes[operation.A.element] * operation.A.alignment) % tma_alignment_bytes == 0
    B_tma_aligned = (library.DataTypeSizeBytes[operation.B.element] * operation.B.alignment) % tma_alignment_bytes == 0
    tma_correct_size = (library.DataTypeSizeBytes[operation.A.element] == 2 and library.DataTypeSizeBytes[operation.B.element] == 2)
    tma_correct_layout = (operation.A.layout == cutlass.RowMajor or operation.B.layout == cutlass.ColumnMajor)
    if A_tma_aligned and B_tma_aligned and (tma_correct_size or tma_correct_layout):
        return build_gmma_tma(operation)
    else:
        return build_gmma_cpasync(operation)
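The dispatch in `build` reduces to a predicate: TMA is used only when both operands are 16-byte aligned and either both element types are 2-byte or at least one operand is K-major (row-major A or column-major B); otherwise the cp.async mainloop is emitted. A standalone sketch of that predicate, with element sizes in bytes and string layouts standing in for the `cutlass` enums (illustrative names):

```python
TMA_ALIGNMENT_BYTES = 16

def use_tma(size_A, align_A, size_B, align_B, layout_A, layout_B):
    # Both operands must satisfy TMA's 16-byte access alignment.
    aligned = (size_A * align_A) % TMA_ALIGNMENT_BYTES == 0 and \
              (size_B * align_B) % TMA_ALIGNMENT_BYTES == 0
    # 16-bit operands always qualify; otherwise at least one operand
    # must be K-major (row-major A / column-major B).
    correct_size = (size_A == 2 and size_B == 2)
    correct_layout = (layout_A == 'RowMajor' or layout_B == 'ColumnMajor')
    return aligned and (correct_size or correct_layout)

# fp16 (2 B) operands with alignment 8 qualify for the TMA mainloop:
print(use_tma(2, 8, 2, 8, 'ColumnMajor', 'RowMajor'))  # True
```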
@@ -33,8 +33,6 @@
import ctypes
from pycutlass.library import *

# 12B


class GemmCoord_(ctypes.Structure):
    _fields_ = [
@@ -48,6 +46,24 @@ class GemmCoord_(ctypes.Structure):
            setattr(self, field_name, getattr(gemm_coord, field_name)())


class GemmCoordBatched_(ctypes.Structure):
    """
    Wrapper around a GemmCoord that also contains batch count. This is used for encoding
    batched GEMM inputs to CUTLASS 3 GEMMs.
    """
    _fields_ = [
        ("m", ctypes.c_int),
        ("n", ctypes.c_int),
        ("k", ctypes.c_int),
        ("batch_count", ctypes.c_int)
    ]

    def __init__(self, gemm_coord, batch_count) -> None:
        for field_name, _ in self._fields_[:-1]:
            setattr(self, field_name, getattr(gemm_coord, field_name)())
        setattr(self, "batch_count", batch_count)

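`GemmCoordBatched_` flattens a GEMM problem size plus a batch count into four contiguous 32-bit ints, so the struct matches the corresponding C++ argument layout byte-for-byte. A standalone sketch of the same idea (a simplified class name, constructed directly from ints rather than from a wrapped `GemmCoord` pybind object):

```python
import ctypes

class GemmCoordBatched(ctypes.Structure):
    # Four contiguous 32-bit ints: m, n, k, batch_count (16 bytes total).
    _fields_ = [("m", ctypes.c_int), ("n", ctypes.c_int),
                ("k", ctypes.c_int), ("batch_count", ctypes.c_int)]

coord = GemmCoordBatched(m=1024, n=512, k=256, batch_count=4)
print(ctypes.sizeof(coord))  # 16
```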
class MatrixCoord_(ctypes.Structure):
    _fields_ = [
        ("row", ctypes.c_int),
@@ -55,6 +71,26 @@ class MatrixCoord_(ctypes.Structure):
    ]


class dim3_(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_int),
        ("y", ctypes.c_int),
        ("z", ctypes.c_int)
    ]


class StrideBatched_(ctypes.Structure):
    """
    CUTLASS 3.0 strides for operands contain one static dimension and two variable dimensions.
    The variable dimensions represent the stride along the non-unit-stride dimension of the
    row/column major layout, and the batch stride. This structure encodes the two variable
    dimensions.
    """
    _fields_ = [
        ("major_stride", ctypes.c_int64),
        ("batch_stride", ctypes.c_int64)
    ]

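`StrideBatched_` carries only the two runtime strides; the unit stride is static in the C++ stride type. For a row-major `m x n` operand with leading dimension `ld`, the major stride is `ld` and the batch stride is the size of one matrix. A sketch of how a caller might fill the struct (an assumption for illustration, not code from the library):

```python
import ctypes

class StrideBatched(ctypes.Structure):
    _fields_ = [("major_stride", ctypes.c_int64),
                ("batch_stride", ctypes.c_int64)]

def row_major_stride(m, n, ld=None):
    # major_stride: elements between consecutive rows (the non-unit mode);
    # batch_stride: elements between consecutive matrices in the batch.
    ld = n if ld is None else ld
    return StrideBatched(major_stride=ld, batch_stride=m * ld)

s = row_major_stride(128, 64)
print(s.major_stride, s.batch_stride)  # 64 8192
```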
dtype2ctype = {
    cutlass.float16: ctypes.c_uint16,
    cutlass.float32: ctypes.c_float,
@@ -63,6 +99,28 @@ dtype2ctype = {
}


def get_gemm_arguments_3x(epilogue_functor):

    _EpilogueOutputOpParams = epilogue_functor.epilogue_type

    class _GemmArguments(ctypes.Structure):
        _fields_ = [
            ("mode", ctypes.c_int),
            ("problem_size", GemmCoordBatched_),
            ("ptr_A", ctypes.c_void_p),
            ("stride_A", StrideBatched_),
            ("ptr_B", ctypes.c_void_p),
            ("stride_B", StrideBatched_),
            ("ptr_C", ctypes.c_void_p),
            ("stride_C", StrideBatched_),
            ("ptr_D", ctypes.c_void_p),
            ("stride_D", StrideBatched_),
            ("epilogue", _EpilogueOutputOpParams),
        ]

    return _GemmArguments, _EpilogueOutputOpParams

def get_gemm_arguments(epilogue_functor):

    _EpilogueOutputOpParams = epilogue_functor.epilogue_type
@@ -103,8 +161,6 @@ def get_gemm_arguments(epilogue_functor):
# GEMM Grouped
###########################################################################################

# include/cutlass/gemm/kernel/gemm_grouped.h

def get_gemm_grouped_arguments(epilogue_functor):
    _EpilogueOutputOpParams = epilogue_functor.epilogue_type

@@ -131,12 +187,6 @@ def get_gemm_grouped_arguments(epilogue_functor):
# Convolution2D
############################################################################################


# We use the arguments as the interface


# include/cutlass/conv/conv2d_problem_size.h
# 64B
class Conv2DProblemSize(ctypes.Structure):
    _fields_ = [
        ("N", ctypes.c_int),
@@ -164,8 +214,6 @@ class Conv2DProblemSize(ctypes.Structure):
            setattr(self, field_name, getattr(problem_size, field_name))


# include/cutlass/layout/tensor.h
# 12B
class Layout4D(ctypes.Structure):
    _fields_ = [
        ("stride", ctypes.c_int * 3)
@@ -175,13 +223,7 @@ class Layout4D(ctypes.Structure):
        stride = tensor_ref.stride()
        setattr(self, "stride", (stride.at(0), stride.at(1), stride.at(2)))

# TODO: Tensor 5-D takes ("stride", ctypes.c_int * 4)


# include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h
# TensorRef is basically cutlass::TensorRef<Element, Layout>;
# include/cutlass/tensor_ref.h
# 24B
class TensorRef_(ctypes.Structure):
    _fields_ = [
        ("ptr", ctypes.c_void_p),
@@ -200,9 +242,6 @@ class TensorRef2D_(ctypes.Structure):
    ]


# include/cutlass/conv/kernel/implicit_gemm_convolution.h
# split_k_mode: kNone: 0, kSerial: 1, kParallel: 2, kParallelSerial: 3, kInvalid: 4

def get_conv2d_arguments(epilogue_functor):
    _EpilogueOutputOpParams = epilogue_functor.epilogue_type

@@ -224,7 +263,6 @@ def get_conv2d_arguments(epilogue_functor):
# Reduction
############################################################################################


def get_reduction_params(epilogue_functor):
    _EpilogueOutputParams = epilogue_functor.epilogue_type

@@ -29,6 +29,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
import cutlass
from cuda import cuda
@@ -54,11 +55,11 @@ class CompilationOptions:
    '''

    #
    def __init__(self, flags, architectures=[80], include_paths=[]):
    def __init__(self, flags, arch, include_paths=[]):
        self.includes = []
        self.include_paths = include_paths
        self.flags = flags
        self.architectures = architectures
        self.arch = arch

    def get_str(self):
        options = ""
@@ -69,13 +70,11 @@ class CompilationOptions:
        for incl in self.include_paths:
            options += ' --include-path=%s' % incl

        arch_list = "-arch="
        for idx, arch in enumerate(self.architectures):
            if idx:
                arch_list += ","
            arch_list += "sm_%d" % arch
        arch_flag = " -arch=sm_%d" % self.arch
        if self.arch == 90:
            arch_flag += 'a'
        options += arch_flag

        options += " " + arch_list
        return options

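The change above replaces the comma-separated `-arch=sm_80,sm_90` list with a single `-arch` flag, appending the `a` suffix for SM90 because Hopper-specific features (wgmma, TMA) require the architecture-specific `sm_90a` target. The flag construction in isolation:

```python
def arch_flag(arch):
    # Kernels using Hopper-specific instructions must be compiled for
    # the architecture-specific 'sm_90a' target rather than plain 'sm_90'.
    flag = " -arch=sm_%d" % arch
    if arch == 90:
        flag += 'a'
    return flag

print(arch_flag(80))  # " -arch=sm_80"
print(arch_flag(90))  # " -arch=sm_90a"
```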
    #
@@ -88,13 +87,11 @@ class CompilationOptions:
        for incl in self.include_paths:
            options.append(bytes(str.encode('--include-path=%s' % incl)))

        arch_list = "-arch="
        for idx, arch in enumerate(self.architectures):
            if idx:
                arch_list += ","
            arch_list += "sm_%d" % arch
        arch_flag = " -arch=sm_%d" % self.arch
        if self.arch == 90:
            arch_flag += 'a'

        options.append(bytes(str.encode(arch_list)))
        options.append(bytes(str.encode(arch_flag)))

        return options

@@ -138,12 +135,12 @@ class ArtifactManager:
    def nvrtc(self):
        self.backend = "nvrtc"
        self.default_compile_options = [
            '-std=c++11', '-default-device',
            '-std=c++17', '-default-device'
        ]

    def nvcc(self):
        self.backend = "nvcc"
        self.default_compile_options = [
            '-std=c++11',
            '-std=c++17', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored'
        ]

    def insert_operation(self, op_key, cubin, hostfile, op_name, op_attrs):
        connection = sqlite3.connect("./compiled_cache.db")
@@ -158,7 +155,7 @@ class ArtifactManager:
        connection.commit()
        cursor.close()

    def load_operation(self, op_key):
    def load_operation(self, op_key, extra_funcs):
        connection = sqlite3.connect("./compiled_cache.db")
        cursor = connection.cursor()
        sqlite_fetch_blob_query = """SELECT * from compiled_operations where op_key = ?"""
@@ -194,12 +191,17 @@ class ArtifactManager:
            if isinstance(attr, str):
                func_name = operation_name + '_' + attr
                func = getattr(host_lib, func_name)

                # Set the return type of the function
                if attr in extra_funcs and extra_funcs[attr] is not None:
                    func.restype = extra_funcs[attr]

                compiled_host_fns[attr] = func

        self.compiled_cache_host.insert(key, compiled_host_fns)
        return True

    def emit_compile_(self, operation_list, compilation_options):
    def emit_compile_(self, operation_list, compilation_options, requires_nvcc_hostlib_compilation):
        """
        Compile a list of kernels and store them in the database
        """
@@ -276,6 +278,7 @@ class ArtifactManager:
            err, = nvrtc.nvrtcGetCUBIN(program, cubin_image)
            if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
                raise RuntimeError('NVRTC Error: {}'.format(err))

        else:  # with nvcc backend
            # emit code
            tempfile.tempdir = "./"
@@ -303,22 +306,34 @@ class ArtifactManager:
            with open(temp_cubin.name, 'rb') as file:
                cubin_image = file.read()

        # compile the host code
        options = compilation_options.get()
        cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
        for opt in options:
            opt = opt.decode("utf-8")
            if opt not in ['-default-device', '-std=c++11', '-Xcicc', '-Xllc'] and '-arch=sm_' not in opt:
                if '--include-path=' in opt:
                    cmd += " " + opt.replace('--include-path=', '-I')
                else:
                    cmd += " " + opt
        # Set up the host-side library code
        if requires_nvcc_hostlib_compilation:
            cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
            assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
            cmd_template = "echo '%s'|${cuda_install_path}/bin/nvcc -x cu -Xcompiler=\"-fpermissive -w -fPIC\" ${options}" % source_buffer_host
            cmd = SubstituteTemplate(
                cmd_template,
                {
                    "cuda_install_path": cuda_install_path,
                    "options": compilation_options.get_str()
                })
        else:
            options = compilation_options.get()
            cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
            filtered_opts = ['-default-device', '-Xcicc', '-Xllc', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored']
            for opt in options:
                opt = opt.decode("utf-8")
                if opt not in filtered_opts and '-arch=sm_' not in opt:
                    if '--include-path=' in opt:
                        cmd += " " + opt.replace('--include-path=', '-I')
                    else:
                        cmd += " " + opt

        tempfile.tempdir = "./"
        temp = tempfile.NamedTemporaryFile(
            prefix='host_func', suffix='.so', delete=True)

        cmd += ' - -shared -o %s' % temp.name
        cmd += ' - -shared -o %s -lcudart -lcuda' % temp.name
        os.system(cmd)
        host_lib = ctypes.CDLL(temp.name)

@@ -333,23 +348,25 @@ class ArtifactManager:
        assert cutlass_path is not None, "Environment variable 'CUTLASS_PATH' is not defined."
        cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
        assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
        architectures = []
        for operation in operations:
            if hasattr(operation, "tile_description"):
                cc = operation.arch
                if cc not in architectures:
                    architectures.append(cc)
        include_paths = [
            cuda_install_path + '/include',
            cutlass_path + '/include',
            cutlass_path + '/tools/util/include',
            cutlass_path + '/tools/library/scripts/pycutlass/src/cpp/include'
        ]

        if pycutlass.DEVICE_CC is not None:
            arch = pycutlass.DEVICE_CC
        else:
            # Find the maximum arch tag among the provided operations and compile for that target.
            # Since we are compiling to .cubin files, only one architecture may be specified.
            arch = max([op.arch for op in operations])
        compile_options = CompilationOptions(
            self.default_compile_options, architectures, include_paths)
            self.default_compile_options, arch, include_paths)
        # save the cubin
        operation_key = []
        operation_list = []
        requires_nvcc_hostlib_compilation = False
        for operation in operations:
            # step 1: get kernel string as key
            key = operation.rt_module.emit() + operation.procedural_name() + self.backend
@@ -357,7 +374,7 @@ class ArtifactManager:
            compiled_kernel = self.compiled_cache_device.at(key)

            if compiled_kernel is None:
                hit = self.load_operation(key)
                hit = self.load_operation(key, getattr(operation.rt_module, 'extra_funcs', {}))
                if hit:
                    compiled_kernel = self.compiled_cache_device.at(key)
                    assert compiled_kernel is not None
@@ -371,9 +388,18 @@ class ArtifactManager:
            else:
                operation_list.append(operation.rt_module)
                operation_key.append(key)

            # Creating the Params structures for certain 3.0 kernels currently requires CUDA. For these cases, use NVCC to generate
            # the PyCUTLASS host-side library. Otherwise, g++ will be used.
            if isinstance(operation, pycutlass.gemm_operation.GemmOperationUniversal) and operation.api == pycutlass.library.ApiVersion.v3x:
                if self.backend == "nvrtc":
                    raise RuntimeError('CUTLASS 3 kernels currently require NVCC for compilation.')

                requires_nvcc_hostlib_compilation = True

        if len(operation_list) > 0:
            cubin_image, host_lib, host_file = self.emit_compile_(
                operation_list, compile_options)
                operation_list, compile_options, requires_nvcc_hostlib_compilation)

            err, module = cuda.cuModuleLoadData(cubin_image)
            if err != cuda.CUresult.CUDA_SUCCESS:
@@ -417,9 +443,11 @@ class ArtifactManager:
                op_attr.append(param_size)

                if hasattr(operation, "extra_funcs"):
                    for suffix in operation.extra_funcs:
                    for suffix, ret_type in operation.extra_funcs.items():
                        func_name = operation.name() + '_' + suffix
                        func = getattr(host_lib, func_name)
                        if ret_type is not None:
                            func.restype = ret_type
                        setattr(operation, suffix, func)
                        compiled_host_fns[suffix] = func
                        op_attr.append(suffix)

@ -463,13 +463,14 @@ class Conv2dOperation:
|
||||
)
|
||||
|
||||
if self.stride_support == StrideSupport.Unity:
|
||||
configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
|
||||
configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
|
||||
else:
|
||||
configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"
|
||||
configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"
|
||||
|
||||
return SubstituteTemplate(
|
||||
configuration_name,
|
||||
{
|
||||
'arch': str(self.arch),
|
||||
'opcode_class': opcode_class_name,
|
||||
'extended_name': self.extended_name(),
|
||||
'threadblock': threadblock,
|
||||
@ -509,7 +510,7 @@ class Conv2dOperation:
|
||||
intermediate_type = ''
|
||||
|
||||
if self.tile_description.math_instruction.opcode_class == cutlass.OpClass.TensorOp:
|
||||
inst_shape = "%d%d%d" % tuple(
|
||||
inst_shape = "%dx%dx%d" % tuple(
|
||||
self.tile_description.math_instruction.instruction_shape)
|
||||
if self.tile_description.math_instruction.element_a != self.A.element and \
|
||||
self.tile_description.math_instruction.element_a != self.accumulator_type():
|
||||
|
||||
@ -111,6 +111,7 @@ class LinearCombination(EpilogueFunctorBase):
|
||||
self.element_output = element_output
|
||||
self.element_accumulator = element_accumulator
|
||||
self.element_epilogue = element_epilogue
|
||||
self.epilogue_vector_length = epilogue_vector_length
|
||||
|
||||
self.template_arguments = [
|
||||
DataTypeTag[element_output], str(epilogue_vector_length),
|
||||
|
||||
@ -36,6 +36,7 @@ import numpy as np
|
||||
from typeguard import typechecked
|
||||
import cutlass
|
||||
from pycutlass import *
|
||||
import pycutlass.builder.collective_op_builder as collective_op_builder
|
||||
from cuda import cuda
|
||||
|
||||
|
||||
@ -56,9 +57,9 @@ def transpose_layout(layout: cutlass.layout):
|
||||
|
||||
|
||||
# @typechecked
|
||||
class GemmArguments(ArgumentBase):
|
||||
class GemmArguments2x(ArgumentBase):
|
||||
"""
|
||||
Argument wrapper for GEMM. It encodes problem information and
|
||||
Argument wrapper for GEMM in CUTLASS 2. It encodes problem information and
|
||||
user-provide tensors into the kernel's argument
|
||||
|
||||
:param operation: the GEMM operation to take the argument
|
||||
@ -148,7 +149,7 @@ class GemmArguments(ArgumentBase):
|
||||
self.batch_count = 1
|
||||
self.split_k_slices = self.batch_count
|
||||
|
||||
if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:
|
||||
if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:
|
||||
if 'batch' in kwargs.keys():
|
||||
self.batch_count = kwargs['batch']
|
||||
else:
|
||||
@ -313,6 +314,154 @@ class GemmArguments(ArgumentBase):
|
||||
self.device_workspace = device_workspace
|
||||
self.launch_config = launch_config
|
||||
|
||||
class GemmArguments3x(GemmArguments2x):
|
||||
"""
|
||||
Argument wrapper for GEMM in CUTLASS 3. It encodes problem information and
|
||||
user-provide tensors into the kernel's argument
|
||||
|
||||
:param operation: the GEMM operation to take the argument
|
||||
:type operation: :class:`pycutlass.GemmOperationUniversal` |
|
||||
:class:`pycutlass.GemmOperationGrouped`
|
||||
|
||||
:param problem_size: GEMM problem size gemm(M, N, K)
|
||||
:type operation: :class:`cutlass.gemm.GemmCoord`
|
||||
|
||||
:param A: tensor A
|
||||
:type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param B: tensor B
|
||||
:type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param C: tensor C
|
||||
:type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param D: tensor D
|
||||
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param gemm_mode: GEMM mode
|
||||
:type gemm_mode: :class:`cutlass.gemm.Mode`
|
||||
|
||||
:param output_op: output operator, optional
|
||||
:type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
|
||||
A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
|
||||
gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
|
||||
if gemm_mode not in [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.Batched]:
|
||||
raise Exception("Unsupporged GEMM mode {}.".format(gemm_mode))
|
||||
|
||||
super().__init__(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
|
||||
|
||||
def get_arguments(self):
|
||||
problem_size_ = GemmCoordBatched_(self.problem_size, self.batch_count)
|
||||
|
||||
if self.batch_count > 1:
|
||||
bsA = self.batched_stride_A
|
||||
bsB = self.batched_stride_B
|
||||
bsC = self.batched_stride_C
|
||||
bsD = self.batched_stride_D
|
||||
else:
|
||||
bsA = 0
|
||||
bsB = 0
|
||||
bsC = 0
|
||||
bsD = 0
|
||||
stride_A = StrideBatched_(self.lda, bsA)
|
||||
stride_B = StrideBatched_(self.ldb, bsB)
|
||||
stride_C = StrideBatched_(self.ldc, bsC)
|
||||
stride_D = StrideBatched_(self.ldd, bsD)
|
||||
|
||||
self.arguments = self.operation.argument_type(
|
||||
self.gemm_mode,
|
||||
problem_size_,
|
||||
int(self.ptr_A),
|
||||
stride_A,
|
||||
int(self.ptr_B),
|
||||
stride_B,
|
||||
int(self.ptr_C),
|
||||
stride_C,
|
||||
int(self.ptr_D),
|
||||
stride_D,
|
||||
self.output_op,
|
||||
)
|
||||
|
||||
def initialize(self):
|
||||
# get the host and evice workspace
|
||||
device_workspace_size = \
|
||||
self.operation.rt_module.get_device_workspace_size(self)
|
||||
|
||||
if device_workspace_size > 0:
|
||||
self.workspace_buffer = device_mem_alloc(device_workspace_size)
|
||||
workspace_ptr = self.workspace_buffer.ptr
|
||||
err, = cuda.cuMemsetD32(
|
||||
workspace_ptr, 0, device_workspace_size // 4)
|
||||
else:
|
||||
workspace_ptr = None
|
||||
|
||||
device_workspace = 0
|
||||
if (workspace_ptr is not None and
|
||||
self.gemm_mode == cutlass.gemm.Mode.GemmSplitKParallel):
|
||||
# in GEMM splik-K parallel, the D pointer is redirected
|
||||
# to the workspace
|
||||
self.ptr_D = cuda.CUdeviceptr(workspace_ptr)
|
||||
elif (workspace_ptr is not None and
|
||||
self.gemm_mode == cutlass.gemm.Mode.Gemm):
|
||||
# in GEMM split-K serial
|
||||
device_workspace = workspace_ptr
|
||||
|
||||
self.get_arguments()
|
||||
res_arg = self.operation.rt_module.get_args(
|
||||
ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
|
||||
host_workspace = bytearray(res_arg.contents)
|
||||
|
||||
grid = self.operation.rt_module.get_grid_shape(
|
||||
ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
|
||||
block = self.operation.rt_module.get_block_shape()
|
||||
|
||||
device_workspace = None
|
||||
|
||||
self.host_workspace = host_workspace
|
||||
self.device_workspace = device_workspace
|
||||
self.launch_config = LaunchConfiguration([grid.x, grid.y, grid.z],
|
||||
[block.x, block.y, block.z],
|
||||
self.operation.rt_module.shared_memory_capacity)
|
||||
|
||||
def GemmArguments(operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
|
||||
A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
|
||||
gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
|
||||
"""
|
||||
Argument wrapper for GEMM in CUTLASS 2 or 3. It returns either 2x arguments
|
||||
or 3x arguments depending on the `arch` field specified in `operation`.
|
||||
|
||||
:param operation: the GEMM operation to take the argument
|
||||
:type operation: :class:`pycutlass.GemmOperationUniversal` |
|
||||
:class:`pycutlass.GemmOperationGrouped`
|
||||
|
||||
:param problem_size: GEMM problem size gemm(M, N, K)
|
||||
:type operation: :class:`cutlass.gemm.GemmCoord`
|
||||
|
||||
:param A: tensor A
|
||||
:type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param B: tensor B
|
||||
:type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param C: tensor C
|
||||
:type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param D: tensor D
|
||||
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
|
||||
|
||||
:param gemm_mode: GEMM mode
|
||||
:type gemm_mode: :class:`cutlass.gemm.Mode`
|
||||
|
||||
:param output_op: output operator, optional
|
||||
:type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
|
||||
"""
|
||||
ArgClass = GemmArguments3x if operation.api == ApiVersion.v3x else GemmArguments2x
|
||||
return ArgClass(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
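The class-selection pattern used by the `GemmArguments` factory above can be mirrored in a standalone sketch. The `ApiVersion` enum matches the one added later in this diff; the argument classes and `FakeOp` here are hypothetical stand-ins (the real `GemmArguments2x`/`GemmArguments3x` carry far more state):

```python
import enum

class ApiVersion(enum.Enum):
    v2x = enum.auto()
    v3x = enum.auto()

class Args2x:
    """Stand-in for GemmArguments2x."""
    def __init__(self, operation):
        self.flavor = "2x"

class Args3x(Args2x):
    """Stand-in for GemmArguments3x."""
    def __init__(self, operation):
        super().__init__(operation)
        self.flavor = "3x"

class FakeOp:
    """Hypothetical operation carrying only the `api` field the factory inspects."""
    def __init__(self, api):
        self.api = api

def make_arguments(operation):
    # Factory mirroring GemmArguments: pick the argument class from operation.api
    cls = Args3x if operation.api == ApiVersion.v3x else Args2x
    return cls(operation)

print(make_arguments(FakeOp(ApiVersion.v3x)).flavor)  # 3x
print(make_arguments(FakeOp(ApiVersion.v2x)).flavor)  # 2x
```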


class GemmGroupedArguments:
    """
@ -383,7 +532,7 @@ class GemmGroupedArguments:
        # process the input arguments
        for idx, problem_size in enumerate(problem_sizes):
            M, N, K = problem_size.m(), problem_size.n(), problem_size.k()
            temp_argument = GemmArguments(
            temp_argument = GemmArguments2x(
                operation=operation,
                problem_size=cutlass.gemm.GemmCoord(M, N, K),
                A=A[idx], B=B[idx], C=C[idx], D=D[idx],
@ -657,16 +806,164 @@ extern "C" {
    #
        workspace_bytes = 4 * arguments.grid_tiled_shape.x * arguments.grid_tiled_shape.y

        # TODO: get extra workspace size
        # see https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm_universal_base.h
        return workspace_bytes


################################################################################
# Runtime module for GEMM Universal within CUTLASS 3
################################################################################

class GemmRTUniversal3x(GemmRTUniversal):
    """
    GemmRTUniversal3x manages the CUTLASS runtime components
    """
    KernelTemplate = r'''

using Operator = ${operation_name}${operation_suffix};
extern "C"
__global__ __launch_bounds__(Operator::MaxThreadsPerBlock, Operator::MinBlocksPerMultiprocessor)
void ${operation_name}(__grid_constant__ typename Operator::Params const params) {
  // Dynamic shared memory base pointer
  extern __shared__ char smem[];

  // Declare pointer to dynamic shared memory.
  Operator op;
  op(params, smem);
}
'''
    HostTemplate = r'''
extern "C" {
  // Get the size of params in bytes
  int ${operation_name}_get_param_size(){
    return sizeof(${operation_name}${operation_suffix}::Params);
  }

  // Get the size of dynamic shared memory in bytes
  int ${operation_name}_shared_memory_size() {
    return ${operation_name}${operation_suffix}::SharedStorageSize;
  }

  using GemmType = ${operation_name}_base;

  // Get the params as byte array
  char* ${operation_name}_get_params(GemmType::Arguments* argument, int* workspace){
    GemmType::Params params = GemmType::to_underlying_arguments(*argument, workspace);

    char *bytes = ((char*)(&params));
    char *output = new char[sizeof(GemmType::Params)];
    for (unsigned int i = 0; i < sizeof(GemmType::Params); i ++)
      output[i] = bytes[i];

    return output;
  }

  // Get the grid shape
  dim3 ${operation_name}_get_grid_shape(GemmType::Arguments* args, int* workspace) {
    auto tmp_params = GemmType::to_underlying_arguments(*args, workspace);
    return GemmType::get_grid_shape(tmp_params);
  }

  // Get the block shape
  dim3 ${operation_name}_get_block_shape() {
    return GemmType::get_block_shape();
  }
}
'''

    def __init__(self, operation: 'GemmOperation'):
        super(GemmRTUniversal3x, self).__init__(operation)
        self.extra_funcs = {
            'get_grid_shape': dim3_,
            'get_block_shape': dim3_
        }
        self.emitter = EmitGemmUniversalInstance3x('_type')
        self.argument_type, self.epilogue_type = get_gemm_arguments_3x(operation.epilogue_functor)


class EmitGemmUniversalInstance3x:
    ''' Responsible for emitting a CUTLASS 3 template definition'''

    def __init__(self, operation_suffix=''):
        self.operation_suffix = operation_suffix
        self.includes = [
            "cutlass/cutlass.h",
            "cute/tensor.hpp",
            "cute/atom/mma_atom.hpp",
            "cutlass/numeric_types.h",
            "cutlass/gemm/kernel/gemm_universal.hpp",
            "cutlass/gemm/collective/collective_builder.hpp",
            "cutlass/epilogue/collective/default_epilogue.hpp",
            "cutlass/epilogue/thread/linear_combination.h"
        ]
        self.gemm_template = """
using namespace cute;

${collective_op}

using EpilogueOp = cutlass::epilogue::collective::DefaultEpilogue<
    cutlass::gemm::TagToStrideC_t<${layout_c}>,
    cutlass::gemm::TagToStrideC_t<${layout_c}>,
    ${epilogue_functor}
>;

// Gemm operator ${operation_name}
using ${operation_name}_base = cutlass::gemm::kernel::GemmUniversal<
    Shape<int,int,int,int>,
    CollectiveOp,
    EpilogueOp
>;

// Define named type
struct ${operation_name}${operation_suffix} :
  public ${operation_name}_base { };
"""

    #
    def emit(self, operation):

        instance_layout_A, instance_layout_B, instance_layout_C = \
            (operation.A.layout, operation.B.layout, operation.C.layout)

        # Support built-in epilogue functors or user-defined functions
        epilogue_functor = operation.epilogue_functor.emit()

        collective_op = collective_op_builder.build(operation)

        values = {
            'operation_name': operation.procedural_name(),
            'operation_suffix': self.operation_suffix,
            'collective_op': collective_op,
            'element_a': DataTypeTag[operation.A.element],
            'layout_a': LayoutTag[instance_layout_A],
            'element_b': DataTypeTag[operation.B.element],
            'layout_b': LayoutTag[instance_layout_B],
            'element_c': DataTypeTag[operation.C.element],
            'layout_c': LayoutTag[instance_layout_C],
            'epilogue_functor': epilogue_functor,
            'element_output': DataTypeTag[operation.epilogue_functor.element_output],
            'element_accumulator': DataTypeTag[operation.accumulator_type()],
            'element_epilogue': DataTypeTag[operation.epilogue_functor.element_epilogue],
            'epilogue_vector_length': str(operation.epilogue_functor.epilogue_vector_length),
            'opcode_class': OpcodeClassTag[operation.tile_description.math_instruction.opcode_class],
            'arch': "cutlass::arch::Sm%d" % operation.arch,
            'threadblock_shape_m': str(operation.tile_description.threadblock_shape[0]),
            'threadblock_shape_n': str(operation.tile_description.threadblock_shape[1]),
            'threadblock_shape_k': str(operation.tile_description.threadblock_shape[2]),
            'cluster_shape_m': str(operation.tile_description.cluster_shape[0]),
            'cluster_shape_n': str(operation.tile_description.cluster_shape[1]),
            'cluster_shape_k': str(operation.tile_description.cluster_shape[2]),
            'align_a': str(operation.A.alignment),
            'align_b': str(operation.B.alignment)
        }

        values['epilogue_functor'] = operation.epilogue_functor.emit()
        return SubstituteTemplate(self.gemm_template, values)


###################################################################################################
# Runtime module for GEMM Grouped
###################################################################################################


class GemmRTGrouped(GemmRTbase):
    """
    GemmRTGrouped manages the CUTLASS runtime components
@ -713,7 +1010,7 @@ class GemmRTGrouped(GemmRTbase):

    def __init__(self, operation: 'GemmOperation'):
        super(GemmRTGrouped, self).__init__(operation)
        self.extra_funcs = ['precompute']
        self.extra_funcs = {'precompute': None}

        self.emitter = EmitGemmGroupedInstance('_type')
        self.argument_type, self.epilogue_type = get_gemm_grouped_arguments(operation.epilogue_functor)
@ -761,7 +1058,7 @@ class GemmOperationBase:
        self, gemm_kind, arch, tile_description: TileDescription,
        A: TensorDescription, B: TensorDescription, C: TensorDescription,
        epilogue_functor,
        swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
        swizzling_functor=cutlass.IdentitySwizzle1, api=False, **kwargs):

        #: operation kind
        self.operation_kind: OperationKind = OperationKind.Gemm
@ -772,8 +1069,11 @@ class GemmOperationBase:
        #: gemm kind
        self.gemm_kind: GemmKind = gemm_kind

        self.api = api
        self.prefix = "3x" if self.api == ApiVersion.v3x else ""

        # use deep copy to avoid overwriting the original TensorDescription
        if C.layout == cutlass.ColumnMajor:
        if self.api != ApiVersion.v3x and C.layout == cutlass.ColumnMajor:
            #: Operand A
            self.A: TensorDescription = copy.deepcopy(B)
            #: Operand B
@ -800,7 +1100,6 @@ class GemmOperationBase:
            self.direct_store = kwargs["direct_store"]
        else:
            self.direct_store = False

        if "visitor" in kwargs:
            self.visitor = kwargs["visitor"]
        else:
@ -872,8 +1171,11 @@ class GemmOperationBase:
        math_op_string = math_operations_map[math_op] if math_op in math_operations_map.keys(
        ) else ''

        inst_shape = "%d%d%d" % tuple(
            self.tile_description.math_instruction.instruction_shape)
        if self.tile_description.math_instruction.instruction_shape is not None:
            inst_shape = "%dx%dx%d" % tuple(
                self.tile_description.math_instruction.instruction_shape)
        else:
            inst_shape = "Default"
        inst_shape += math_op_string

        if self.tile_description.math_instruction.element_a != self.A.element and \
@ -905,6 +1207,17 @@ class GemmOperationBase:

        return extended_name

    #
    def extended_name_3x(self):
        '''Generates a string representing the MMA atom. Assumes accumulator type is C type.'''
        extended_name = "{core_name}_{element_a}_{element_b}_{element_acc}_{element_c}".format(
            element_a = DataTypeNames[self.A.element],
            element_b = DataTypeNames[self.B.element],
            element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator],
            element_c = DataTypeNames[self.C.element],
            core_name = self.core_name())
        return extended_name

    #
    def layout_name(self):
        if self.is_complex() or self.is_planar_complex():
@ -916,25 +1229,49 @@ class GemmOperationBase:
            )
        return "%s%s" % (ShortLayoutTypeNames[self.A.layout], ShortLayoutTypeNames[self.B.layout])

    # Generates a short string representing the ABC layout tags (e.g. ntn or tnn)
    def layout_name_3x(self):
        if self.is_complex() or self.is_planar_complex():
            return "{}{}{}".format(
                ShortComplexLayoutNames[(self.A.layout, self.A.complex_transform)],
                ShortComplexLayoutNames[(self.B.layout, self.B.complex_transform)],
                ShortComplexLayoutNames[(self.C.layout, self.C.complex_transform)])
        else:
            return "{}{}{}".format(
                ShortLayoutTypeNames[self.A.layout],
                ShortLayoutTypeNames[self.B.layout],
                ShortLayoutTypeNames[self.C.layout])

    #
    def procedural_name(self):
        ''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
        threadblock = self.tile_description.procedural_name()

        opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]

        alignment = max([self.A.alignment, self.B.alignment, self.C.alignment])

        return SubstituteTemplate(
            "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}",
            {
                'opcode_class': opcode_class_name,
                'extended_name': self.extended_name(),
                'threadblock': threadblock,
                'layout': self.layout_name(),
                'alignment': "%d" % self.A.alignment,
            }
        )
        if self.api == ApiVersion.v3x and self.arch >= 90:
            kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{l}_{s}_align{al}"
            return kernel_name_template.format(
                p = self.prefix,
                ar = self.arch,
                op = opcode_class_name,
                ex = self.extended_name_3x(),
                tbm = self.tile_description.threadblock_shape[0],
                tbn = self.tile_description.threadblock_shape[1],
                tbk = self.tile_description.threadblock_shape[2],
                cm = self.tile_description.cluster_shape[0],
                cn = self.tile_description.cluster_shape[1],
                ck = self.tile_description.cluster_shape[2],
                l = self.tile_description.stages,
                s = self.layout_name_3x(),
                al = str(self.A.alignment))
        else:
            threadblock = self.tile_description.procedural_name()
            return "cutlass{p}_sm{ar}_{op}_{ex}_{tb}_{l}_align{a}".format(
                p = self.prefix,
                ar = self.arch,
                op = opcode_class_name,
                ex = self.extended_name(),
                tb = threadblock,
                l = self.layout_name(),
                a = str(self.A.alignment))

    #
    def configuration_name(self):
@ -945,9 +1282,14 @@ class GemmOperationBase:
class GemmOperationUniversal(GemmOperationBase):
    def __init__(self, arch, tile_description: TileDescription, A: TensorDescription, B, C,
                 epilogue_functor, swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
        api = api_version(arch, tile_description.math_instruction.opcode_class, A.element)
        super(GemmOperationUniversal, self).__init__(GemmKind.Universal, arch, tile_description,
                                                     A, B, C, epilogue_functor, swizzling_functor, **kwargs)
        self.rt_module = GemmRTUniversal(self)
                                                     A, B, C, epilogue_functor, swizzling_functor,
                                                     api=api, **kwargs)
        if api == ApiVersion.v3x:
            self.rt_module = GemmRTUniversal3x(self)
        else:
            self.rt_module = GemmRTUniversal(self)
        self.argument_type = self.rt_module.argument_type
        self.epilogue_type = self.rt_module.epilogue_type


@ -36,6 +36,7 @@ import re

import enum
import cutlass
import cute

# The following block implements enum.auto() for Python 3.5 variants that don't include it such
# as the default 3.5.2 on Ubuntu 16.04.
@ -182,6 +183,30 @@ DataTypeSize = {
    cutlass.dtype.cs64: 128,
}


class DataTypeSizeBytes:
    """
    Static class to mimic the `DataTypeSize` dictionary, but with checks for whether the
    data type key is less than a full byte or a non-integer number of bytes.
    """
    @staticmethod
    def __class_getitem__(datatype):
        """
        Returns the number of bytes in size the data type is. Raises an exception if the data type
        is either less than a full byte or a non-integer number of bytes in size.

        :param datatype: data type to query

        :return: number of bytes the data type occupies
        :rtype: int
        """
        bits = DataTypeSize[datatype]
        if bits < 8:
            raise Exception('Data type {} is less than one byte in size.'.format(datatype))
        elif bits % 8 != 0:
            raise Exception('Data type {} is not an integer number of bytes.'.format(datatype))
        return bits // 8
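The `__class_getitem__` hook above lets a class be indexed like a dictionary (`DataTypeSizeBytes[dtype]`) without instantiating it. A minimal standalone version of the same pattern, using a toy bit-width table in place of the real `DataTypeSize` dictionary:

```python
# Toy stand-in for the cutlass DataTypeSize table (bits per element)
DataTypeSize = {"f16": 16, "f32": 32, "s4": 4}

class DataTypeSizeBytes:
    """Dictionary-like class converting bit widths to whole bytes, with validity checks."""
    @staticmethod
    def __class_getitem__(datatype):
        bits = DataTypeSize[datatype]
        if bits < 8:
            raise Exception('Data type {} is less than one byte in size.'.format(datatype))
        elif bits % 8 != 0:
            raise Exception('Data type {} is not an integer number of bytes.'.format(datatype))
        return bits // 8

print(DataTypeSizeBytes["f32"])  # 4
print(DataTypeSizeBytes["f16"])  # 2
```

Indexing the class directly works because Python falls back to `__class_getitem__` when the metaclass defines no `__getitem__` (PEP 560); sub-byte types such as `s4` raise rather than silently truncating.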

###################################################################################################
#

@ -350,6 +375,12 @@ ShortComplexLayoutNames = {
    (cutlass.RowMajor, cutlass.complex_transform.conj): 'h'
}

#
CuTeLayoutTag = {
    cute.GMMAMajor.K: 'cute::GMMA::Major::K',
    cute.GMMAMajor.MN: 'cute::GMMA::Major::MN'
}

###################################################################################################

#
@ -436,7 +467,6 @@ OpcodeClassTag = {

#


class OperationKind(enum.Enum):
    Gemm = enum_auto()
    RankK = enum_auto()
@ -460,16 +490,19 @@ ArchitectureNames = {
    70: 'volta',
    75: 'turing',
    80: 'ampere',
    90: 'hopper'
}

#
SharedMemPerCC = {
    70: 96,   # 96KB of SMEM
    72: 96,   # 96KB of SMEM
    75: 64,   # 64KB of SMEM
    80: 160,  # 164KB of SMEM - 4KB reserved for the driver
    86: 100,  # 100KB of SMEM
    87: 160,  # 164KB of SMEM - 4KB reserved for the driver
    70: 96 << 10,   # 96KB of SMEM
    72: 96 << 10,   # 96KB of SMEM
    75: 64 << 10,   # 64KB of SMEM
    80: 160 << 10,  # 164KB of SMEM - 4KB reserved for the driver
    86: 100 << 10,  # 100KB of SMEM
    87: 160 << 10,  # 164KB of SMEM - 4KB reserved for the driver
    89: 100 << 10,  # 100KB of SMEM
    90: 227 << 10,  # 228KB of SMEM - 1KB reserved for the driver
}

###################################################################################################
@ -646,7 +679,21 @@ ConvModeTag = {


class MathInstruction:
    """
    Description of the lowest-level matrix-multiply-accumulate operation to be used in a kernel
    """
    def __init__(self, instruction_shape, element_a, element_b, element_accumulator, opcode_class=cutlass.OpClass.Simt, math_operation=MathOperation.multiply_add):
        """
        :param instruction_shape: size of the [M, N, K] dimensions of the instruction
        :type instruction_shape: list or tuple
        :param element_a: data type of operand A
        :param element_b: data type of operand B
        :param element_accumulator: data type used in accumulation
        :param opcode_class: higher-level class of the instruction (e.g., SIMT or Tensor Core)
        :type opcode_class: cutlass.OpClass
        :param math_operation: the type of low-level operation to be performed (e.g., multiply accumulate)
        :type math_operation: MathOperation
        """
        self.instruction_shape = instruction_shape
        self.element_a = element_a
        self.element_b = element_b
@ -658,24 +705,65 @@ class MathInstruction:


class TileDescription:

    def __init__(self, threadblock_shape, stages, warp_count, math_instruction):
    """
    Description of a tile of computation to be performed in the kernel, encompassing threadblock, cluster, and warp shapes,
    stage count, and math instruction specification
    """
    def __init__(self, threadblock_shape, stages, warp_count, math_instruction, cluster_shape=[1, 1, 1], persistent=False):
        """
        :param threadblock_shape: shape of a threadblock tile
        :type threadblock_shape: list or tuple
        :param stages: number of pipeline stages in the operation. For SM90 kernels, this can be set to `None` and the maximum
                       number of stages that can be supported for an operation on a given architecture will be computed at a later time
        :type stages: int or None
        :param warp_count: number of warps in each [M, N, K] dimension of a threadblock tile
        :type warp_count: list, tuple, or None
        :param math_instruction: specification of the instruction type and shape to be performed and the types of its operands
        :type math_instruction: MathInstruction
        :param cluster_shape: number of threadblocks in the [X, Y, Z] dimensions of a threadblock cluster
        :param persistent: whether the kernel uses persistent warp-specialized threadblocks (only available for SM90+)
        :type persistent: bool
        """
        self.threadblock_shape = threadblock_shape

        #: number of pipeline stages
        self.cluster_shape = cluster_shape
        self.persistent: bool = persistent
        self.stages: int = stages

        #: number of warps along x, y, z directions
        self.warp_count: list[int] = warp_count
        self.math_instruction = math_instruction

        #: number of threads per threadblock
        self.num_threads: int = 32
        for cnt in self.warp_count:
            self.num_threads *= cnt
        # Number of warps along x, y, z directions
        self.warp_count = warp_count

    @property
    def num_threads(self):
        """
        Returns the number of threads in the threadblock

        :return: number of threads in the threadblock
        :rtype: int or None (if warp count is None)
        """
        if self.warp_count is not None:
            threads = 32
            for cnt in self.warp_count:
                threads *= cnt
            return threads
        return None

    def procedural_name(self):
        return "%dx%d_%dx%d" % (self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], self.stages)
        """
        Returns a name identifying the tile description

        :return: name identifying the tile description
        :rtype: str
        """
        emit_stages = 0 if self.stages is None else self.stages
        name = "%dx%dx%d_%dx%d_%dx%d" % (
            self.cluster_shape[0], self.cluster_shape[1], self.cluster_shape[2],
            self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], emit_stages)

        if self.persistent:
            name += '_persistent'
        return name
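The reworked `TileDescription` replaces the eagerly computed `num_threads` attribute with a property (so a deferred `warp_count` of `None` is tolerated) and folds the cluster shape into the procedural name. A standalone sketch of just those two pieces (a simplified class, omitting the math-instruction fields):

```python
class TileDesc:
    """Simplified stand-in for TileDescription: only shape/stage naming logic."""
    def __init__(self, threadblock_shape, stages, warp_count,
                 cluster_shape=[1, 1, 1], persistent=False):
        self.threadblock_shape = threadblock_shape
        self.stages = stages
        self.warp_count = warp_count
        self.cluster_shape = cluster_shape
        self.persistent = persistent

    @property
    def num_threads(self):
        # 32 threads per warp; None when the warp count is deferred
        if self.warp_count is None:
            return None
        threads = 32
        for cnt in self.warp_count:
            threads *= cnt
        return threads

    def procedural_name(self):
        # None stages (SM90 auto-stage-count) are emitted as 0
        emit_stages = 0 if self.stages is None else self.stages
        name = "%dx%dx%d_%dx%d_%dx%d" % (
            self.cluster_shape[0], self.cluster_shape[1], self.cluster_shape[2],
            self.threadblock_shape[0], self.threadblock_shape[1],
            self.threadblock_shape[2], emit_stages)
        if self.persistent:
            name += '_persistent'
        return name

td = TileDesc([128, 128, 64], stages=None, warp_count=[4, 2, 1], cluster_shape=[2, 1, 1])
print(td.num_threads)        # 256
print(td.procedural_name())  # 2x1x1_128x128_64x0
```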
|
||||
|
||||
#
|
||||
|
||||
@ -715,30 +803,68 @@ class TriangularTensorDescription:
|
||||
###################################################################################################
|
||||
|
||||
#
|
||||
def CalculateSmemUsagePerStage(operation):
    """
    Returns the amount of shared memory in bytes consumed in a single stage of a kernel.

    :param op: operation for which the maximum stages should be computed. If stages are
               set via the `op.tile_description.stages` parameter, this setting is ignored
               in the present calculation
    :type op: pycutlass.Operation

def CalculateSmemUsage(operation):
    cta_shape = operation.tile_description.threadblock_shape
    stages = operation.tile_description.stages
    :return: number of bytes of shared memory consumed by a single stage
    :rtype: int
    """
    m, n, k = operation.tile_description.threadblock_shape

    if operation.operation_kind == OperationKind.Gemm and operation.gemm_kind == GemmKind.Sparse:
        # Elements represented by 8 bits of metadata (based on 4:8, 2:4 or 1:2 sparsity)
        if DataTypeSize[operation.A.element] == 32:
            elements_per_8b_md = 2
        elif DataTypeSize[operation.A.element] == 4:
            elements_per_8b_md = 8
        else:
            elements_per_8b_md = 4

        smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * (cta_shape[2] // 2) // 8 + \
            DataTypeSize[operation.B.element] * cta_shape[1] * cta_shape[2] // 8 + \
            cta_shape[0] * (cta_shape[2] // 2) // elements_per_8b_md
    if operation.operation_kind == OperationKind.Gemm:
        stage_barrier_bytes = 32
        return (DataTypeSize[operation.A.element] * m * k // 8) + \
            (DataTypeSize[operation.B.element] * k * n // 8) + stage_barrier_bytes
    else:
        # Few BLAS3 operations only have A tensor
        smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * cta_shape[2] // 8 + \
            DataTypeSize[operation.A.element] * \
            cta_shape[1] * cta_shape[2] // 8
    raise Exception('Unsupported operation kind {}.'.format(operation.operation_kind))


#
def CalculateSmemUsage(operation):
    """
    Returns the amount of shared memory in bytes consumed by a kernel.

    :param op: operation for which the maximum stages should be computed. If stages are
               set via the `op.tile_description.stages` parameter, this setting is ignored
               in the present calculation
    :type op: pycutlass.Operation

    :return: int
    """
    return operation.tile_description.stages * CalculateSmemUsagePerStage(operation)
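The per-stage versus total accounting above can be sketched numerically. This is a standalone sketch, not pycutlass's implementation: `DataTypeSize` here is a hypothetical bits-per-element table, and the function names are illustrative only.

```python
# Standalone sketch of the per-stage shared-memory math for a dense GEMM tile.
# DataTypeSize is a hypothetical bits-per-element table, not pycutlass's own.
DataTypeSize = {"f16": 16, "f32": 32}

def smem_per_stage_bytes(elem_a, elem_b, m, n, k, stage_barrier_bytes=32):
    # One stage holds an M x K tile of A and a K x N tile of B, plus a small barrier.
    a_bytes = DataTypeSize[elem_a] * m * k // 8
    b_bytes = DataTypeSize[elem_b] * k * n // 8
    return a_bytes + b_bytes + stage_barrier_bytes

def smem_total_bytes(stages, per_stage):
    # Total usage is simply stages * per-stage usage.
    return stages * per_stage

per_stage = smem_per_stage_bytes("f16", "f16", 128, 128, 32)
print(per_stage, smem_total_bytes(3, per_stage))
```

For a 128x128x32 f16 tile this gives 8192 + 8192 + 32 = 16416 bytes per stage.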


class ApiVersion(enum.Enum):
    """
    Differentiate between CUTLASS 2.x and 3.x API versions
    """
    v2x = enum_auto()
    v3x = enum_auto()


def api_version(arch, opclass, datatype):
    """
    Returns whether the architecture, opcode class, and datatype in question require using CUTLASS 2.x
    or 3.x for code emission.

    :param arch: compute capability of device on which to run
    :type arch: int
    :param opclass: class of the operation being performed
    :type opclass: cutlass.OpClass
    :param datatype: data type to be used in operation (assumes that ElementA and ElementB are the same)

    :return: API version to be used in code emission
    :rtype: ApiVersion
    """
    if arch >= 90 and opclass == cutlass.OpClass.TensorOp and (datatype != cutlass.float64):
        return ApiVersion.v3x
    else:
        return ApiVersion.v2x

    smem_usage = smem_per_stage * stages
    return (smem_usage >> 10)
###################################################################################################
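The dispatch rule in `api_version` can be restated in a self-contained form. This is a sketch only: plain strings stand in for the real `cutlass` enum values.

```python
def choose_api_version(arch, opclass, datatype):
    # SM90+ TensorOp kernels on non-f64 data use the CUTLASS 3.x emitters;
    # everything else falls back to 2.x. Strings stand in for the real enums.
    if arch >= 90 and opclass == "TensorOp" and datatype != "float64":
        return "3.x"
    return "2.x"
```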

@@ -32,6 +32,12 @@

import ctypes
from cuda import cuda
from pycutlass.utils.device import device_cc

from cuda import __version__ as __cuda_version__
_version_splits = [int(x) for x in __cuda_version__.split('.')]
supports_cluster_launch = device_cc() >= 90 and (_version_splits[0] > 11 or (_version_splits[0] == 11 and _version_splits[1] >= 8))


################################################################################
#
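The `supports_cluster_launch` predicate above combines a device check with a CUDA version check. A minimal sketch of the same rule, with the compute capability and version string passed in explicitly (the function name is illustrative):

```python
def cluster_launch_supported(cc, cuda_version):
    # Thread block clusters require an SM90+ device and CUDA 11.8 or newer,
    # matching the `supports_cluster_launch` check above.
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    return cc >= 90 and (major, minor) >= (11, 8)
```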
@@ -90,21 +96,58 @@ class ExecutableOperation:
    def initialize(self, host_workspace, device_workspace, launch_config, arguments, stream=cuda.CUstream(0)):
        raise NotImplementedError()


    #
    def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
    def run_with_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
        if hasattr(self.operation, 'tile_description') and hasattr(self.operation.tile_description, 'cluster_shape'):
            attr = cuda.CUlaunchAttribute()
            attr.value.clusterDim.x, attr.value.clusterDim.y, attr.value.clusterDim.z = self.operation.tile_description.cluster_shape
            attr.id = cuda.CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
            attrs = [attr]

        cArg = (ctypes.c_char * len(host_workspace)
                ).from_buffer(host_workspace)
        packed = (ctypes.c_void_p * 1)()
        packed[0] = ctypes.addressof(cArg)
            # Allow for non-portable cluster sizes
            err, = cuda.cuFuncSetAttribute(
                self.kernel, cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NON_PORTABLE_CLUSTER_SIZE_ALLOWED, 1)
            if err != cuda.CUresult.CUDA_SUCCESS:
                return err
        else:
            attrs = []

        config = cuda.CUlaunchConfig()
        config.gridDimX, config.gridDimY, config.gridDimZ = launch_config.grid
        config.blockDimX, config.blockDimY, config.blockDimZ = launch_config.block
        config.blockDimZ = launch_config.block[2]
        config.sharedMemBytes = launch_config.shared_memory_capacity
        config.hStream = stream
        config.attrs = attrs
        config.numAttrs = len(attrs)

        err, = cuda.cuLaunchKernelEx(config, f=self.kernel, kernelParams=kernel_params, extra=0)
        return err


    #
    def run_without_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
        err, = cuda.cuLaunchKernel(
            self.kernel,
            launch_config.grid[0], launch_config.grid[1], launch_config.grid[2],
            launch_config.block[0], launch_config.block[1], launch_config.block[2],
            launch_config.shared_memory_capacity,
            stream,
            packed,
            kernel_params,
            0)

        return err


    #
    def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
        cArg = (ctypes.c_char * len(host_workspace)
                ).from_buffer(host_workspace)
        packed = (ctypes.c_void_p * 1)()
        packed[0] = ctypes.addressof(cArg)

        if supports_cluster_launch:
            return self.run_with_clusters(launch_config, packed, stream)
        else:
            return self.run_without_clusters(launch_config, packed, stream)

@@ -543,7 +543,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
        self.elements_per_access = elements_per_access
        self.element_compute = element_compute
        self.element_output = element_output
        # TODO: deprecate this
        self.elementwise_functor = elementwise_functor
        pass

@@ -554,11 +553,8 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
        #
        tree = function.epilogue_tree
        self.tree = tree
        # self.tree.show() # for debug
        function.pass_binary_2_unary(self.tree, self.tree.root)
        # self.tree.show() # for debug
        function.pass_inject_reduction(self.tree, self.tree.root)
        # self.tree.show() # for debug
        function.pass_inject_epilogue_op(self.tree, self.tree.root)

        visitor = self.tree.get_node(self.tree.root).data.epilogue_node
@@ -575,7 +571,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
            if input_key == "accum":
                continue
            if function.input_args[input_key][0] == "scalar":
                # _kwargs[input_key] = kwargs[input_key]
                continue
            # tensor input
            else:

@@ -265,15 +265,6 @@ class Conv2dLauncher:

        flops_total_ = flops_mainloop_ + flops_epilogue_

        # TODO complex-value support
        # switch (operation_desc.tile_description.math_instruction.math_operation) {
        #   case library::MathOperationID::kMultiplyAddComplex:
        #     flops_total_ *=4;
        #     break;

        #   default: break;
        # }

        return flops_total_

@@ -511,9 +502,8 @@ class Conv2dLauncher:
# (conv_blacklist_sizes)
############################################################################################################

def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False): # TODO: conv_test_sizes and conv_blacklist_sizes
def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False):
    passed = True

    #
    # Testbed object
    #
@@ -529,8 +519,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
    # Vector of conv2d problem sizes to avoid duplicate runs
    conv_tested_sizes = []

    # TODO: include resnet 50 sizes, user specified sizes, and rigorous sizes

    # Flatten 2D problem_vectors into a 1D problem sizes
    problem_sizes = conv_problems.conv2d_default_sizes

@@ -539,7 +527,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
    # Sweep conv2d problem sizes (split-k-mode=kSerial, split-k-slices=1, alpha=1.0, beta=0.0)
    for conv_problem in problem_sizes:

        # TODO: skip blacklist problem sizes
        if conv_problem in conv_tested_sizes:
            continue

@@ -585,9 +572,8 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave

        passed = testbed.run(conv_problem)

        # if not passed: return False

        # TODO: If CUTLASS_UNIT_TEST_PROBLEM_COUNT is set reduce the number of tested problem counts
        if not passed:
            return False

    if interleaved:
        return True
@@ -184,7 +184,7 @@ class TestbedGrouped:
        arguments.sync()

        #
        # Reference check - TODO: support caching results
        # Reference check
        #
        alpha = self.compute_type(alpha).value()
        beta = self.compute_type(beta).value()

@@ -33,6 +33,7 @@
from time import sleep
import pycutlass
from pycutlass import *
import pycutlass.utils.datatypes as datatypes
import cutlass
from cuda import cudart
from cuda import cuda
@@ -52,16 +53,22 @@ def transpose(layout):
        return cutlass.ColumnMajorInterleaved32


def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout):
def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout, batch_offset: int = 0):
    ptr = tensor.__array_interface__['data'][0]
    if operand == "a":
        tensor_coord = problem_size.mk()
        batch_stride = problem_size.m() * problem_size.k()
    elif operand == "b":
        tensor_coord = problem_size.kn()
        batch_stride = problem_size.k() * problem_size.n()
    elif operand in ["c", "d"]:
        tensor_coord = problem_size.mn()
        batch_stride = problem_size.m() * problem_size.n()
    else:
        raise ValueError("unknonw operand: " + operand)
        raise ValueError("Unknown operand: " + operand)

    elt_size = DataTypeSizeBytes[datatypes.to_cutlass(tensor.dtype)]
    ptr += batch_offset * batch_stride * elt_size

    if layout == cutlass.RowMajor:
        layout = cutlass.RowMajor.packed(tensor_coord)
@@ -96,8 +103,8 @@ def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, opera
    return getattr(cutlass, ref_name)(ptr, layout)
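The `batch_offset` handling added to `getTensorRef` is plain pointer arithmetic: advance the base pointer by the per-batch element count times the element size. A minimal sketch (the function name is illustrative):

```python
def batched_pointer(base_ptr, batch_offset, batch_stride_elems, elt_size_bytes):
    # Advance a raw device pointer to the start of batch `batch_offset`,
    # as getTensorRef does: offset * per-batch element count * element size.
    return base_ptr + batch_offset * batch_stride_elems * elt_size_bytes
```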

def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str):
    tensor_ref = getTensorRef(tensor, problem_size, operand, layout)
def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str, batch_offset: int = 0):
    tensor_ref = getTensorRef(tensor, problem_size, operand, layout, batch_offset)

    if operand == "a":
        tensor_coord = problem_size.mk()
@@ -106,7 +113,7 @@ def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, oper
    elif operand in ["c", "d"]:
        tensor_coord = problem_size.mn()
    else:
        raise ValueError("unknonw operand: " + operand)
        raise ValueError("Unknown operand: " + operand)

    if layout == cutlass.RowMajor:
        layout_tag = "RowMajor"
@@ -168,7 +175,12 @@ class GemmUniversalLauncher:
        # Compile the operator
        #

        pycutlass.compiler.add_module([operation, self.reduction_operation])
        op_list = [operation]
        if operation.arch < 90:
            # Split K via Python is currently only supported for pre-SM90 kernels
            op_list.append(self.reduction_operation)

        pycutlass.compiler.add_module(op_list)

        self.operation = operation

@@ -206,8 +218,10 @@ class GemmUniversalLauncher:
    def print_problem_size(self, p, mode, batch_count):
        if mode == cutlass.gemm.Mode.Gemm:
            mode = "Gemm"
        elif mode == cutlass.gemm.Mode.Batched:
            mode = "GemmBatched"
        elif mode == cutlass.gemm.Mode.GemmSplitKParallel:
            mode = "GemmSplitKParalel"
            mode = "GemmSplitKParallel"
        problem_size = "problem: %d, %d, %d\n batch_count: %d\n mode: %s" % (
            p.m(), p.n(), p.k(), batch_count, mode)
        print(problem_size)
@@ -251,8 +265,7 @@ class GemmUniversalLauncher:
            tensor_ref_B, reordered_tensor_ref_B, problem_size)
        return reordered_tensor_B

    def host_reference(self, problem_size, tensor_A, tensor_B, tensor_C, alpha, beta):
        # TODO
    def host_reference(self, problem_size, batch_count, tensor_A, tensor_B, tensor_C, alpha, beta):
        tensor_D_ref = np.ones_like(tensor_C)
        alpha = self.numpy_type(self.compute_type)(alpha)
        beta = self.numpy_type(self.compute_type)(beta)
@@ -262,42 +275,46 @@ class GemmUniversalLauncher:
        beta = self.compute_type(beta).value()
        init_acc = self.accumulator_type(init_acc).value()

        if self.operation.switched:
            tensor_ref_A = getTensorRef(
                tensor_A, problem_size, "a", transpose(self.operation.B.layout))
            tensor_ref_B = getTensorRef(
                tensor_B, problem_size, "b", transpose(self.operation.A.layout))
            tensor_ref_C = getTensorRef(
                tensor_C, problem_size, "c", transpose(self.operation.C.layout))
            tensor_ref_D_ref = getTensorRef(
                tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout))
        else:
            tensor_ref_A = getTensorRef(
                tensor_A, problem_size, "a", self.operation.A.layout)
            tensor_ref_B = getTensorRef(
                tensor_B, problem_size, "b", self.operation.B.layout)
            tensor_ref_C = getTensorRef(
                tensor_C, problem_size, "c", self.operation.C.layout)
            tensor_ref_D_ref = getTensorRef(
                tensor_D_ref, problem_size, "d", self.operation.C.layout)
        for i in range(batch_count):
            if self.operation.switched:
                tensor_ref_A = getTensorRef(
                    tensor_A, problem_size, "a", transpose(self.operation.B.layout), batch_offset=i)
                tensor_ref_B = getTensorRef(
                    tensor_B, problem_size, "b", transpose(self.operation.A.layout), batch_offset=i)
                tensor_ref_C = getTensorRef(
                    tensor_C, problem_size, "c", transpose(self.operation.C.layout), batch_offset=i)
                tensor_ref_D_ref = getTensorRef(
                    tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout), batch_offset=i)
            else:
                tensor_ref_A = getTensorRef(
                    tensor_A, problem_size, "a", self.operation.A.layout, batch_offset=i)
                tensor_ref_B = getTensorRef(
                    tensor_B, problem_size, "b", self.operation.B.layout, batch_offset=i)
                tensor_ref_C = getTensorRef(
                    tensor_C, problem_size, "c", self.operation.C.layout, batch_offset=i)
                tensor_ref_D_ref = getTensorRef(
                    tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)

        if self.math_operation in [MathOperation.multiply_add_saturate]:
            cutlass.test.gemm.host.gemm_saturate(
                problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
        else:
            cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
                tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
            if self.math_operation in [MathOperation.multiply_add_saturate]:
                cutlass.test.gemm.host.gemm_saturate(
                    problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
            else:
                cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
                    tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)

        return tensor_D_ref
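The batched host reference above slices flat buffers batch by batch and runs one reference GEMM per slice. The same pattern can be sketched self-contained with NumPy; the layout here is assumed row-major and batch-contiguous, matching how the launcher sizes its 1-D buffers:

```python
import numpy as np

def batched_gemm_reference(A, B, C, m, n, k, batch_count, alpha=1.0, beta=0.0):
    # Flat 1-D buffers laid out batch-by-batch in row-major order, mirroring
    # how the launcher above slices each batch via batch_offset.
    D = np.zeros_like(C)
    for i in range(batch_count):
        a = A[i * m * k:(i + 1) * m * k].reshape(m, k)
        b = B[i * k * n:(i + 1) * k * n].reshape(k, n)
        c = C[i * m * n:(i + 1) * m * n].reshape(m, n)
        D[i * m * n:(i + 1) * m * n] = (alpha * (a @ b) + beta * c).ravel()
    return D
```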
    def equal(self, tensor_D, tensor_D_ref, problem_size):
    def equal(self, tensor_D, tensor_D_ref, problem_size, batch_count):
        for i in range(batch_count):
            tensor_view_D = getTensorView(
                tensor_D, problem_size, "d", self.operation.C.layout, batch_offset=i)
            tensor_view_D_ref = getTensorView(
                tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)

        tensor_view_D = getTensorView(
            tensor_D, problem_size, "d", self.operation.C.layout)
        tensor_view_D_ref = getTensorView(
            tensor_D_ref, problem_size, "d", self.operation.C.layout)
            if not cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref):
                return False

        return cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref)
        return True

    def bytes(self, problem_size, batch_count=1, alpha=1.0, beta=0.0):
        m = problem_size.m()
@@ -321,9 +338,8 @@ class GemmUniversalLauncher:
        n = problem_size.n()
        k = problem_size.k()

        flops_ = (m * n * k + m * n) * 2 * batch_count
        flops_ = (m * n * k) * 2 * batch_count

        # TODO: complex
        return flops_
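The corrected flop count above drops the `m * n` epilogue term and counts two flops per multiply-accumulate. As a self-contained sketch (the function name is illustrative):

```python
def gemm_flops(m, n, k, batch_count=1):
    # Two flops (multiply + add) per element of the m*n*k inner-product volume,
    # matching the count above that drops the m*n epilogue term.
    return 2 * m * n * k * batch_count
```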
    def run_cutlass_profiler(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):
@@ -368,21 +384,25 @@ class GemmUniversalLauncher:

        return runtime

    def run(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):

    def run(self, mode, problem_size, batch_count=1, split_k_slices=1, alpha=1.0, beta=0.0):
        assert get_allocated_size(
        ) == 0, "%d byte of pool memory is not released in previous run" % get_allocated_size()

        np.random.seed(self.seed)

        # Assign an actual batch count in cases where we are not running in batched mode.
        # This is to differentiate between the number of split K slices and the batch count,
        # which are overloaded within the single `batch_count` variable.
        true_batch_count = batch_count if mode == cutlass.gemm.Mode.Batched else 1

        tensor_A = self.uniform_init(
            size=(problem_size.m() * problem_size.k(),), dtype=self.dtype_A)
            size=(problem_size.m() * problem_size.k() * true_batch_count,), dtype=self.dtype_A)
        tensor_B = self.uniform_init(
            size=(problem_size.n() * problem_size.k(),), dtype=self.dtype_B)
            size=(problem_size.n() * problem_size.k() * true_batch_count,), dtype=self.dtype_B)
        tensor_C = self.uniform_init(
            size=(problem_size.m() * problem_size.n(),), dtype=self.dtype_C)
            size=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_C)
        tensor_D = np.zeros(
            shape=(problem_size.m() * problem_size.n(),), dtype=self.dtype_D)
            shape=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_D)

        #
        # Launch kernel
@@ -392,14 +412,14 @@ class GemmUniversalLauncher:
            operation=self.operation, problem_size=problem_size,
            A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
            output_op=self.operation.epilogue_type(alpha, beta),
            gemm_mode=mode, split_k_slices=batch_count
            gemm_mode=mode, split_k_slices=split_k_slices, batch=batch_count
        )

        if mode == cutlass.gemm.Mode.GemmSplitKParallel:
            reduction_arguments = ReductionArguments(
                self.reduction_operation, problem_size=[
                    problem_size.m(), problem_size.n()],
                partitions=batch_count,
                partitions=split_k_slices,
                workspace=arguments.ptr_D,
                destination=tensor_D,
                source=tensor_C,
@@ -419,8 +439,8 @@ class GemmUniversalLauncher:
        else:
            arguments.sync()
        tensor_D_ref = self.host_reference(
            problem_size, tensor_A, tensor_B, tensor_C, alpha, beta)
        passed = self.equal(tensor_D, tensor_D_ref, problem_size)
            problem_size, true_batch_count, tensor_A, tensor_B, tensor_C, alpha, beta)
        passed = self.equal(tensor_D, tensor_D_ref, problem_size, true_batch_count)

        try:
            assert passed
@@ -494,7 +514,7 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
    if operation.A.layout in [cutlass.ColumnMajorInterleaved32, cutlass.RowMajorInterleaved32]:
        interleavedk = 32
    else:
        raise ValueError("unknonw layout")
        raise ValueError("Unknown layout")

    if testcase == "interleaved":
        modes = [cutlass.gemm.Mode.Gemm, ]
@@ -515,14 +535,22 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
        problem_beta = [0.0]
        batch_counts = [1, ]
    else:  # universal
        modes = [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.GemmSplitKParallel]
        modes = [cutlass.gemm.Mode.Gemm]
        batch_counts = [1, 2, 3, 5, 7]
        if operation.arch < 90:
            # Split K kernels via Python are currently only supported pre-SM90
            modes.append(cutlass.gemm.Mode.GemmSplitKParallel)

        problem_size_m = [alignment_m, 512 - 3 * alignment_m]
        problem_size_n = [alignment_n, 512 - 2 * alignment_n]
        if operation.tile_description.stages is None:
            stages_for_k_calc = 7
        else:
            stages_for_k_calc = operation.tile_description.stages
        problem_size_k = [
            alignment_k,
            threadblock_k * operation.tile_description.stages - alignment_k,
            threadblock_k * operation.tile_description.stages * 3 - alignment_k]
        batch_counts = [1, 2, 3, 5, 7]
            threadblock_k * stages_for_k_calc - alignment_k,
            threadblock_k * stages_for_k_calc * 3 - alignment_k]
        problem_alpha = [1.0]
        problem_beta = [2.0]

@@ -543,8 +571,17 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):

                    problem_size = cutlass.gemm.GemmCoord(m, n, k)

                    if operation.arch < 90:
                        split_k_slices = batch_count
                    else:
                        split_k_slices = 1

                    overridden_mode = mode
                    if mode == cutlass.gemm.Mode.Gemm and batch_count > 1:
                        overridden_mode = cutlass.gemm.Mode.Batched

                    passed = testbed.run(
                        mode, problem_size, batch_count, alpha, beta)
                        overridden_mode, problem_size, batch_count, split_k_slices, alpha, beta)

    err, = cudart.cudaDeviceSynchronize()
    if err != cuda.CUresult.CUDA_SUCCESS:
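The mode/split-K selection added to the test sweep above can be restated self-contained. This is a sketch only: strings stand in for the `cutlass.gemm.Mode` enum values, and the function name is illustrative.

```python
def select_mode_and_split_k(mode, batch_count, arch):
    # Mirrors the sweep above: pre-SM90 reuses batch_count as the split-K
    # slice count, and a batch_count > 1 in plain Gemm mode becomes Batched.
    split_k_slices = batch_count if arch < 90 else 1
    if mode == "Gemm" and batch_count > 1:
        mode = "Batched"
    return mode, split_k_slices
```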

tools/library/scripts/pycutlass/src/pycutlass/test/utils.py (new file, 109 lines)
@@ -0,0 +1,109 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################

import cutlass
from pycutlass import library, SubstituteTemplate


class Layout:
    """
    Utility class to map transpose and non-transpose terminology to row- and column-major terminology
    """
    T = cutlass.RowMajor
    N = cutlass.ColumnMajor


class LayoutCombination:
    """
    Utility class defining all combinations of row- and column-major layouts for operands to a GEMM
    """
    NNN = (Layout.N, Layout.N, Layout.N)
    NNT = (Layout.N, Layout.N, Layout.T)
    NTN = (Layout.N, Layout.T, Layout.N)
    NTT = (Layout.N, Layout.T, Layout.T)
    TNN = (Layout.T, Layout.N, Layout.N)
    TNT = (Layout.T, Layout.N, Layout.T)
    TTN = (Layout.T, Layout.T, Layout.N)
    TTT = (Layout.T, Layout.T, Layout.T)


def get_name(layouts, alignments, element_output,
             element_accumulator, element_epilogue, cluster_shape,
             threadblock_shape, stages, element_a, element_b, arch, opclass, suffix=""):
    """
    Generates a procedural name for a test case.

    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param element_a: data type of operand A
    :param element_b: data type of operand B
    :param arch: compute capability of kernel being generated
    :type arch: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param suffix: additional string to add to the suffix of the name
    :type suffix: str

    :return: str
    """
    name_format = 'test_SM${arch}_Device_Gemm_${eA}${lA}_${eB}${lB}_${eC}${lC}_${opclass}_${acc}_${tbM}x${tbN}x${tbK}_${cM}x${cN}x${cK}_${stages}_align${aA}-${aB}-${aC}${suffix}'
    return SubstituteTemplate(name_format,
        {
            'arch': str(arch),
            'eA': library.DataTypeNames[element_a],
            'eB': library.DataTypeNames[element_b],
            'eC': library.DataTypeNames[element_output],
            'lA': library.ShortLayoutTypeNames[layouts[0]],
            'lB': library.ShortLayoutTypeNames[layouts[1]],
            'lC': library.ShortLayoutTypeNames[layouts[2]],
            'opclass': library.OpcodeClassNames[opclass],
            'acc': library.DataTypeNames[element_accumulator],
            'cM': str(cluster_shape[0]),
            'cN': str(cluster_shape[1]),
            'cK': str(cluster_shape[2]),
            'tbM': str(threadblock_shape[0]),
            'tbN': str(threadblock_shape[1]),
            'tbK': str(threadblock_shape[2]),
            'stages': str(stages) if stages is not None else 'auto',
            'aA': str(alignments[0]),
            'aB': str(alignments[1]),
            'aC': str(alignments[2]),
            'suffix': '' if suffix is None else suffix
        }
    )
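`get_name` relies on `SubstituteTemplate` to expand `${key}` tokens. A minimal stand-in for that helper (not pycutlass's implementation) can be written with a single regex substitution:

```python
import re

def substitute_template(template, values):
    # Minimal stand-in for pycutlass's SubstituteTemplate: expands ${key}
    # tokens by looking each key up in the provided dictionary.
    return re.sub(r"\$\{(\w+)\}", lambda m: values[m.group(1)], template)

name = substitute_template("test_SM${arch}_Device_Gemm_${eA}${lA}",
                           {"arch": "80", "eA": "f16", "lA": "t"})
```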

tools/library/scripts/pycutlass/src/pycutlass/utils/datatypes.py (new file, 121 lines)
@@ -0,0 +1,121 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
"""
|
||||
Utility functions for converting between frontend datatypes and CUTLASS datatypes
|
||||
"""
|
||||
|
||||
from typing import Union, Tuple
|
||||
|
||||
import cutlass
|
||||
|
||||
import pycutlass.library as library
|
||||
|
||||
|
||||
try:
|
||||
import numpy as np
|
||||
numpy_available = True
|
||||
except ImportError:
|
||||
numpy_available = False
|
||||
|
||||
def numpy_to_cutlass(inp):
|
||||
if numpy_available:
|
||||
if inp == np.float16:
|
||||
return cutlass.float16
|
||||
elif inp == np.float32:
|
||||
return cutlass.float32
|
||||
elif inp == np.float64:
|
||||
return cutlass.float64
|
||||
elif inp == np.int8:
|
||||
return cutlass.int8
|
||||
elif inp == np.int32:
|
||||
return cutlass.int32
|
||||
return None
|
||||
|
||||
try:
|
||||
import cupy as cp
|
||||
cupy_available = True
|
||||
cupy_to_cutlass_dict = {
|
||||
cp.float16: cutlass.float16,
|
||||
cp.float32: cutlass.float32,
|
||||
cp.float64: cutlass.float64
|
||||
}
|
||||
except ImportError:
|
||||
cupy_available = False
|
||||
|
||||
def cupy_to_cutlass(inp):
|
||||
if cupy_available:
|
||||
if inp == cp.float16:
|
||||
return cutlass.float16
|
||||
elif inp == cp.float32:
|
||||
return cutlass.float32
|
||||
elif inp == cp.float64:
|
||||
return cutlass.float64
|
||||
return None
|
||||
|
||||
try:
|
||||
import torch
|
||||
torch_available = True
|
||||
torch_to_cutlass_dict = {
|
||||
torch.half: cutlass.float16,
|
||||
torch.float16: cutlass.float16,
|
||||
torch.float: cutlass.float32,
|
||||
torch.float32: cutlass.float32,
|
||||
torch.double: cutlass.float64,
|
||||
torch.float64: cutlass.float64
|
||||
}
|
||||
except ImportError:
|
||||
torch_available = False
|
||||
|
||||
def torch_to_cutlass(inp):
|
||||
if torch_available:
|
||||
return torch_to_cutlass_dict.get(inp, None)
|
||||
|


try:
    import bfloat16
    bfloat16_available = True
except ImportError:
    bfloat16_available = False


def bfloat16_to_cutlass(inp):
    """Map the bfloat16 package's type to the CUTLASS bfloat16 type, or return None."""
    if bfloat16_available:
        if inp == bfloat16.bfloat16:
            return cutlass.bfloat16
    return None


def to_cutlass(inp):
    """Convert a frontend datatype (bfloat16, CuPy, NumPy, or PyTorch) to a CUTLASS type."""
    for cvt_fn in [bfloat16_to_cutlass, cupy_to_cutlass, numpy_to_cutlass, torch_to_cutlass]:
        out = cvt_fn(inp)
        if out is not None:
            return out

    raise Exception('No available conversion from type {} to a CUTLASS type.'.format(inp))
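`to_cutlass` dispatches by trying each frontend's converter in turn and returning the first non-None result. The same fallback-chain pattern can be sketched without any of the optional packages installed, using plain strings as stand-ins for frontend and CUTLASS types (the names below are illustrative, not the real pycutlass API):

```python
# Two toy converters; each returns None for types it does not recognize,
# mirroring numpy_to_cutlass / torch_to_cutlass above.
def numpy_like(inp):
    return {"np.float32": "cutlass.float32"}.get(inp)

def torch_like(inp):
    return {"torch.float32": "cutlass.float32"}.get(inp)

def to_cutlass_demo(inp):
    # Try each converter in order; the first non-None result wins.
    for cvt_fn in (numpy_like, torch_like):
        out = cvt_fn(inp)
        if out is not None:
            return out
    raise Exception('No available conversion from type {} to a CUTLASS type.'.format(inp))

assert to_cutlass_demo("torch.float32") == "cutlass.float32"
```

Because converters for unavailable frontends simply return None, the chain degrades gracefully when CuPy, PyTorch, or bfloat16 are not installed.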
@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################

# test/unit/conv/device/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
from pycutlass.conv2d_operation import *
from pycutlass import *
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_dgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass.conv2d_operation import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass.test import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass.test import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass.conv2d_operation import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass import *
|
||||
|
||||
@ -1,3 +1,35 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
|
||||
import pycutlass
|
||||
from pycutlass.conv2d_operation import *
|
||||
|
||||
@ -1,3 +1,35 @@
# (BSD-3-Clause license header, identical to the one above)

# test/unit/conv/device/conv2d_wgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *

@ -1,3 +1,35 @@
# (BSD-3-Clause license header, identical to the one above)

import pycutlass
import unittest
from pycutlass.memory_manager import *

@ -1,3 +1,35 @@
# (BSD-3-Clause license header, identical to the one above)

pushd $CUTLASS_PATH/examples/40_cutlass_py/customizable

python gemm.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 32 32 16 -s 4 -w 2 2 1 -cc 80 -la ColumnMajor -aa 1 -lb RowMajor -ab 1 -lc RowMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1

@ -1 +1,33 @@
# (BSD-3-Clause license header, identical to the one above)

CUPY_CACHE_DIR=./ python test_frontend.py

@ -29,13 +29,15 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
## Test case for Pytorch

"""
Test cases for frontends
"""

import pycutlass
import unittest
from pycutlass import *
from pycutlass.utils.device import device_cc
import torch
import cupy as cp


class Test_Frontend(unittest.TestCase):
@ -49,9 +51,7 @@ class Test_Frontend(unittest.TestCase):
            cutlass.OpClass.Simt, MathOperation.multiply_add
        )

        # Stages > 2 is supported only for compute capability 80 and beyond
        stages = 4 if cc >= 80 else 2

        stages = 2
        tile_description = TileDescription(
            [128, 128, 8], stages, [2, 4, 1],
            math_inst
@ -84,6 +84,11 @@ class Test_Frontend(unittest.TestCase):


    def test_torch_frontend(self):
        try:
            import torch
        except:
            self.assertTrue(False, "Unable to import torch")

        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)

        tensor_A = torch.ceil(torch.empty(size=(problem_size.m(), problem_size.k()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
@ -111,6 +116,11 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(torch.equal(tensor_D, tensor_D_ref))

    def test_cupy_frontend(self):
        try:
            import cupy as cp
        except:
            self.assertTrue(False, "Unable to import cupy")

        cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)
@ -139,7 +149,6 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(cp.array_equal(tensor_D, tensor_D_ref))


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**32, 2**32)
    unittest.main()

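The frontend tests above guard optional imports with a try/except that turns a missing package into an explicit test failure rather than a collection-time error. A minimal self-contained sketch of that pattern (the class and method names here are illustrative, not from pycutlass; `importlib.util.find_spec` stands in for the bare try/except so the example does not depend on torch or cupy being installed):

```python
import importlib.util
import unittest


class FrontendAvailability(unittest.TestCase):
    """Fail with a clear message (rather than erroring) when an optional frontend is missing."""

    def _require(self, module_name):
        # Mirrors the guard in test_torch_frontend/test_cupy_frontend:
        # a missing module becomes an assertion failure with a readable message.
        if importlib.util.find_spec(module_name) is None:
            self.assertTrue(False, "Unable to import %s" % module_name)

    def test_stdlib_module_is_available(self):
        self._require("json")  # stdlib module, expected to be present


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(FrontendAvailability).run(result)
```

This keeps the rest of the suite runnable on machines without a given frontend, while still surfacing the missing dependency in the test report.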
@ -1,3 +1,35 @@
# (BSD-3-Clause license header, identical to the one above)

import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -92,5 +124,5 @@ class GemmBF16TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "multistage"))

if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

tools/library/scripts/pycutlass/test/gemm/gemm_bf16_sm90.py (new file)
@ -0,0 +1,138 @@
# (BSD-3-Clause license header, identical to the one above)

from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest

from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc


name_fn = partial(get_name, element_a=cutlass.bfloat16, element_b=cutlass.bfloat16, arch=90)

def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.bfloat16
        element_B = cutlass.bfloat16
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )

        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )

        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])

        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)

        swizzling_functor = cutlass.IdentitySwizzle1

        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)

        self.assertTrue(test_all_gemm(operation, "universal"))

    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""

    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)

    return run


@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmBF16Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass


add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)

add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [4, 4, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 5)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None, persistent=True)
add_test_simt(GemmBF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
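The SM90 test files above register their cases dynamically: `add_test` builds a `run` function, derives a descriptive name from the configuration, and attaches the function to the `unittest.TestCase` subclass via `setattr`, with `functools.partial` pre-binding the shared arguments. A minimal, self-contained sketch of that registration pattern (all names here are illustrative, not from pycutlass):

```python
import unittest
from functools import partial


def add_test(cls, m, n):
    """Create a test method for the given problem sizes and attach it to `cls`."""
    def run(self):
        # Stand-in assertion; the real suites construct and verify a GEMM here.
        self.assertEqual(len(range(m)) * len(range(n)), m * n)

    name = "test_gemm_%dx%d" % (m, n)
    setattr(cls, name, run)  # unittest's loader discovers it like a hand-written test
    return run


class GemmDemo(unittest.TestCase):
    """Wrapper class to which tests are added dynamically, as above."""
    pass


# Pre-bind the shared argument, analogous to add_test_tensorop/add_test_simt.
add_demo = partial(add_test, GemmDemo)
add_demo(128, 128)
add_demo(64, 256)
```

Because the methods exist on the class before `unittest.main()` runs, the default loader picks them up exactly as if they had been written by hand, and each parameter combination gets its own named, individually reportable test.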
@ -1,3 +1,35 @@
# (BSD-3-Clause license header, identical to the one above)

import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -443,5 +475,5 @@ class GemmF16Sm80(unittest.TestCase):


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

tools/library/scripts/pycutlass/test/gemm/gemm_f16_sm90.py (new file)
@ -0,0 +1,182 @@
# (BSD-3-Clause license header, identical to the one above)

from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest

from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc


# Partial specialization for naming tests
name_fn = partial(get_name, element_a=cutlass.float16, element_b=cutlass.float16, arch=90)


def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """

        element_A = cutlass.float16
        element_B = cutlass.float16
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )

        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )

        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])

        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)

        swizzling_functor = cutlass.IdentitySwizzle1

        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)

        self.assertTrue(test_all_gemm(operation, "universal"))

    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""

    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)

    return run


@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmF16Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass


add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)

# Tests with 1x1x1 clusters
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], 5)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [2, 2, 2], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)

# Tests with different cluster shapes
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 2, 1], [64, 128, 64], None)

# Tests for persistent warp-specialized threadblocks
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 2, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None, persistent=True)
|
||||
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None, persistent=True)
|
||||
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 4, 1], [64, 128, 64], None, persistent=True)
|
||||
|
||||
# Tests using SIMT
|
||||
add_test_simt(GemmF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)
|
||||
add_test_simt(GemmF16Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 8], 2)
|
||||
add_test_simt(GemmF16Sm90, LayoutCombination.NTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 8], 2)
|
||||
add_test_simt(GemmF16Sm90, LayoutCombination.TTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 8], 2)
|
||||
add_test_simt(GemmF16Sm90, LayoutCombination.NNT, [1, 1, 1], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 8], 2)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
pycutlass.get_memory_pool(2**30, 2**30)
|
||||
unittest.main()
|
||||
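The pool sizes passed to `pycutlass.get_memory_pool` across these diffs are plain powers of two; the updated tests uniformly reserve `2**30` bytes (1 GiB) in place of the older `2**24` (16 MiB) or `2**26` (64 MiB). A quick sanity check of that arithmetic, in plain Python with no pycutlass dependency:

```python
# Byte sizes used by the test suites, expressed in MiB.
OLD_SMALL = 2**24   # 16 MiB, used by some pre-update tests
OLD_MED = 2**26     # 64 MiB
NEW = 2**30         # 1 GiB, used uniformly after the update

for size in (OLD_SMALL, OLD_MED, NEW):
    print(size // (1024 ** 2), "MiB")
```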
@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################

import pycutlass
from pycutlass import *
from pycutlass.memory_manager import get_allocated_size
@ -1,3 +1,35 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -98,5 +130,5 @@ class GemmF64TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "universal"))

if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
tools/library/scripts/pycutlass/test/gemm/gemm_f64_sm90.py (new file, 124 lines)
@ -0,0 +1,124 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest

from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc


name_fn = partial(get_name, element_a=cutlass.float64, element_b=cutlass.float64, arch=90)

def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.float64
        element_B = cutlass.float64
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )

        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst
        )

        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])

        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)

        swizzling_functor = cutlass.IdentitySwizzle1

        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)

        self.assertTrue(test_all_gemm(operation, "universal"))

    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass)
    setattr(cls, name, run)

    return run


@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmF64Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass


add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
add_test_simt(GemmF64Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float64, cutlass.float64, cutlass.float64, [1, 1, 1], [64, 64, 32], 2)


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
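The `add_test` helpers in these files all rely on one standard pattern: build a closure over the test parameters, then attach it to an empty `TestCase` subclass with `setattr` under a `test_*` name so `unittest` collects it. A minimal self-contained sketch of that pattern (plain `unittest`, no pycutlass; the names `make_case` and `WrapperTest` are illustrative, not from the source):

```python
import unittest

class WrapperTest(unittest.TestCase):
    """Empty wrapper; test methods are attached dynamically, as in GemmF64Sm90."""
    pass

def make_case(cls, lhs, rhs, expected):
    # Build a closure over the parameters, mirroring add_test's inner run().
    def run(self):
        self.assertEqual(lhs + rhs, expected)
    # The attribute name must start with "test_" for unittest to collect it.
    name = f"test_add_{lhs}_{rhs}"
    setattr(cls, name, run)
    return run

make_case(WrapperTest, 1, 2, 3)
make_case(WrapperTest, 10, 20, 30)

# Loading the wrapper class now picks up both dynamically added methods.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WrapperTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)  # 2
```

Returning `run` from the helper (as `add_test` does) additionally lets callers wrap or reuse the generated function.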
@ -1,3 +1,35 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -199,5 +231,5 @@ class GemmGroupedSm80(unittest.TestCase):


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**26, 2**26)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
@ -1,3 +1,35 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

import pycutlass
from pycutlass import *
from pycutlass.epilogue import LinearCombinationClamp
@ -225,5 +257,5 @@ class GemmS8TensorOpF32Sm80(unittest.TestCase):


if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
tools/library/scripts/pycutlass/test/gemm/gemm_s8_sm90.py (new file, 154 lines)
@ -0,0 +1,154 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest

from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc


name_fn = partial(get_name, element_a=cutlass.int8, element_b=cutlass.int8, arch=90)

def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.int8
        element_B = cutlass.int8
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )

        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )

        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])

        if opclass == cutlass.OpClass.Simt:
            epilogue_functor_cls = LinearCombinationClamp
        else:
            epilogue_functor_cls = LinearCombination
        epilogue_functor = epilogue_functor_cls(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)

        swizzling_functor = cutlass.IdentitySwizzle1

        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)

        self.assertTrue(test_all_gemm(operation, "universal"))

    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""

    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)

    return run


@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmS8Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass


add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)

# Tests with 1x1x1 clusters
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNN, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], 3)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 8], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 64, 32], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [4, 4, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)

# Tests with different cluster shapes
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 2, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 4, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [4, 4, 1], [128, 128, 128], None)

# Tests with persistent warp-specialized threadblocks
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 1, 1], [128, 128, 128], None, persistent=True)

# Tests for SIMT
add_test_simt(GemmS8Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 32, 8], 2)

if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
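`add_test_tensorop` and `add_test_simt` above are just `functools.partial` specializations of the single `add_test` function, freezing `opclass` once so every call site only lists what varies. A small sketch of that mechanism (illustrative stand-in function, not the real `add_test`):

```python
from functools import partial

def add_test(cls_name, layouts, opclass, persistent=False):
    # Stand-in for the real add_test: just echo the full configuration.
    return (cls_name, layouts, opclass, persistent)

# Freeze the opclass argument once, as the test files do.
add_test_tensorop = partial(add_test, opclass="TensorOp")
add_test_simt = partial(add_test, opclass="Simt")

# Call sites now only spell out the varying parameters; keyword
# arguments such as persistent=True still pass through unchanged.
print(add_test_tensorop("GemmS8Sm90", "TNT"))
print(add_test_tensorop("GemmS8Sm90", "TNT", persistent=True))
print(add_test_simt("GemmS8Sm90", "TNN"))
```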
@ -1,8 +1,40 @@
# (standard CUTLASS BSD-3-Clause license header, identical to the one above)

import pycutlass
import unittest

if __name__ == '__main__':
    pycutlass.get_memory_pool(2**26, 2**26)
    pycutlass.get_memory_pool(2**30, 2**30)
    loader = unittest.TestLoader()
    tests = loader.discover('./', 'gemm_*.py')
    testRunner = unittest.runner.TextTestRunner()