CUTLASS 3.0.0 (#786)

* CUTLASS 3.0.0
Vijay Thakkar
2023-01-23 17:55:28 -08:00
committed by GitHub
parent 66d9cddc83
commit 277bd6e537
377 changed files with 76396 additions and 1186 deletions

@ -81,13 +81,24 @@ The tiling size of above operations can also be customized.
## Installation
### Using Docker
You can run the PyCUTLASS on NGC PyTorch container.
We recommend using one of our provided Docker images for using PyCUTLASS.
**To run CUTLASS 3 GEMM kernels targeting the NVIDIA Hopper architecture via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda12.0) based on the NGC CUDA 12.0 container:
```shell
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.09-py3
docker build -t pycutlass-cuda12.0:latest -f docker/Dockerfile-cuda12.0 .
docker run --gpus all -it --rm pycutlass-cuda12.0:latest
```
Note that this Docker container does not include CuPy or PyTorch, and, thus, will not be able to run PyCUTLASS examples that
leverage these packages.
**To run CUTLASS 2.x kernels targeting pre-SM90 architectures via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda11.8-pytorch) based on an NGC PyTorch container:
```shell
docker build -t pycutlass-cuda11.8-pytorch:latest -f docker/Dockerfile-cuda11.8-pytorch .
docker run --gpus all -it --rm pycutlass-cuda11.8-pytorch:latest
```
### Environment variables
PyCUTLASSS requires two environment variables:
PyCUTLASS requires two environment variables:
* `CUTLASS_PATH`: the root directory of CUTLASS. You can set this from the location at which you cloned CUTLASS via: `export CUTLASS_PATH=$(pwd)`.
* `CUDA_INSTALL_PATH`: the directory where the CUDA toolkit is installed. If running in bash with `nvcc` installed under a CUDA toolkit, you can set this to the location of your `nvcc` installation via: `export CUDA_INSTALL_PATH=$(which nvcc | awk -F'/bin/nvcc' '{print $1}')`
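
Before importing PyCUTLASS, it can help to confirm that both variables are set and point at real directories. The following is a minimal sketch, not part of PyCUTLASS: the helper name `missing_env_dirs` is hypothetical, and it only checks that each path is a directory, not that it contains a valid CUTLASS checkout or CUDA toolkit.

```python
import os
from pathlib import Path

# The two variables named in the list above.
REQUIRED_VARS = ("CUTLASS_PATH", "CUDA_INSTALL_PATH")


def missing_env_dirs(env=None):
    """Return the required variables that are unset or do not name a directory.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value or not Path(value).is_dir():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_env_dirs():
        print(f"{name} is unset or is not a directory")
```

Running the script prints one line per problem variable and prints nothing when both are usable.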

@ -1,4 +1,36 @@
pip install pybind11
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
pip install -U pybind11
git clone https://github.com/google/googletest.git
python setup.py install
python setup.py develop --user
python setup.py rmm

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
pip install enum-tools
pip install sphinx-toolbox
pip install m2r2

@ -0,0 +1,40 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
FROM nvcr.io/nvidia/pytorch:22.11-py3
RUN chmod ugo+rwx /home
RUN pip uninstall -y rmm
RUN pip install rmm-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
ENV CUDA_INSTALL_PATH=/usr/local/cuda

@ -0,0 +1,46 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
FROM nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu20.04
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get install -y git cmake vim python3 python3-pip
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN chmod ugo+rwx /home
RUN pip install numpy==1.23
RUN pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
RUN pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
RUN pip install cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LIBRARY_PATH
ENV CUDA_INSTALL_PATH=/usr/local/cuda

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import distutils.cmd
from setuptools import setup
import setuptools.command.build_py
@ -15,7 +47,7 @@ class BuildRMM(distutils.cmd.Command):
import rmm
except ImportError:
print("installing rmm")
os.system("git clone -b branch-22.08 --recurse-submodules https://github.com/rapidsai/rmm.git")
os.system("git clone -b branch-22.10 --recurse-submodules https://github.com/rapidsai/rmm.git")
os.chdir("./rmm")
os.system("./build.sh librmm rmm")
os.chdir("./python")
@ -43,7 +75,11 @@ try:
Pybind11Extension("cutlass",
["src/cpp/cutlass.cpp"],
include_dirs=include_dirs,
extra_compile_args=["-fpermissive", "-w"])
extra_compile_args=["-fpermissive", "-w", "-std=c++17"]),
Pybind11Extension("cute",
["src/cpp/cute.cpp"],
include_dirs=include_dirs,
extra_compile_args=["-fpermissive", "-w", "-std=c++17"])
]
except ImportError:
pass
@ -65,7 +101,7 @@ setup(
install_requires=[
"numpy<1.23",
'pybind11',
'cuda-python<11.7.0',
'cuda-python>=11.8.0',
'typeguard',
'bfloat16',
'typing',

@ -0,0 +1,54 @@
/***************************************************************************************************
* Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/* \file
\brief binding CuTe C++ APIs to Python
*/
#include <pybind11/pybind11.h>
#include <pybind11/stl_bind.h>
#include "cute/arch/mma_sm90_gmma.hpp"
namespace py = pybind11;
PYBIND11_MODULE(cute, m) {
// module doc
m.doc() = "CuTe C++ bindings";
py::enum_<cute::GMMA::Major>(m, "GMMAMajor",
R"pbdoc(classification of CuTe GMMA tensor major specification)pbdoc")
.value("K", cute::GMMA::Major::K,
R"pbdoc(Tensor is contiguous in reduction dimension)pbdoc")
.value("MN", cute::GMMA::Major::MN,
R"pbdoc(Tensor is contiguous in non-reduction dimension)pbdoc");
}

@ -29,8 +29,9 @@
*
**************************************************************************************************/
/* \file
\brief binding cutlass C++ APIs to python
\brief binding CUTLASS C++ APIs to Python
*/
#include <pybind11/pybind11.h>
#include <pybind11/stl_bind.h>

@ -34,6 +34,7 @@
\brief A generic wrapper around an epilogue visitor operation
*/
#pragma once
#include "cutlass/cutlass.h"

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Binary operations to be used within the epilogue visitor model.
\brief A file contains the binary ops
*/
#pragma once
@ -44,7 +44,7 @@ namespace cutlass {
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Elementwise addition of two arrays
/// Scalar multiplication
template <typename T, int N>
struct VectorAdd {

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Unary operations to be used within the epilogue visitor model.
\brief A file contains the unary ops
*/
#pragma once

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that simply returns the accumulator
\brief A file contains the epilogue visitor Op with accumulator
*/
#pragma once

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operator performing a binary operation between two visitor nodes
\brief A file contains the epilogue visitor Op with Binary op
*/
#pragma once
@ -84,7 +84,6 @@ public:
/// Fragment type of accumulator
using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;
/// Combination Op TODO: generalize this
using BinaryOp = BinaryOp_<ElementCompute, kElementsPerAccess>;
static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that broadcasts a vector to all columns
\brief A file contains the epilogue visitor Op with broadcasting vector to all columns
*/
#pragma once

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that performs a column-wise reduction within a threadblock
\brief A file contains the epilogue visitor Op with reduction over columns in CTA
*/
#pragma once
@ -68,7 +68,6 @@ public:
static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;
// TODO: generalize the reduction op
using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
using ElementOutput = typename OutputTileIterator::Element;

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that performs a linear combination of two visitor nodes
\brief A file contains the epilogue visitor Op with Linear Combination
*/
#pragma once
@ -82,7 +82,7 @@ public:
/// Fragment type of accumulator
using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;
/// Combination Op TODO: generalize this
/// Combination Op
using CombinationOp = cutlass::plus<VisitAccessType>;
static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that broadcasts a vector to all rows
\brief A file contains the epilogue visitor Op with broadcasting vector to all rows
*/
#pragma once

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operation that performs a column-wise reduction within a threadblock
\brief A file contains the epilogue visitor Op with reduction over rows in CTA
*/
#pragma once
@ -69,7 +69,6 @@ public:
static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;
// TODO: generalize the reduction op
using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
using ElementOutput = typename OutputTileIterator::Element;

@ -30,8 +30,8 @@
**************************************************************************************************/
/*! \file
\brief Epilogue visitor operator performing a unary operation atop a visitor node
\brief A file contains the epilogue visitor Op with Unary operation
*/
#pragma once
@ -79,7 +79,7 @@ public:
/// Fragment type of accumulator
using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;
/// Combination Op TODO: generalize this
/// Combination Op
using UnaryOp = UnaryOp_<ElementCompute, kElementsPerAccess>;
static_assert(kElementsPerAccess==VisitAccessTypeVisitor::kElements, "kElementsPerAccess mismatches with Visitor");

@ -30,7 +30,7 @@
**************************************************************************************************/
/*! \file
\brief
\brief
*/
#pragma once
@ -139,8 +139,8 @@ public:
//
// Methods
//
Arguments():
Arguments():
ptr_A(nullptr), ptr_B(nullptr), ptr_C(nullptr), ptr_D(nullptr),
ptr_gather_A_indices(nullptr),
ptr_gather_B_indices(nullptr),
@ -169,8 +169,8 @@ public:
int const *ptr_scatter_D_indices = nullptr
):
UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
epilogue_visitor(epilogue_visitor),
ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
epilogue_visitor(epilogue_visitor),
ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
stride_a(stride_a), stride_b(stride_b), stride_c(stride_c), stride_d(stride_d),
ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@ -205,8 +205,8 @@ public:
int const *ptr_scatter_D_indices = nullptr
):
UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
epilogue_visitor(epilogue_visitor),
ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
epilogue_visitor(epilogue_visitor),
ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
lda(lda), ldb(ldb), ldc(ldc), ldd(ldd),
ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@ -221,7 +221,7 @@ public:
/// Returns arguments for the transposed problem
Arguments transposed_problem() const {
Arguments args(*this);
std::swap(args.problem_size.m(), args.problem_size.n());
std::swap(args.ptr_A, args.ptr_B);
std::swap(args.lda, args.ldb);
@ -256,7 +256,7 @@ public:
typename Mma::IteratorB::Params params_B;
typename EpilogueVisitor::OutputTileIterator::Params params_C;
typename EpilogueVisitor::OutputTileIterator::Params params_D;
typename EpilogueVisitor::Params epilogue_visitor;
void * ptr_A;
@ -325,7 +325,7 @@ public:
batch_stride_C = args.batch_stride_C;
epilogue_visitor = args.epilogue_visitor;
semaphore = static_cast<int *>(workspace);
CUTLASS_TRACE_HOST("GemmUniversal::Params::update()");
}
@ -345,7 +345,7 @@ public:
//
CUTLASS_DEVICE
GemmUniversalwithEpilogueVisitor() { }
GemmUniversalwithEpilogueVisitor() { }
/// Determines whether kernel satisfies alignment
static Status can_implement(
@ -455,12 +455,12 @@ public:
//
// Fetch pointers based on mode.
//
if (params.mode == GemmUniversalMode::kGemm ||
if (params.mode == GemmUniversalMode::kGemm ||
params.mode == GemmUniversalMode::kGemmSplitKParallel) {
if (threadblock_tile_offset.k() + 1 < params.grid_tiled_shape.k()) {
problem_size_k = (threadblock_tile_offset.k() + 1) * params.gemm_k_size;
problem_size_k = (threadblock_tile_offset.k() + 1) * params.gemm_k_size;
}
offset_k = threadblock_tile_offset.k() * params.gemm_k_size;
@ -529,10 +529,10 @@ public:
// Compute threadblock-scoped matrix multiply-add
mma(
gemm_k_iterations,
accumulators,
iterator_A,
iterator_B,
gemm_k_iterations,
accumulators,
iterator_A,
iterator_B,
accumulators);
//
@ -555,30 +555,16 @@ public:
int block_idx = threadblock_tile_offset.m() + threadblock_tile_offset.n() * params.grid_tiled_shape.m();
ElementC *ptr_C = static_cast<ElementC *>(params.ptr_C);
ElementC *ptr_C = static_cast<ElementC *>(params.ptr_C);
ElementC *ptr_D = static_cast<ElementC *>(params.ptr_D);
//
// Fetch pointers based on mode.
//
// Construct the semaphore.
Semaphore semaphore(params.semaphore + block_idx, thread_idx);
// if (params.mode == GemmUniversalMode::kGemm) {
// // TODO: fix this order
// // If performing a reduction via split-K, fetch the initial synchronization
// if (params.grid_tiled_shape.k() > 1) {
// // Fetch the synchronization lock initially but do not block.
// semaphore.fetch();
// // Indicate which position in a serial reduction the output operator is currently updating
// output_op.set_k_partition(threadblock_tile_offset.k(), params.grid_tiled_shape.k());
// }
// }
// Tile iterator loading from source tensor.
EpilogueVisitor epilogue_visitor(
@ -590,9 +576,6 @@ public:
params.problem_size.mn()
);
// if (params.mode == GemmUniversalMode::kGemmSplitKParallel) {
// ptr_D += threadblock_tile_offset.k() * params.batch_stride_D;
// }
if (params.mode == GemmUniversalMode::kBatched || params.mode == GemmUniversalMode::kArray) {
epilogue_visitor.set_batch_index(threadblock_tile_offset.k());
}
@ -605,25 +588,20 @@ public:
// Wait on the semaphore - this latency may have been covered by iterator construction
if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {
// For subsequent threadblocks, the source matrix is held in the 'D' tensor.
// TODO: ???
// if (threadblock_tile_offset.k()) {
// iterator_C = iterator_D;
// }
// For subsequent threadblocks, the source matrix is held in the 'D' tensor.
semaphore.wait(threadblock_tile_offset.k());
}
// Execute the epilogue operator to update the destination tensor.
epilogue(epilogue_visitor, accumulators);
epilogue(epilogue_visitor, accumulators);
//
// Release the semaphore
//
if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {
if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {
int lock = 0;
if (params.grid_tiled_shape.k() == threadblock_tile_offset.k() + 1) {
@ -635,7 +613,7 @@ public:
// Otherwise, the semaphore is incremented
lock = threadblock_tile_offset.k() + 1;
}
semaphore.release(lock);
}
}

@ -83,7 +83,6 @@ void bind_identity_swizzle(py::module & m, std::string name) {
:param problem_size: Implicit gemm problem size conv_operator(NZPQK, NDHWC, KTRSC)
:type problem_size: :class:`cutlass.gemm.GemmCoord`)
)pbdoc")
// TODO: the returned dim3 is not usable in python
.def("get_grid_shape", &T::get_grid_shape,
py::arg("tiled_shape"),
R"pbdoc(Computes CUDA grid dimensions given a size in units of logical tiles)pbdoc")

@ -31,6 +31,7 @@ from pycutlass.utils import *
from pycutlass.frontend import *
from pycutlass.reduction_operation import *
from pycutlass.compiler import *
from pycutlass.utils.device import device_cc
# module-wide variables
@ -40,6 +41,12 @@ this = sys.modules[__name__]
# artifact manager
this.compiler = ArtifactManager()
try:
if not hasattr(this, 'DEVICE_CC') or this.DEVICE_CC is None:
this.DEVICE_CC = device_cc()
except:
this.DEVICE_CC = None
def get_memory_pool(init_pool_size=0, max_pool_size=2**34):
this.memory_pool = PoolMemoryManager(
init_pool_size=init_pool_size,

@ -0,0 +1,395 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
"""
Utilities for stamping out collective mainloops for SM90 kernels
"""
import cute
import cutlass
from pycutlass import SubstituteTemplate
import pycutlass.library as library
tma_alignment_bytes = 16
cp_async_min_alignment_bytes = 4
class RowColMajorToGMMAMajor:
@staticmethod
def A(layout, element):
"""
Converts operand A's layout from row/column major format into CuTe's GMMA major format
:param layout: layout of the A operand
:type layout: cutlass.RowMajor or cutlass.ColumnMajor
:param element: data type of the A operand
:return: C++ CuTe GMMA major format
:rtype: cute.GMMAMajor
"""
type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
if layout == cutlass.ColumnMajor and not type_requires_k_major:
return cute.GMMAMajor.MN
else:
return cute.GMMAMajor.K
@staticmethod
def B(layout, element):
"""
Converts operand B's layout from row/column major format into CuTe's GMMA major format
:param layout: layout of the B operand
:type layout: cutlass.RowMajor or cutlass.ColumnMajor
:param element: data type of the B operand
:return: C++ CuTe GMMA major format
:rtype: cute.GMMAMajor
"""
type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
if layout == cutlass.RowMajor and not type_requires_k_major:
return cute.GMMAMajor.MN
else:
return cute.GMMAMajor.K
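
The two static methods above follow one rule: an operand is MN-major only when its layout makes the non-reduction dimension contiguous (column-major for the M x K operand A, row-major for the K x N operand B) and its element type is not restricted to K-major. A standalone sketch of that rule follows. The `Layout` and `GMMAMajor` enums and the string element names are stand-ins introduced here so the example runs without the compiled `cute` and `cutlass` extension modules.

```python
from enum import Enum


class Layout(Enum):
    """Stand-in for cutlass.RowMajor / cutlass.ColumnMajor (assumption)."""
    RowMajor = "row"
    ColumnMajor = "col"


class GMMAMajor(Enum):
    """Stand-in for cute.GMMAMajor (assumption)."""
    K = "K"
    MN = "MN"


# Stand-ins for cutlass.tfloat32 and cutlass.int8, the two types the
# methods above treat as requiring K-major operands.
K_MAJOR_ONLY = {"tf32", "s8"}


def gmma_major(operand, layout, element):
    """Map a row/column-major layout to a GMMA major for operand "A" or "B"."""
    # A is M x K: column-major makes M contiguous.
    # B is K x N: row-major makes N contiguous.
    mn_contiguous = Layout.ColumnMajor if operand == "A" else Layout.RowMajor
    if layout == mn_contiguous and element not in K_MAJOR_ONLY:
        return GMMAMajor.MN
    return GMMAMajor.K


if __name__ == "__main__":
    for operand in ("A", "B"):
        for layout in Layout:
            for element in ("f16", "tf32"):
                print(operand, layout.name, element, gmma_major(operand, layout, element).name)
```
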
def cluster_shape_to_tma(dim):
"""
Returns the TMA copy type for a given cluster dimension
:param dim: a given dimension of a cluster
:type dim: int
:return: C++ TMA copy type
:rtype: str
"""
return 'cute::SM90_TMA_LOAD' if dim == 1 else 'cute::SM90_TMA_LOAD_MULTICAST'
def make_cpasync_gmem_tiled_copy(thread_count, element, alignment, gmma_layout, dim_mn, dim_k):
"""
Returns a `make_tiled_copy` call for a given configuration
:param thread_count: number of threads in the threadblock
:type thread_count: int
:param element: datatype of the operand in question
:param alignment: byte alignment of the operand in question
:type alignment: int
:param gmma_layout: GMMA layout of the operand in question
:type gmma_layout: cute.GMMAMajor
:param dim_mn: extent of the M/N dimension of the tile
:type dim_mn: int
:param dim_k: extent of the reduction dimension of the tile
:type dim_k: int
:return: C++ call to `make_tiled_copy`
:rtype: str
"""
emission_str = """decltype(cute::make_tiled_copy(
cute::Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<cute::uint_byte_t<static_cast<int>(sizeof(${element})) * ${alignment}>>, ${element}>{},
cute::Layout<cute::Shape<_${shape0_x}, _${shape0_y}>,
cute::Stride<_${stride_x}, _${stride_y}>>{},
cute::Layout<cute::Shape<_${shape1_x}, _${shape1_y}>>{}))"""
if gmma_layout == cute.GMMAMajor.K:
threads_major = dim_k // alignment
threads_minor = thread_count // threads_major
values = {
'shape0_x': str(threads_minor),
'shape0_y': str(threads_major),
'stride_x': str(threads_major),
'stride_y': '1',
'shape1_x': '1',
'shape1_y': str(alignment)
}
elif gmma_layout == cute.GMMAMajor.MN:
threads_major = dim_mn // alignment
threads_minor = thread_count // threads_major
values = {
'shape0_x': str(threads_major),
'shape0_y': str(threads_minor),
'stride_x': '1',
'stride_y': str(threads_major),
'shape1_x': str(alignment),
'shape1_y': '1'
}
else:
raise Exception('Unexpected GMMA layout {}'.format(gmma_layout))
# Add common values
values['element'] = library.DataTypeTag[element]
values['alignment'] = str(alignment)
return SubstituteTemplate(emission_str, values)
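# Illustration only: a worked example of the thread/value layout arithmetic used by
# make_cpasync_gmem_tiled_copy above. The helper name `cpasync_thread_value_layout` and the
# boolean `k_major` flag are hypothetical; the helper returns plain tuples instead of
# emitting C++.

```python
def cpasync_thread_value_layout(thread_count, alignment, k_major, dim_mn, dim_k):
    """Return (thread_shape, thread_stride, value_shape) as plain tuples."""
    if k_major:
        # Threads along K each copy `alignment` contiguous elements.
        threads_major = dim_k // alignment
        threads_minor = thread_count // threads_major
        return ((threads_minor, threads_major), (threads_major, 1), (1, alignment))
    # MN-major: the contiguous dimension is M (or N) instead of K.
    threads_major = dim_mn // alignment
    threads_minor = thread_count // threads_major
    return ((threads_major, threads_minor), (1, threads_major), (alignment, 1))


# 128 threads, 8-element vectors, a 128 (M or N) x 64 (K) tile.
k_major_layout = cpasync_thread_value_layout(128, 8, True, 128, 64)
mn_major_layout = cpasync_thread_value_layout(128, 8, False, 128, 64)
print(k_major_layout)   # ((16, 8), (8, 1), (1, 8))
print(mn_major_layout)  # ((16, 8), (1, 16), (8, 1))
```

# In both cases the unit stride is placed along the contiguous dimension, so each thread
# issues one vectorized copy of `alignment` elements.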
def max_stages(op, arch):
"""
Returns the maximum number of pipeline stages that can be used for an operation.
:param op: operation for which the maximum stages should be computed. If stages are
set via the `op.tile_description.stages` parameter, this setting is ignored
in the present calculation
:type op: pycutlass.GemmOperation
:param arch: compute capability of the device on which the operation will be run
:type arch: int
:return: maximum number of pipeline stages that can be used for an operation
:rtype: int
"""
smem_per_stage = library.CalculateSmemUsagePerStage(op)
smem_capacity = library.SharedMemPerCC[arch]
return int(smem_capacity // smem_per_stage)
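# Illustration only: a sketch of the division performed by max_stages above. The per-stage
# formula (one A tile plus one B tile) is an assumption, since library.CalculateSmemUsagePerStage
# is not shown in this diff, and the 227 KiB capacity is an assumed value chosen for the
# example rather than one read from library.SharedMemPerCC.

```python
def max_stages_sketch(tile_m, tile_n, tile_k, bytes_a, bytes_b, smem_capacity_bytes):
    # Assumed per-stage footprint: one M x K tile of A plus one N x K tile of B.
    smem_per_stage = tile_m * tile_k * bytes_a + tile_n * tile_k * bytes_b
    # Integer division: a partially fitting stage does not count.
    return smem_capacity_bytes // smem_per_stage


# 128x128x64 tile with 2-byte operands: 32768 bytes per stage.
stages = max_stages_sketch(128, 128, 64, 2, 2, 227 * 1024)
print(stages)  # 7
```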
class LayoutToStride:
_variable_first = 'cute::Stride<int64_t, cute::Int<1>, int64_t>'
_variable_last = 'cute::Stride<cute::Int<1>, int64_t, int64_t>'
@staticmethod
def A(layout):
"""
Returns the CuTe stride type corresponding to the layout of operand A
:param layout: layout of the A operand
:type layout: cutlass.RowMajor or cutlass.ColumnMajor
:return: C++ declaration of CuTe stride
:rtype: str
"""
if layout == cutlass.RowMajor:
return LayoutToStride._variable_first
elif layout == cutlass.ColumnMajor:
return LayoutToStride._variable_last
else:
raise Exception('Unsupported layout {}'.format(layout))
@staticmethod
def B(layout):
"""
Returns the CuTe stride type corresponding to the layout of operand B
:param layout: layout of the B operand
:type layout: cutlass.RowMajor or cutlass.ColumnMajor
:return: C++ declaration of CuTe stride
:rtype: str
"""
if layout == cutlass.RowMajor:
return LayoutToStride._variable_last
elif layout == cutlass.ColumnMajor:
return LayoutToStride._variable_first
else:
raise Exception('Unsupported layout {}'.format(layout))
EMISSION_STR = """
using TileShape_MNK = cute::Shape<_${threadblock_shape_m}, _${threadblock_shape_n}, _${threadblock_shape_k}>;
using ClusterShape_MNK = cute::Shape<_${cluster_shape_m}, _${cluster_shape_n}, _${cluster_shape_k}>;
using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector<
${internal_element_A}, ${internal_element_B}, ${element_accumulator}, TileShape_MNK, ${gmma_layout_A}, ${gmma_layout_B}>()));
using SmemLayoutAtomA = decltype(cute::GMMA::smem_selector<${gmma_layout_A}, ${internal_element_A}, _${threadblock_shape_m}, _${threadblock_shape_k}>());
using SmemLayoutAtomB = decltype(cute::GMMA::smem_selector<${gmma_layout_B}, ${internal_element_B}, _${threadblock_shape_n}, _${threadblock_shape_k}>());
using CollectiveOp = typename cutlass::gemm::collective::CollectiveMma<
${mainloop_type}<${stage_count}, ClusterShape_MNK${kernel_schedule}>,
TileShape_MNK,
${element_A},
${stride_A},
${element_B},
${stride_B},
TiledMma,
${gmem_tiled_copy_A},
SmemLayoutAtomA,
void, // GMMA_SS does not need an SmemCopyAtom
${transform_A},
${gmem_tiled_copy_B},
SmemLayoutAtomB,
void, // GMMA_SS does not need an SmemCopyAtom
${transform_B}
>;
"""
def internal_element(element):
"""
Returns the data type internally used for `element`.
:param element: data type
:return: data type used internally
"""
return cutlass.tfloat32 if element == cutlass.float32 else element
def common_values(op, stage_count, transform_A, transform_B):
"""
Returns a dictionary containing common values to be substituted in the emission of the
collective operation declaration. Values specific to a particular collective operation
should be added to these.
:param op: GEMM operation for which to build a collective operation
:type op: pycutlass.GemmOperation
:param stage_count: number of pipeline stages to use in the operation
:type stage_count: int
:param transform_A: transformation to perform on the A operand
:type transform_A: str
:param transform_B: transformation to perform on the B operand
:type transform_B: str
:return: dictionary containing values to substitute in emission string
:rtype: dict
"""
internal_element_a = internal_element(op.A.element)
internal_element_b = internal_element(op.B.element)
return {
'threadblock_shape_m': str(op.tile_description.threadblock_shape[0]),
'threadblock_shape_n': str(op.tile_description.threadblock_shape[1]),
'threadblock_shape_k': str(op.tile_description.threadblock_shape[2]),
'cluster_shape_m': str(op.tile_description.cluster_shape[0]),
'cluster_shape_n': str(op.tile_description.cluster_shape[1]),
'cluster_shape_k': str(op.tile_description.cluster_shape[2]),
'element_A': library.DataTypeTag[op.A.element],
'element_B': library.DataTypeTag[op.B.element],
'internal_element_A': library.DataTypeTag[internal_element_a],
'internal_element_B': library.DataTypeTag[internal_element_b],
'element_accumulator': library.DataTypeTag[op.accumulator_type()],
'gmma_layout_A': library.CuTeLayoutTag[RowColMajorToGMMAMajor.A(op.A.layout, internal_element_a)],
'gmma_layout_B': library.CuTeLayoutTag[RowColMajorToGMMAMajor.B(op.B.layout, internal_element_b)],
'stride_A': LayoutToStride.A(op.A.layout),
'stride_B': LayoutToStride.B(op.B.layout),
'stage_count': str(stage_count),
'transform_A': transform_A,
'transform_B': transform_B
}
def build_gmma_tma(op):
"""
Builds a collective operation declaration targeting TMA GMMA kernels
:param op: GEMM operation for which to build a collective operation
:type op: pycutlass.GemmOperation
:return: string containing the C++ declaration of collective operation
:rtype: str
"""
A_tma_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % tma_alignment_bytes == 0
B_tma_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % tma_alignment_bytes == 0
if not A_tma_aligned or not B_tma_aligned:
raise Exception('Both the A and B operands must be aligned to {} bytes to use TMA'.format(tma_alignment_bytes))
max_stage_count = max_stages(op, arch=90)
if op.tile_description.stages is None:
op.tile_description.stages = max_stage_count
elif op.tile_description.stages > max_stage_count:
raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')
kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecialized'
if op.tile_description.persistent:
kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecializedPersistent'
transform_A = 'cute::identity'
transform_B = 'cute::identity'
values = common_values(op, op.tile_description.stages, transform_A, transform_B)
specific_values = {
'mainloop_type': 'cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized',
'kernel_schedule': ', ' + kernel_schedule,
'gmem_tiled_copy_A': cluster_shape_to_tma(op.tile_description.cluster_shape[1]),
'gmem_tiled_copy_B': cluster_shape_to_tma(op.tile_description.cluster_shape[0])
}
values.update(specific_values)
return SubstituteTemplate(EMISSION_STR, values)
def build_gmma_cpasync(op):
"""
Builds a collective operation declaration targeting cp.async GMMA kernels
:param op: GEMM operation for which to build a collective operation
:type op: pycutlass.GemmOperation
:return: string containing the C++ declaration of collective operation
:rtype: str
"""
A_cp_async_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % cp_async_min_alignment_bytes == 0
B_cp_async_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % cp_async_min_alignment_bytes == 0
if not A_cp_async_aligned or not B_cp_async_aligned:
raise Exception('Both the A and B operands must be aligned to {} bytes to use cp.async'.format(cp_async_min_alignment_bytes))
max_stage_count = max_stages(op, arch=90)
if op.tile_description.stages is None:
op.tile_description.stages = max_stage_count
elif op.tile_description.stages > max_stage_count:
raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')
transform_A = 'cute::identity'
transform_B = 'cute::identity'
thread_count = 128
cpasync_copy_A = make_cpasync_gmem_tiled_copy(thread_count, op.A.element, op.A.alignment, RowColMajorToGMMAMajor.A(op.A.layout, op.A.element),
op.tile_description.threadblock_shape[0], op.tile_description.threadblock_shape[2])
cpasync_copy_B = make_cpasync_gmem_tiled_copy(thread_count, op.B.element, op.B.alignment, RowColMajorToGMMAMajor.B(op.B.layout, op.B.element),
op.tile_description.threadblock_shape[1], op.tile_description.threadblock_shape[2])
values = common_values(op, op.tile_description.stages, transform_A, transform_B)
specific_values = {
'mainloop_type': 'cutlass::gemm::MainloopSm90CpAsyncGmma',
'kernel_schedule': '',
'gmem_tiled_copy_A': cpasync_copy_A,
'gmem_tiled_copy_B': cpasync_copy_B
}
values.update(specific_values)
return SubstituteTemplate(EMISSION_STR, values)
def build(operation):
"""
Builds a collective operation declaration targeting cp.async or TMA for GMMA kernels
:param operation: GEMM operation for which to build a collective operation
:type operation: pycutlass.GemmOperation
:return: string containing the C++ declaration of collective operation
:rtype: str
"""
A_tma_aligned = (library.DataTypeSizeBytes[operation.A.element] * operation.A.alignment) % tma_alignment_bytes == 0
B_tma_aligned = (library.DataTypeSizeBytes[operation.B.element] * operation.B.alignment) % tma_alignment_bytes == 0
tma_correct_size = (library.DataTypeSizeBytes[operation.A.element] == 2 and library.DataTypeSizeBytes[operation.B.element] == 2)
tma_correct_layout = (operation.A.layout == cutlass.RowMajor or operation.B.layout == cutlass.ColumnMajor)
if A_tma_aligned and B_tma_aligned and (tma_correct_size or tma_correct_layout):
return build_gmma_tma(operation)
else:
return build_gmma_cpasync(operation)
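# Illustration only: a standalone sketch of the TMA versus cp.async selection made by
# build() above. The helper name `select_mainloop` and its flattened parameters are
# hypothetical, and the default of 16 for `tma_alignment_bytes` is an assumption because
# that constant is defined outside this excerpt.

```python
def select_mainloop(bytes_a, align_a, bytes_b, align_b, a_row_major, b_col_major,
                    tma_alignment_bytes=16):
    """Return "tma" or "cpasync" following the same predicate as build()."""
    # Alignment is given in elements, so the byte alignment is size * alignment.
    a_aligned = (bytes_a * align_a) % tma_alignment_bytes == 0
    b_aligned = (bytes_b * align_b) % tma_alignment_bytes == 0
    correct_size = bytes_a == 2 and bytes_b == 2
    correct_layout = a_row_major or b_col_major
    if a_aligned and b_aligned and (correct_size or correct_layout):
        return "tma"
    return "cpasync"


# 2-byte operands with 8-element alignment are 16-byte aligned, so TMA is chosen.
print(select_mainloop(2, 8, 2, 8, False, False))  # tma
# 4-byte operands with column-major A and row-major B fall back to cp.async.
print(select_mainloop(4, 4, 4, 4, False, False))  # cpasync
```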

View File

@ -33,8 +33,6 @@
import ctypes
from pycutlass.library import *
# 12B
class GemmCoord_(ctypes.Structure):
_fields_ = [
@ -48,6 +46,24 @@ class GemmCoord_(ctypes.Structure):
setattr(self, field_name, getattr(gemm_coord, field_name)())
class GemmCoordBatched_(ctypes.Structure):
"""
Wrapper around a GemmCoord that also contains batch count. This is used for encoding
batched GEMM inputs to CUTLASS 3 GEMMs.
"""
_fields_ = [
("m", ctypes.c_int),
("n", ctypes.c_int),
("k", ctypes.c_int),
("batch_count", ctypes.c_int)
]
def __init__(self, gemm_coord, batch_count) -> None:
for field_name, _ in self._fields_[:-1]:
setattr(self, field_name, getattr(gemm_coord, field_name)())
setattr(self, "batch_count", batch_count)
class MatrixCoord_(ctypes.Structure):
_fields_ = [
("row", ctypes.c_int),
@ -55,6 +71,26 @@ class MatrixCoord_(ctypes.Structure):
]
class dim3_(ctypes.Structure):
_fields_ = [
("x", ctypes.c_int),
("y", ctypes.c_int),
("z", ctypes.c_int)
]
class StrideBatched_(ctypes.Structure):
"""
CUTLASS 3.0 strides for operands contain one static dimension and two variable dimensions. The
variable dimensions represent the stride along non-unit-stride dimension of the row/column major
layout, and the batch stride. This structure encodes the two variable dimensions.
"""
_fields_ = [
("major_stride", ctypes.c_int64),
("batch_stride", ctypes.c_int64)
]
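# Illustration only: a self-contained sketch of how structures shaped like GemmCoordBatched_
# and StrideBatched_ above are populated with ctypes. The class names ending in `Sketch` are
# hypothetical copies, and the stride values assume a densely packed, row-major 128 x 64
# operand in each batch.

```python
import ctypes


class CoordBatchedSketch(ctypes.Structure):
    # Same field order as GemmCoordBatched_: m, n, k, then the batch count.
    _fields_ = [
        ("m", ctypes.c_int),
        ("n", ctypes.c_int),
        ("k", ctypes.c_int),
        ("batch_count", ctypes.c_int),
    ]


class StrideBatchedSketch(ctypes.Structure):
    # Only the two runtime strides are stored; the unit stride is static in C++.
    _fields_ = [
        ("major_stride", ctypes.c_int64),
        ("batch_stride", ctypes.c_int64),
    ]


coord = CoordBatchedSketch(128, 256, 64, 4)
# Leading dimension of 64 elements, 128 * 64 elements between consecutive batches.
stride = StrideBatchedSketch(64, 128 * 64)
print(coord.batch_count, stride.major_stride, stride.batch_stride)
print(ctypes.sizeof(coord), ctypes.sizeof(stride))
```

# The field order and widths have to match the C++ argument structure byte for byte, which
# is why the strides use 64-bit integers while the problem size uses plain ints.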
dtype2ctype = {
cutlass.float16: ctypes.c_uint16,
cutlass.float32: ctypes.c_float,
@ -63,6 +99,28 @@ dtype2ctype = {
}
def get_gemm_arguments_3x(epilogue_functor):
_EpilogueOutputOpParams = epilogue_functor.epilogue_type
class _GemmArguments(ctypes.Structure):
_fields_ = [
("mode", ctypes.c_int),
("problem_size", GemmCoordBatched_),
("ptr_A", ctypes.c_void_p),
("stride_A", StrideBatched_),
("ptr_B", ctypes.c_void_p),
("stride_B", StrideBatched_),
("ptr_C", ctypes.c_void_p),
("stride_C", StrideBatched_),
("ptr_D", ctypes.c_void_p),
("stride_D", StrideBatched_),
("epilogue", _EpilogueOutputOpParams),
]
return _GemmArguments, _EpilogueOutputOpParams
def get_gemm_arguments(epilogue_functor):
_EpilogueOutputOpParams = epilogue_functor.epilogue_type
@ -103,8 +161,6 @@ def get_gemm_arguments(epilogue_functor):
# GEMM Grouped
###########################################################################################
# include/cutlass/gemm/kernel/gemm_grouped.h
def get_gemm_grouped_arguments(epilogue_functor):
_EpilogueOutputOpParams = epilogue_functor.epilogue_type
@ -131,12 +187,6 @@ def get_gemm_grouped_arguments(epilogue_functor):
# Convolution2D
############################################################################################
# We use the arguments as the interface
# include/cutlass/conv/conv2d_problem_size.h
# 64B
class Conv2DProblemSize(ctypes.Structure):
_fields_ = [
("N", ctypes.c_int),
@ -164,8 +214,6 @@ class Conv2DProblemSize(ctypes.Structure):
setattr(self, field_name, getattr(problem_size, field_name))
# include/cutlass/layout/tensor.h
# 12B
class Layout4D(ctypes.Structure):
_fields_ = [
("stride", ctypes.c_int * 3)
@ -175,13 +223,7 @@ class Layout4D(ctypes.Structure):
stride = tensor_ref.stride()
setattr(self, "stride", (stride.at(0), stride.at(1), stride.at(2)))
# TODO: Tensor 5-D takes ("stride", ctypes.c_int * 4)
# include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h
# TensorRef is basically cutlass::TensorRef<Element, Layout>;
# include/cutlass/tensor_ref.h
# 24B
class TensorRef_(ctypes.Structure):
_fields_ = [
("ptr", ctypes.c_void_p),
@ -200,9 +242,6 @@ class TensorRef2D_(ctypes.Structure):
]
# include/cutlass/conv/kernel/implicit_gemm_convolution.h
# split_k_mode: kNone: 0, kSerial: 1, kParallel: 2, kParallelSerial: 3, kInvalid: 4
def get_conv2d_arguments(epilogue_functor):
_EpilogueOutputOpParams = epilogue_functor.epilogue_type
@ -224,7 +263,6 @@ def get_conv2d_arguments(epilogue_functor):
# Reduction
############################################################################################
def get_reduction_params(epilogue_functor):
_EpilogueOutputParams = epilogue_functor.epilogue_type

View File

@ -29,6 +29,7 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
import cutlass
from cuda import cuda
@ -54,11 +55,11 @@ class CompilationOptions:
'''
#
def __init__(self, flags, architectures=[80], include_paths=[]):
def __init__(self, flags, arch, include_paths=[]):
self.includes = []
self.include_paths = include_paths
self.flags = flags
self.architectures = architectures
self.arch = arch
def get_str(self):
options = ""
@ -69,13 +70,11 @@ class CompilationOptions:
for incl in self.include_paths:
options += ' --include-path=%s' % incl
arch_list = "-arch="
for idx, arch in enumerate(self.architectures):
if idx:
arch_list += ","
arch_list += "sm_%d" % arch
arch_flag = " -arch=sm_%d" % self.arch
if self.arch == 90:
arch_flag += 'a'
options += arch_flag
options += " " + arch_list
return options
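# Illustration only: a standalone sketch of the single-architecture flag built in get_str()
# above. The helper name `arch_flag` is hypothetical. The `a` suffix selects the
# architecture-specific sm_90a target, which the Hopper-only features used by these kernels
# are understood to require.

```python
def arch_flag(arch):
    """Build the -arch option for one compute capability, e.g. 80 or 90."""
    flag = "-arch=sm_%d" % arch
    if arch == 90:
        # Hopper kernels are compiled for the architecture-specific target.
        flag += "a"
    return flag


print(arch_flag(80))  # -arch=sm_80
print(arch_flag(90))  # -arch=sm_90a
```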
#
@ -88,13 +87,11 @@ class CompilationOptions:
for incl in self.include_paths:
options.append(bytes(str.encode('--include-path=%s' % incl)))
arch_list = "-arch="
for idx, arch in enumerate(self.architectures):
if idx:
arch_list += ","
arch_list += "sm_%d" % arch
arch_flag = " -arch=sm_%d" % self.arch
if self.arch == 90:
arch_flag += 'a'
options.append(bytes(str.encode(arch_list)))
options.append(bytes(str.encode(arch_flag)))
return options
@ -138,12 +135,12 @@ class ArtifactManager:
def nvrtc(self):
self.backend = "nvrtc"
self.default_compile_options = [
'-std=c++11', '-default-device',
'-std=c++17', '-default-device'
]
def nvcc(self):
self.backend = "nvcc"
self.default_compile_options = [
'-std=c++11',
'-std=c++17', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored'
]
def insert_operation(self, op_key, cubin, hostfile, op_name, op_attrs):
connection = sqlite3.connect("./compiled_cache.db")
@ -158,7 +155,7 @@ class ArtifactManager:
connection.commit()
cursor.close()
def load_operation(self, op_key):
def load_operation(self, op_key, extra_funcs):
connection = sqlite3.connect("./compiled_cache.db")
cursor = connection.cursor()
sqlite_fetch_blob_query = """SELECT * from compiled_operations where op_key = ?"""
@ -194,12 +191,17 @@ class ArtifactManager:
if isinstance(attr, str):
func_name = operation_name + '_' + attr
func = getattr(host_lib, func_name)
# Set the return type of the function
if attr in extra_funcs and extra_funcs[attr] is not None:
func.restype = extra_funcs[attr]
compiled_host_fns[attr] = func
self.compiled_cache_host.insert(key, compiled_host_fns)
return True
def emit_compile_(self, operation_list, compilation_options):
def emit_compile_(self, operation_list, compilation_options, requires_nvcc_hostlib_compilation):
"""
Compile a list of kernels and store them into database
"""
@ -276,6 +278,7 @@ class ArtifactManager:
err, = nvrtc.nvrtcGetCUBIN(program, cubin_image)
if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
raise RuntimeError('NVRTC Error: {}'.format(err))
else: # with nvcc backend
# emit code
tempfile.tempdir = "./"
@ -303,22 +306,34 @@ class ArtifactManager:
with open(temp_cubin.name, 'rb') as file:
cubin_image = file.read()
# compile the host code
options = compilation_options.get()
cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
for opt in options:
opt = opt.decode("utf-8")
if opt not in ['-default-device', '-std=c++11', '-Xcicc', '-Xllc'] and '-arch=sm_' not in opt:
if '--include-path=' in opt:
cmd += " " + opt.replace('--include-path=', '-I')
else:
cmd += " " + opt
# Set up the host-side library code
if requires_nvcc_hostlib_compilation:
cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
cmd_template = "echo '%s'|${cuda_install_path}/bin/nvcc -x cu -Xcompiler=\"-fpermissive -w -fPIC\" ${options}" % source_buffer_host
cmd = SubstituteTemplate(
cmd_template,
{
"cuda_install_path": cuda_install_path,
"options": compilation_options.get_str()
})
else:
options = compilation_options.get()
cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
filtered_opts = ['-default-device', '-Xcicc', '-Xllc', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored']
for opt in options:
opt = opt.decode("utf-8")
if opt not in filtered_opts and '-arch=sm_' not in opt:
if '--include-path=' in opt:
cmd += " " + opt.replace('--include-path=', '-I')
else:
cmd += " " + opt
tempfile.tempdir = "./"
temp = tempfile.NamedTemporaryFile(
prefix='host_func', suffix='.so', delete=True)
cmd += ' - -shared -o %s' % temp.name
cmd += ' - -shared -o %s -lcudart -lcuda' % temp.name
os.system(cmd)
host_lib = ctypes.CDLL(temp.name)
@ -333,23 +348,25 @@ class ArtifactManager:
assert cutlass_path is not None, "Environment variable 'CUTLASS_PATH' is not defined."
cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
architectures = []
for operation in operations:
if hasattr(operation, "tile_description"):
cc = operation.arch
if cc not in architectures:
architectures.append(cc)
include_paths = [
cuda_install_path + '/include',
cutlass_path + '/include',
cutlass_path + '/tools/util/include',
cutlass_path + '/tools/library/scripts/pycutlass/src/cpp/include'
]
if pycutlass.DEVICE_CC is not None:
arch = pycutlass.DEVICE_CC
else:
# Find the maximum arch tag among the provided operations and compile for that target.
# Since we are compiling to .cubin files, only one architecture may be specified.
arch = max([op.arch for op in operations])
compile_options = CompilationOptions(
self.default_compile_options, architectures, include_paths)
self.default_compile_options, arch, include_paths)
# save the cubin
operation_key = []
operation_list = []
requires_nvcc_hostlib_compilation = False
for operation in operations:
# step 1: get kernel string as key
key = operation.rt_module.emit() + operation.procedural_name() + self.backend
@ -357,7 +374,7 @@ class ArtifactManager:
compiled_kernel = self.compiled_cache_device.at(key)
if compiled_kernel is None:
hit = self.load_operation(key)
hit = self.load_operation(key, getattr(operation.rt_module, 'extra_funcs', {}))
if hit:
compiled_kernel = self.compiled_cache_device.at(key)
assert compiled_kernel is not None
@ -371,9 +388,18 @@ class ArtifactManager:
else:
operation_list.append(operation.rt_module)
operation_key.append(key)
# Creating the Params structures for certain 3.0 kernels currently requires CUDA. For these cases, use NVCC to generate
# the PyCUTLASS host-side library. Otherwise, g++ will be used.
if isinstance(operation, pycutlass.gemm_operation.GemmOperationUniversal) and operation.api == pycutlass.library.ApiVersion.v3x:
if self.backend == "nvrtc":
raise RuntimeError('CUTLASS 3 kernels currently require NVCC for compilation.')
requires_nvcc_hostlib_compilation = True
if len(operation_list) > 0:
cubin_image, host_lib, host_file = self.emit_compile_(
operation_list, compile_options)
operation_list, compile_options, requires_nvcc_hostlib_compilation)
err, module = cuda.cuModuleLoadData(cubin_image)
if err != cuda.CUresult.CUDA_SUCCESS:
@ -417,9 +443,11 @@ class ArtifactManager:
op_attr.append(param_size)
if hasattr(operation, "extra_funcs"):
for suffix in operation.extra_funcs:
for suffix, ret_type in operation.extra_funcs.items():
func_name = operation.name() + '_' + suffix
func = getattr(host_lib, func_name)
if ret_type is not None:
func.restype = ret_type
setattr(operation, suffix, func)
compiled_host_fns[suffix] = func
op_attr.append(suffix)

View File

@ -463,13 +463,14 @@ class Conv2dOperation:
)
if self.stride_support == StrideSupport.Unity:
configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
else:
configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"
configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"
return SubstituteTemplate(
configuration_name,
{
'arch': str(self.arch),
'opcode_class': opcode_class_name,
'extended_name': self.extended_name(),
'threadblock': threadblock,
@ -509,7 +510,7 @@ class Conv2dOperation:
intermediate_type = ''
if self.tile_description.math_instruction.opcode_class == cutlass.OpClass.TensorOp:
inst_shape = "%d%d%d" % tuple(
inst_shape = "%dx%dx%d" % tuple(
self.tile_description.math_instruction.instruction_shape)
if self.tile_description.math_instruction.element_a != self.A.element and \
self.tile_description.math_instruction.element_a != self.accumulator_type():

View File

@ -111,6 +111,7 @@ class LinearCombination(EpilogueFunctorBase):
self.element_output = element_output
self.element_accumulator = element_accumulator
self.element_epilogue = element_epilogue
self.epilogue_vector_length = epilogue_vector_length
self.template_arguments = [
DataTypeTag[element_output], str(epilogue_vector_length),

View File

@ -36,6 +36,7 @@ import numpy as np
from typeguard import typechecked
import cutlass
from pycutlass import *
import pycutlass.builder.collective_op_builder as collective_op_builder
from cuda import cuda
@ -56,9 +57,9 @@ def transpose_layout(layout: cutlass.layout):
# @typechecked
class GemmArguments(ArgumentBase):
class GemmArguments2x(ArgumentBase):
"""
Argument wrapper for GEMM. It encodes problem information and
Argument wrapper for GEMM in CUTLASS 2. It encodes problem information and
user-provided tensors into the kernel's argument
:param operation: the GEMM operation to take the argument
@ -148,7 +149,7 @@ class GemmArguments(ArgumentBase):
self.batch_count = 1
self.split_k_slices = self.batch_count
if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:
if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:
if 'batch' in kwargs.keys():
self.batch_count = kwargs['batch']
else:
@ -313,6 +314,154 @@ class GemmArguments(ArgumentBase):
self.device_workspace = device_workspace
self.launch_config = launch_config
class GemmArguments3x(GemmArguments2x):
"""
Argument wrapper for GEMM in CUTLASS 3. It encodes problem information and
user-provided tensors into the kernel's argument
:param operation: the GEMM operation to take the argument
:type operation: :class:`pycutlass.GemmOperationUniversal` |
:class:`pycutlass.GemmOperationGrouped`
:param problem_size: GEMM problem size gemm(M, N, K)
:type problem_size: :class:`cutlass.gemm.GemmCoord`
:param A: tensor A
:type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param B: tensor B
:type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param C: tensor C
:type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param D: tensor D
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param gemm_mode: GEMM mode
:type gemm_mode: :class:`cutlass.gemm.Mode`
:param output_op: output operator, optional
:type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
"""
def __init__(
self, operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
if gemm_mode not in [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.Batched]:
raise Exception("Unsupported GEMM mode {}.".format(gemm_mode))
super().__init__(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
def get_arguments(self):
problem_size_ = GemmCoordBatched_(self.problem_size, self.batch_count)
if self.batch_count > 1:
bsA = self.batched_stride_A
bsB = self.batched_stride_B
bsC = self.batched_stride_C
bsD = self.batched_stride_D
else:
bsA = 0
bsB = 0
bsC = 0
bsD = 0
stride_A = StrideBatched_(self.lda, bsA)
stride_B = StrideBatched_(self.ldb, bsB)
stride_C = StrideBatched_(self.ldc, bsC)
stride_D = StrideBatched_(self.ldd, bsD)
self.arguments = self.operation.argument_type(
self.gemm_mode,
problem_size_,
int(self.ptr_A),
stride_A,
int(self.ptr_B),
stride_B,
int(self.ptr_C),
stride_C,
int(self.ptr_D),
stride_D,
self.output_op,
)
def initialize(self):
# get the host and device workspace
device_workspace_size = \
self.operation.rt_module.get_device_workspace_size(self)
if device_workspace_size > 0:
self.workspace_buffer = device_mem_alloc(device_workspace_size)
workspace_ptr = self.workspace_buffer.ptr
err, = cuda.cuMemsetD32(
workspace_ptr, 0, device_workspace_size // 4)
else:
workspace_ptr = None
device_workspace = 0
if (workspace_ptr is not None and
self.gemm_mode == cutlass.gemm.Mode.GemmSplitKParallel):
# in GEMM split-K parallel, the D pointer is redirected
# to the workspace
self.ptr_D = cuda.CUdeviceptr(workspace_ptr)
elif (workspace_ptr is not None and
self.gemm_mode == cutlass.gemm.Mode.Gemm):
# in GEMM split-K serial
device_workspace = workspace_ptr
self.get_arguments()
res_arg = self.operation.rt_module.get_args(
ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
host_workspace = bytearray(res_arg.contents)
grid = self.operation.rt_module.get_grid_shape(
ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
block = self.operation.rt_module.get_block_shape()
device_workspace = None
self.host_workspace = host_workspace
self.device_workspace = device_workspace
self.launch_config = LaunchConfiguration([grid.x, grid.y, grid.z],
[block.x, block.y, block.z],
self.operation.rt_module.shared_memory_capacity)
def GemmArguments(operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
"""
Argument wrapper for GEMM in CUTLASS 2 or 3. It returns either 2x arguments
or 3x arguments depending on the `api` field specified in `operation`.
:param operation: the GEMM operation to take the argument
:type operation: :class:`pycutlass.GemmOperationUniversal` |
:class:`pycutlass.GemmOperationGrouped`
:param problem_size: GEMM problem size gemm(M, N, K)
:type problem_size: :class:`cutlass.gemm.GemmCoord`
:param A: tensor A
:type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param B: tensor B
:type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param C: tensor C
:type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param D: tensor D
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param gemm_mode: GEMM mode
:type gemm_mode: :class:`cutlass.gemm.Mode`
:param output_op: output operator, optional
:type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
"""
ArgClass = GemmArguments3x if operation.api == ApiVersion.v3x else GemmArguments2x
return ArgClass(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
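# Illustration only: a minimal sketch of the factory pattern used by GemmArguments above,
# where a function with the old class name picks the 2.x or 3.x argument class from the
# operation's API version. All names here (`ApiVersionSketch`, `Args2x`, `Args3x`,
# `OpSketch`, `make_args`) are hypothetical stand-ins for the PyCUTLASS types.

```python
from enum import Enum


class ApiVersionSketch(Enum):
    v2x = 0
    v3x = 1


class Args2x:
    def __init__(self, operation, problem_size):
        self.operation = operation
        self.problem_size = problem_size


class Args3x(Args2x):
    """Reuses the 2.x setup and would override only the packing of arguments."""


class OpSketch:
    def __init__(self, api):
        self.api = api


def make_args(operation, problem_size):
    # Existing call sites keep a single entry point; dispatch happens here.
    arg_class = Args3x if operation.api == ApiVersionSketch.v3x else Args2x
    return arg_class(operation, problem_size)


print(type(make_args(OpSketch(ApiVersionSketch.v3x), (128, 128, 64))).__name__)  # Args3x
print(type(make_args(OpSketch(ApiVersionSketch.v2x), (128, 128, 64))).__name__)  # Args2x
```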
class GemmGroupedArguments:
"""
@ -383,7 +532,7 @@ class GemmGroupedArguments:
# process the input arguments
for idx, problem_size in enumerate(problem_sizes):
M, N, K = problem_size.m(), problem_size.n(), problem_size.k()
temp_argument = GemmArguments(
temp_argument = GemmArguments2x(
operation=operation,
problem_size=cutlass.gemm.GemmCoord(M, N, K),
A=A[idx], B=B[idx], C=C[idx], D=D[idx],
@ -657,16 +806,164 @@ extern "C" {
#
workspace_bytes = 4 * arguments.grid_tiled_shape.x * arguments.grid_tiled_shape.y
# TODO: get extra workspace size
# see https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm_universal_base.h
return workspace_bytes
################################################################################
# Runtime module for GEMM Universal within CUTLASS 3
################################################################################
class GemmRTUniversal3x(GemmRTUniversal):
"""
GemmRTUniversal3x manages the CUTLASS 3 runtime components
"""
KernelTemplate = r'''
using Operator = ${operation_name}${operation_suffix};
extern "C"
__global__ __launch_bounds__(Operator::MaxThreadsPerBlock, Operator::MinBlocksPerMultiprocessor)
void ${operation_name}(__grid_constant__ typename Operator::Params const params) {
// Dynamic shared memory base pointer
extern __shared__ char smem[];
// Instantiate the kernel operator and run it on the shared memory buffer.
Operator op;
op(params, smem);
}
'''
HostTemplate = r'''
extern "C" {
// Get the size of params in bytes
int ${operation_name}_get_param_size(){
return sizeof(${operation_name}${operation_suffix}::Params);
}
// Get the size of dynamic shared memory in bytes
int ${operation_name}_shared_memory_size() {
return ${operation_name}${operation_suffix}::SharedStorageSize;
}
using GemmType = ${operation_name}_base;
// Get the params as byte array
char* ${operation_name}_get_params(GemmType::Arguments* argument, int* workspace){
GemmType::Params params = GemmType::to_underlying_arguments(*argument, workspace);
char *bytes = ((char*)(&params));
char *output = new char[sizeof(GemmType::Params)];
for (unsigned int i = 0; i < sizeof(GemmType::Params); i ++)
output[i] = bytes[i];
return output;
}
// Get the grid shape
dim3 ${operation_name}_get_grid_shape(GemmType::Arguments* args, int* workspace) {
auto tmp_params = GemmType::to_underlying_arguments(*args, workspace);
return GemmType::get_grid_shape(tmp_params);
}
// Get the block shape
dim3 ${operation_name}_get_block_shape() {
return GemmType::get_block_shape();
}
}
'''
def __init__(self, operation: 'GemmOperation'):
super(GemmRTUniversal3x, self).__init__(operation)
self.extra_funcs = {
'get_grid_shape': dim3_,
'get_block_shape': dim3_
}
self.emitter = EmitGemmUniversalInstance3x('_type')
self.argument_type, self.epilogue_type = get_gemm_arguments_3x(operation.epilogue_functor)
class EmitGemmUniversalInstance3x:
''' Responsible for emitting a CUTLASS 3 template definition'''
def __init__(self, operation_suffix=''):
self.operation_suffix = operation_suffix
self.includes = [
"cutlass/cutlass.h",
"cute/tensor.hpp",
"cute/atom/mma_atom.hpp",
"cutlass/numeric_types.h",
"cutlass/gemm/kernel/gemm_universal.hpp",
"cutlass/gemm/collective/collective_builder.hpp",
"cutlass/epilogue/collective/default_epilogue.hpp",
"cutlass/epilogue/thread/linear_combination.h"
]
self.gemm_template = """
using namespace cute;
${collective_op}
using EpilogueOp = cutlass::epilogue::collective::DefaultEpilogue<
cutlass::gemm::TagToStrideC_t<${layout_c}>,
cutlass::gemm::TagToStrideC_t<${layout_c}>,
${epilogue_functor}
>;
// Gemm operator ${operation_name}
using ${operation_name}_base = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveOp,
EpilogueOp
>;
// Define named type
struct ${operation_name}${operation_suffix} :
public ${operation_name}_base { };
"""
#
def emit(self, operation):
instance_layout_A, instance_layout_B, instance_layout_C = \
(operation.A.layout, operation.B.layout, operation.C.layout)
# Support built-in epilogue functors or user-defined functions
epilogue_functor = operation.epilogue_functor.emit()
collective_op = collective_op_builder.build(operation)
values = {
'operation_name': operation.procedural_name(),
'operation_suffix': self.operation_suffix,
'collective_op': collective_op,
'element_a': DataTypeTag[operation.A.element],
'layout_a': LayoutTag[instance_layout_A],
'element_b': DataTypeTag[operation.B.element],
'layout_b': LayoutTag[instance_layout_B],
'element_c': DataTypeTag[operation.C.element],
'layout_c': LayoutTag[instance_layout_C],
'epilogue_functor': epilogue_functor,
'element_output': DataTypeTag[operation.epilogue_functor.element_output],
'element_accumulator': DataTypeTag[operation.accumulator_type()],
'element_epilogue': DataTypeTag[operation.epilogue_functor.element_epilogue],
'epilogue_vector_length': str(operation.epilogue_functor.epilogue_vector_length),
'opcode_class': OpcodeClassTag[operation.tile_description.math_instruction.opcode_class],
'arch': "cutlass::arch::Sm%d" % operation.arch,
'threadblock_shape_m': str(operation.tile_description.threadblock_shape[0]),
'threadblock_shape_n': str(operation.tile_description.threadblock_shape[1]),
'threadblock_shape_k': str(operation.tile_description.threadblock_shape[2]),
'cluster_shape_m': str(operation.tile_description.cluster_shape[0]),
'cluster_shape_n': str(operation.tile_description.cluster_shape[1]),
'cluster_shape_k': str(operation.tile_description.cluster_shape[2]),
'align_a': str(operation.A.alignment),
'align_b': str(operation.B.alignment)
}
values['epilogue_functor'] = operation.epilogue_functor.emit()
return SubstituteTemplate(self.gemm_template, values)
###################################################################################################
# Runtime module for GEMM Grouped
###################################################################################################
class GemmRTGrouped(GemmRTbase):
"""
GemmRTGrouped manages the CUTLASS runtime components
@ -713,7 +1010,7 @@ class GemmRTGrouped(GemmRTbase):
def __init__(self, operation: 'GemmOperation'):
super(GemmRTGrouped, self).__init__(operation)
self.extra_funcs = ['precompute']
self.extra_funcs = {'precompute': None}
self.emitter = EmitGemmGroupedInstance('_type')
self.argument_type, self.epilogue_type = get_gemm_grouped_arguments(operation.epilogue_functor)
@ -761,7 +1058,7 @@ class GemmOperationBase:
self, gemm_kind, arch, tile_description: TileDescription,
A: TensorDescription, B: TensorDescription, C: TensorDescription,
epilogue_functor,
swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
swizzling_functor=cutlass.IdentitySwizzle1, api=False, **kwargs):
#: operation kind
self.operation_kind: OperationKind = OperationKind.Gemm
@ -772,8 +1069,11 @@ class GemmOperationBase:
#: gemm kind
self.gemm_kind: GemmKind = gemm_kind
self.api = api
self.prefix = "3x" if self.api == ApiVersion.v3x else ""
# use deep copy to avoid overwriting the original TensorDescription
if C.layout == cutlass.ColumnMajor:
if self.api != ApiVersion.v3x and C.layout == cutlass.ColumnMajor:
#: Operand A
self.A: TensorDescription = copy.deepcopy(B)
#: Operand B
@ -800,7 +1100,6 @@ class GemmOperationBase:
self.direct_store = kwargs["direct_store"]
else:
self.direct_store = False
if "visitor" in kwargs:
self.visitor = kwargs["visitor"]
else:
@ -872,8 +1171,11 @@ class GemmOperationBase:
math_op_string = math_operations_map[math_op] if math_op in math_operations_map.keys(
) else ''
inst_shape = "%d%d%d" % tuple(
self.tile_description.math_instruction.instruction_shape)
if self.tile_description.math_instruction.instruction_shape is not None:
inst_shape = "%dx%dx%d" % tuple(
self.tile_description.math_instruction.instruction_shape)
else:
inst_shape = "Default"
inst_shape += math_op_string
if self.tile_description.math_instruction.element_a != self.A.element and \
@ -905,6 +1207,17 @@ class GemmOperationBase:
return extended_name
#
def extended_name_3x(self):
'''Generates a string representing the MMA atom. Assumes accumulator type is C type.'''
extended_name = "{core_name}_{element_a}_{element_b}_{element_acc}_{element_c}".format(
element_a = DataTypeNames[self.A.element],
element_b = DataTypeNames[self.B.element],
element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator],
element_c = DataTypeNames[self.C.element],
core_name = self.core_name())
return extended_name
#
def layout_name(self):
if self.is_complex() or self.is_planar_complex():
@ -916,25 +1229,49 @@ class GemmOperationBase:
)
return "%s%s" % (ShortLayoutTypeNames[self.A.layout], ShortLayoutTypeNames[self.B.layout])
# Generates a short string representing the ABC layout tags (e.g. ntn or tnn)
def layout_name_3x(self):
if self.is_complex() or self.is_planar_complex():
return "{}{}{}".format(
ShortComplexLayoutNames[(self.A.layout, self.A.complex_transform)],
ShortComplexLayoutNames[(self.B.layout, self.B.complex_transform)],
ShortComplexLayoutNames[(self.C.layout, self.C.complex_transform)])
else:
return "{}{}{}".format(
ShortLayoutTypeNames[self.A.layout],
ShortLayoutTypeNames[self.B.layout],
ShortLayoutTypeNames[self.C.layout])
#
def procedural_name(self):
''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
threadblock = self.tile_description.procedural_name()
opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
alignment = max([self.A.alignment, self.B.alignment, self.C.alignment])
return SubstituteTemplate(
"cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}",
{
'opcode_class': opcode_class_name,
'extended_name': self.extended_name(),
'threadblock': threadblock,
'layout': self.layout_name(),
'alignment': "%d" % self.A.alignment,
}
)
if self.api == ApiVersion.v3x and self.arch >= 90:
kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{l}_{s}_align{al}"
return kernel_name_template.format(
p = self.prefix,
ar = self.arch,
op = opcode_class_name,
ex = self.extended_name_3x(),
tbm = self.tile_description.threadblock_shape[0],
tbn = self.tile_description.threadblock_shape[1],
tbk = self.tile_description.threadblock_shape[2],
cm = self.tile_description.cluster_shape[0],
cn = self.tile_description.cluster_shape[1],
ck = self.tile_description.cluster_shape[2],
l = self.tile_description.stages,
s = self.layout_name_3x(),
al = str(self.A.alignment))
else:
threadblock = self.tile_description.procedural_name()
return "cutlass{p}_sm{ar}_{op}_{ex}_{tb}_{l}_align{a}".format(
p = self.prefix,
ar = self.arch,
op = opcode_class_name,
ex = self.extended_name(),
tb = threadblock,
l = self.layout_name(),
a = str(self.A.alignment))
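To make the SM90 naming scheme in `procedural_name` concrete, the following is a minimal standalone sketch of how such a name is assembled. The helper `kernel_name_3x` and all sample values (including the extended name `gemm_f16_f16_f32_f16`) are hypothetical and only illustrate the template string; they are not part of PyCUTLASS.

```python
def kernel_name_3x(arch, opcode_class, extended_name, threadblock_shape,
                   cluster_shape, stages, layout, alignment):
    """Hypothetical helper mirroring the 3.x kernel name template."""
    template = ("cutlass{p}_sm{ar}_{op}_{ex}_{tbm}x{tbn}x{tbk}"
                "_{cm}x{cn}x{ck}_{l}_{s}_align{al}")
    return template.format(
        p="3x",                      # prefix used for the CUTLASS 3.x API
        ar=arch,
        op=opcode_class,
        ex=extended_name,
        tbm=threadblock_shape[0],
        tbn=threadblock_shape[1],
        tbk=threadblock_shape[2],
        cm=cluster_shape[0],
        cn=cluster_shape[1],
        ck=cluster_shape[2],
        l=stages,                    # note: `l` carries the stage count
        s=layout,                    # and `s` carries the A/B/C layout string
        al=str(alignment))


name = kernel_name_3x(90, "tensorop", "gemm_f16_f16_f32_f16",
                      [128, 128, 64], [2, 1, 1], 4, "tnt", 8)
print(name)
# cutlass3x_sm90_tensorop_gemm_f16_f16_f32_f16_128x128x64_2x1x1_4_tnt_align8
```

Note that in the 3.x branch the placeholder `{l}` holds the stage count and `{s}` the layout string, whereas the fallback branch uses `{l}` for the layout.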
#
def configuration_name(self):
@ -945,9 +1282,14 @@ class GemmOperationBase:
class GemmOperationUniversal(GemmOperationBase):
def __init__(self, arch, tile_description: TileDescription, A: TensorDescription, B, C,
epilogue_functor, swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
api = api_version(arch, tile_description.math_instruction.opcode_class, A.element)
super(GemmOperationUniversal, self).__init__(GemmKind.Universal, arch, tile_description,
A, B, C, epilogue_functor, swizzling_functor, **kwargs)
self.rt_module = GemmRTUniversal(self)
A, B, C, epilogue_functor, swizzling_functor,
api=api, **kwargs)
if api == ApiVersion.v3x:
self.rt_module = GemmRTUniversal3x(self)
else:
self.rt_module = GemmRTUniversal(self)
self.argument_type = self.rt_module.argument_type
self.epilogue_type = self.rt_module.epilogue_type

View File

@ -36,6 +36,7 @@ import re
import enum
import cutlass
import cute
# The following block implements enum.auto() for Python 3.5 variants that don't include it such
# as the default 3.5.2 on Ubuntu 16.04.
@ -182,6 +183,30 @@ DataTypeSize = {
cutlass.dtype.cs64: 128,
}
class DataTypeSizeBytes:
"""
Static class to mimic the `DataTypeSize` dictionary, but with checks for whether the
data type key is less than a full byte or a non-integer number of bytes.
"""
@staticmethod
def __class_getitem__(datatype):
"""
Returns the size of the data type in bytes. Raises an exception if the data type
is either less than a full byte or a non-integer number of bytes in size.
:param datatype: data type to query
:return: number of bytes the data type occupies
:rtype: int
"""
bits = DataTypeSize[datatype]
if bits < 8:
raise Exception('Data type {} is less than one byte in size.'.format(datatype))
elif bits % 8 != 0:
raise Exception('Data type {} is not an integer number of bytes.'.format(datatype))
return bits // 8
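The bits-to-bytes check above can be tried in isolation. This is a minimal sketch that assumes a small stand-in table `DATA_TYPE_SIZE_BITS` keyed by strings; the real `DataTypeSize` dictionary is keyed by `cutlass` data type objects, and the helper name `size_in_bytes` is hypothetical.

```python
# Hypothetical stand-in for DataTypeSize: data type name -> size in bits.
DATA_TYPE_SIZE_BITS = {"b1": 1, "s4": 4, "f16": 16, "f32": 32, "cf64": 128}


def size_in_bytes(datatype):
    """Return the size of `datatype` in whole bytes, rejecting sub-byte types."""
    bits = DATA_TYPE_SIZE_BITS[datatype]
    if bits < 8:
        raise ValueError("Data type {} is less than one byte in size.".format(datatype))
    if bits % 8 != 0:
        raise ValueError("Data type {} is not an integer number of bytes.".format(datatype))
    return bits // 8


print(size_in_bytes("f16"))   # 2
print(size_in_bytes("cf64"))  # 16

try:
    size_in_bytes("s4")
except ValueError as err:
    print(err)                # Data type s4 is less than one byte in size.
```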
###################################################################################################
#
@ -350,6 +375,12 @@ ShortComplexLayoutNames = {
(cutlass.RowMajor, cutlass.complex_transform.conj): 'h'
}
#
CuTeLayoutTag = {
cute.GMMAMajor.K: 'cute::GMMA::Major::K',
cute.GMMAMajor.MN: 'cute::GMMA::Major::MN'
}
###################################################################################################
#
@ -436,7 +467,6 @@ OpcodeClassTag = {
#
class OperationKind(enum.Enum):
Gemm = enum_auto()
RankK = enum_auto()
@ -460,16 +490,19 @@ ArchitectureNames = {
70: 'volta',
75: 'turing',
80: 'ampere',
90: 'hopper'
}
#
SharedMemPerCC = {
70: 96, # 96KB of SMEM
72: 96, # 96KB of SMEM
75: 64, # 64KB of SMEM
80: 160, # 164KB of SMEM - 4KB reserved for the driver
86: 100, # 100KB of SMEM
87: 160, # 164KB of SMEM - 4KB reserved for the driver
70: 96 << 10, # 96KB of SMEM
72: 96 << 10, # 96KB of SMEM
75: 64 << 10, # 64KB of SMEM
80: 160 << 10, # 164KB of SMEM - 4KB reserved for the driver
86: 100 << 10, # 100KB of SMEM
87: 160 << 10, # 164KB of SMEM - 4KB reserved for the driver
89: 100 << 10, # 100KB of SMEM
90: 227 << 10, # 228KB of SMEM - 1KB reserved for the driver
}
###################################################################################################
@ -646,7 +679,21 @@ ConvModeTag = {
class MathInstruction:
"""
Description of the lowest-level matrix-multiply-accumulate operation to be used in a kernel
"""
def __init__(self, instruction_shape, element_a, element_b, element_accumulator, opcode_class=cutlass.OpClass.Simt, math_operation=MathOperation.multiply_add):
"""
:param instruction_shape: size of the [M, N, K] dimensions of the instruction
:type instruction_shape: list or tuple
:param element_a: data type of operand A
:param element_b: data type of operand B
:param element_accumulator: data type used in accumulation
:param opcode_class: higher-level class of the instruction (e.g., SIMT or Tensor Core)
:type opcode_class: cutlass.OpClass
:param math_operation: the type of low-level operation to be performed (e.g., multiply accumulate)
:type math_operation: MathOperation
"""
self.instruction_shape = instruction_shape
self.element_a = element_a
self.element_b = element_b
@ -658,24 +705,65 @@ class MathInstruction:
class TileDescription:
def __init__(self, threadblock_shape, stages, warp_count, math_instruction):
"""
Description of a tile of computation to be performed in the kernel, encompassing threadblock, cluster, and warp shapes,
stage count, and math instruction specification
"""
def __init__(self, threadblock_shape, stages, warp_count, math_instruction, cluster_shape=[1, 1, 1], persistent=False):
"""
:param threadblock_shape: shape of a threadblock tile
:type threadblock_shape: list or tuple
:param stages: number of pipeline stages in the operation. For SM90 kernels, this can be set to `None` and the maximum
number of stages that can be supported for an operation on a given architecture will be computed at a later time
:type stages: int or None
:param warp_count: number of warps in each [M, N, K] dimension of a threadblock tile
:type warp_count: list, tuple, or None
:param math_instruction: specification of the instruction type and shape to be performed and the types of its operands
:type math_instruction: MathInstruction
:param cluster_shape: number of threadblocks in the [X, Y, Z] dimensions of a threadblock cluster
:param persistent: whether the kernel uses persistent warp-specialized threadblocks (only available for SM90+)
:type persistent: bool
"""
self.threadblock_shape = threadblock_shape
#: number of pipeline stages
self.cluster_shape = cluster_shape
self.persistent: bool = persistent
self.stages: int = stages
#: number of warps along x, y, z directions
self.warp_count: list[int] = warp_count
self.math_instruction = math_instruction
#: number threads per threadblock
self.num_threads: int = 32
for cnt in self.warp_count:
self.num_threads *= cnt
# Number of warps along x, y, z directions
self.warp_count = warp_count
@property
def num_threads(self):
"""
Returns the number of threads in the threadblock
:return: number of threads in the threadblock
:rtype: int or None (if warp count is None)
"""
if self.warp_count is not None:
threads = 32
for cnt in self.warp_count:
threads *= cnt
return threads
return None
def procedural_name(self):
return "%dx%d_%dx%d" % (self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], self.stages)
"""
Returns a name identifying the tile description
:return: name identifying the tile description
:rtype: str
"""
emit_stages = 0 if self.stages is None else self.stages
name = "%dx%dx%d_%dx%d_%dx%d" % (
self.cluster_shape[0], self.cluster_shape[1], self.cluster_shape[2],
self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], emit_stages)
if self.persistent:
name += '_persistent'
return name
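As a quick illustration of the two `TileDescription` helpers above, here is a standalone sketch of the thread count and tile name computations. The free functions `num_threads` and `tile_name` are hypothetical rewrites for demonstration and take plain lists instead of a `TileDescription` instance.

```python
def num_threads(warp_count):
    """Threads per threadblock: 32 threads per warp times the warp counts."""
    if warp_count is None:
        return None
    threads = 32
    for cnt in warp_count:
        threads *= cnt
    return threads


def tile_name(cluster_shape, threadblock_shape, stages, persistent=False):
    """Build the tile name; a stage count of None is emitted as 0."""
    emit_stages = 0 if stages is None else stages
    name = "%dx%dx%d_%dx%d_%dx%d" % (
        cluster_shape[0], cluster_shape[1], cluster_shape[2],
        threadblock_shape[0], threadblock_shape[1], threadblock_shape[2],
        emit_stages)
    if persistent:
        name += "_persistent"
    return name


print(num_threads([4, 2, 1]))                                    # 256
print(num_threads(None))                                         # None
print(tile_name([1, 1, 1], [128, 128, 32], 3))                   # 1x1x1_128x128_32x3
print(tile_name([2, 1, 1], [128, 128, 64], None, persistent=True))
# 2x1x1_128x128_64x0_persistent
```

The format string groups the K extent of the threadblock with the stage count (`64x0`), which matches the pre-existing `MxN_KxStages` convention with the cluster shape prepended.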
#
@ -715,30 +803,68 @@ class TriangularTensorDescription:
###################################################################################################
#
def CalculateSmemUsagePerStage(operation):
"""
Returns the amount of shared memory in bytes consumed in a single stage of a kernel.
:param operation: operation whose per-stage shared memory usage is computed. Any stage
count set via `operation.tile_description.stages` is ignored
in the present calculation
:type operation: pycutlass.Operation
def CalculateSmemUsage(operation):
cta_shape = operation.tile_description.threadblock_shape
stages = operation.tile_description.stages
:return: number of bytes of shared memory consumed by a single stage
:rtype: int
"""
m, n, k = operation.tile_description.threadblock_shape
if operation.operation_kind == OperationKind.Gemm and operation.gemm_kind == GemmKind.Sparse:
# Elements represented by 8 bits of metadata (based on 4:8, 2:4 or 1:2 sparsity)
if DataTypeSize[operation.A.element] == 32:
elements_per_8b_md = 2
elif DataTypeSize[operation.A.element] == 4:
elements_per_8b_md = 8
else:
elements_per_8b_md = 4
smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * (cta_shape[2] // 2) // 8 + \
DataTypeSize[operation.B.element] * cta_shape[1] * cta_shape[2] // 8 + \
cta_shape[0] * (cta_shape[2] // 2) // elements_per_8b_md
if operation.operation_kind == OperationKind.Gemm:
stage_barrier_bytes = 32
return (DataTypeSize[operation.A.element] * m * k // 8) + \
(DataTypeSize[operation.B.element] * k * n // 8) + stage_barrier_bytes
else:
# Few BLAS3 operations only have A tensor
smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * cta_shape[2] // 8 + \
DataTypeSize[operation.A.element] * \
cta_shape[1] * cta_shape[2] // 8
raise Exception('Unsupported operation kind {}.'.format(operation.operation_kind))
#
def CalculateSmemUsage(operation):
"""
Returns the amount of shared memory in bytes consumed by a kernel.
:param operation: operation whose total shared memory usage is computed. The stage
count is taken from `operation.tile_description.stages` and multiplied
by the per-stage usage
:type operation: pycutlass.Operation
:return: number of bytes of shared memory consumed by the kernel
:rtype: int
"""
return operation.tile_description.stages * CalculateSmemUsagePerStage(operation)
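The per-stage formula can be checked by hand. The sketch below reproduces the arithmetic for a GEMM tile and, as an assumption not shown in this diff, derives an upper bound on the stage count by dividing the shared memory capacity by the per-stage usage. The helper names and the `max_stages` bound are hypothetical; the bound ignores any shared memory the epilogue may need.

```python
# Shared memory capacity in bytes for two compute capabilities (from SharedMemPerCC).
SMEM_CAPACITY_BYTES = {80: 160 << 10, 90: 227 << 10}


def smem_bytes_per_stage(bits_a, bits_b, m, n, k, stage_barrier_bytes=32):
    """Bytes for one A tile (m x k), one B tile (k x n), plus barrier storage."""
    return (bits_a * m * k // 8) + (bits_b * k * n // 8) + stage_barrier_bytes


def smem_bytes_total(stages, bits_a, bits_b, m, n, k):
    """Total usage is the stage count times the per-stage usage."""
    return stages * smem_bytes_per_stage(bits_a, bits_b, m, n, k)


def max_stages(cc, bits_a, bits_b, m, n, k):
    """Assumed upper bound: how many whole stages fit in shared memory."""
    return SMEM_CAPACITY_BYTES[cc] // smem_bytes_per_stage(bits_a, bits_b, m, n, k)


# f16 operands with a 128x128x64 threadblock tile
print(smem_bytes_per_stage(16, 16, 128, 128, 64))  # 32800
print(smem_bytes_total(4, 16, 16, 128, 128, 64))   # 131200
print(max_stages(90, 16, 16, 128, 128, 64))        # 7
print(max_stages(80, 16, 16, 128, 128, 64))        # 4
```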
class ApiVersion(enum.Enum):
"""
Differentiate between CUTLASS 2.x and 3.x API versions
"""
v2x = enum_auto()
v3x = enum_auto()
def api_version(arch, opclass, datatype):
"""
Returns whether the architecture, opcode class, and datatype in question require using CUTLASS 2.x
or 3.x for code emission.
:param arch: compute capability of device on which to run
:type arch: int
:param opclass: class of the operation being performed
:type opclass: cutlass.OpClass
:param datatype: data type to be used in operation (assumes that ElementA and ElementB are the same)
:return: API version to be used in code emission
:rtype: ApiVersion
"""
if arch >= 90 and opclass == cutlass.OpClass.TensorOp and (datatype != cutlass.float64):
return ApiVersion.v3x
else:
return ApiVersion.v2x
smem_usage = smem_per_stage * stages
return (smem_usage >> 10)
###################################################################################################

View File

@ -32,6 +32,12 @@
import ctypes
from cuda import cuda
from pycutlass.utils.device import device_cc
from cuda import __version__ as __cuda_version__
_version_splits = [int(x) for x in __cuda_version__.split('.')]
supports_cluster_launch = device_cc() >= 90 and (_version_splits[0] > 11 or (_version_splits[0] == 11 and _version_splits[1] >= 8))
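The cluster-launch gate above combines a compute capability check with a version check on the `cuda` Python package. A minimal standalone sketch of the same decision follows; the function `cluster_launch_supported` is hypothetical, and it parses only the major and minor components so that a version string with a non-numeric patch suffix would not raise.

```python
def cluster_launch_supported(compute_capability, cuda_python_version):
    """Return True if the device is SM90+ and the cuda package is 11.8 or newer."""
    major, minor = (int(x) for x in cuda_python_version.split(".")[:2])
    return compute_capability >= 90 and (major, minor) >= (11, 8)


print(cluster_launch_supported(90, "12.0.0"))  # True
print(cluster_launch_supported(90, "11.8.1"))  # True
print(cluster_launch_supported(90, "11.7.0"))  # False
print(cluster_launch_supported(80, "12.0.0"))  # False
```

Comparing `(major, minor)` tuples is equivalent to the explicit `major > 11 or (major == 11 and minor >= 8)` test used in the module.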
################################################################################
#
@ -90,21 +96,58 @@ class ExecutableOperation:
def initialize(self, host_workspace, device_workspace, launch_config, arguments, stream=cuda.CUstream(0)):
raise NotImplementedError()
#
def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
def run_with_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
if hasattr(self.operation, 'tile_description') and hasattr(self.operation.tile_description, 'cluster_shape'):
attr = cuda.CUlaunchAttribute()
attr.value.clusterDim.x, attr.value.clusterDim.y, attr.value.clusterDim.z = self.operation.tile_description.cluster_shape
attr.id = cuda.CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
attrs = [attr]
cArg = (ctypes.c_char * len(host_workspace)
).from_buffer(host_workspace)
packed = (ctypes.c_void_p * 1)()
packed[0] = ctypes.addressof(cArg)
# Allow for non-portable cluster sizes
err, = cuda.cuFuncSetAttribute(
self.kernel, cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NON_PORTABLE_CLUSTER_SIZE_ALLOWED, 1)
if err != cuda.CUresult.CUDA_SUCCESS:
return err
else:
attrs = []
config = cuda.CUlaunchConfig()
config.gridDimX, config.gridDimY, config.gridDimZ = launch_config.grid
config.blockDimX, config.blockDimY, config.blockDimZ = launch_config.block
config.blockDimZ = launch_config.block[2]
config.sharedMemBytes = launch_config.shared_memory_capacity
config.hStream = stream
config.attrs = attrs
config.numAttrs = len(attrs)
err, = cuda.cuLaunchKernelEx(config, f=self.kernel, kernelParams=kernel_params, extra=0)
return err
#
def run_without_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
err, = cuda.cuLaunchKernel(
self.kernel,
launch_config.grid[0], launch_config.grid[1], launch_config.grid[2],
launch_config.block[0], launch_config.block[1], launch_config.block[2],
launch_config.shared_memory_capacity,
stream,
packed,
kernel_params,
0)
return err
#
def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
cArg = (ctypes.c_char * len(host_workspace)
).from_buffer(host_workspace)
packed = (ctypes.c_void_p * 1)()
packed[0] = ctypes.addressof(cArg)
if supports_cluster_launch:
return self.run_with_clusters(launch_config, packed, stream)
else:
return self.run_without_clusters(launch_config, packed, stream)

View File

@ -543,7 +543,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
self.elements_per_access = elements_per_access
self.element_compute = element_compute
self.element_output = element_output
# TODO: deprecate this
self.elementwise_functor = elementwise_functor
pass
@ -554,11 +553,8 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
#
tree = function.epilogue_tree
self.tree = tree
# self.tree.show() # for debug
function.pass_binary_2_unary(self.tree, self.tree.root)
# self.tree.show() # for debug
function.pass_inject_reduction(self.tree, self.tree.root)
# self.tree.show() # for debug
function.pass_inject_epilogue_op(self.tree,self.tree.root)
visitor = self.tree.get_node(self.tree.root).data.epilogue_node
@ -575,7 +571,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
if input_key == "accum":
continue
if function.input_args[input_key][0] == "scalar":
# _kwargs[input_key] = kwargs[input_key]
continue
# tensor input
else:

View File

@ -265,15 +265,6 @@ class Conv2dLauncher:
flops_total_ = flops_mainloop_ + flops_epilogue_
# TODO complex-value support
# switch (operation_desc.tile_description.math_instruction.math_operation) {
# case library::MathOperationID::kMultiplyAddComplex:
# flops_total_ *=4;
# break;
# default: break;
# }
return flops_total_
@ -511,9 +502,8 @@ class Conv2dLauncher:
# (conv_blacklist_sizes)
############################################################################################################
def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False): # TODO: conv_test_sizes and conv_blacklist_sizes
def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False):
passed = True
#
# Testbed object
#
@ -529,8 +519,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
# Vector of conv2d problem sizes to avoid duplicate runs
conv_tested_sizes = []
# TODO: include resnet 50 sizes, user sepecified sizes, and rigorous sizes
# Flatten 2D problem_vectors into a 1D problem sizes
problem_sizes = conv_problems.conv2d_default_sizes
@ -539,7 +527,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
# Sweep conv2d problem sizes (split-k-mode=kSerial, split-k-slices=1, alpha=1.0, beta=0.0)
for conv_problem in problem_sizes:
# TODO: skip blacklist problem sizes
if conv_problem in conv_tested_sizes:
continue
@ -585,9 +572,8 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
passed = testbed.run(conv_problem)
# if not passed: return False
# TODO: If CUTLASS_UNIT_TEST_PROBLEM_COUNT is set reduce the the number of tested problem counts
if not passed:
return False
if interleaved:
return True

View File

@ -184,7 +184,7 @@ class TestbedGrouped:
arguments.sync()
#
# Reference check - TODO: support caching results
# Reference check
#
alpha = self.compute_type(alpha).value()
beta = self.compute_type(beta).value()

View File

@ -33,6 +33,7 @@
from time import sleep
import pycutlass
from pycutlass import *
import pycutlass.utils.datatypes as datatypes
import cutlass
from cuda import cudart
from cuda import cuda
@ -52,16 +53,22 @@ def transpose(layout):
return cutlass.ColumnMajorInterleaved32
def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout):
def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout, batch_offset: int = 0):
ptr = tensor.__array_interface__['data'][0]
if operand == "a":
tensor_coord = problem_size.mk()
batch_stride = problem_size.m() * problem_size.k()
elif operand == "b":
tensor_coord = problem_size.kn()
batch_stride = problem_size.k() * problem_size.n()
elif operand in ["c", "d"]:
tensor_coord = problem_size.mn()
batch_stride = problem_size.m() * problem_size.n()
else:
raise ValueError("unknonw operand: " + operand)
raise ValueError("Unknown operand: " + operand)
elt_size = DataTypeSizeBytes[datatypes.to_cutlass(tensor.dtype)]
ptr += batch_offset * batch_stride * elt_size
if layout == cutlass.RowMajor:
layout = cutlass.RowMajor.packed(tensor_coord)
@ -96,8 +103,8 @@ def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, opera
return getattr(cutlass, ref_name)(ptr, layout)
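The `batch_offset` handling in `getTensorRef` is plain pointer arithmetic: each batch occupies one full operand of elements, so the base address is advanced by `batch_offset * batch_stride * elt_size` bytes. The sketch below isolates that calculation; the helpers `batch_stride_elements` and `batch_pointer` are hypothetical and take integer extents instead of a `GemmCoord`.

```python
def batch_stride_elements(m, n, k, operand):
    """Number of elements in one batch of the given operand."""
    if operand == "a":
        return m * k
    elif operand == "b":
        return k * n
    elif operand in ("c", "d"):
        return m * n
    raise ValueError("Unknown operand: " + operand)


def batch_pointer(base_ptr, m, n, k, operand, elt_size_bytes, batch_offset=0):
    """Address of the first element of batch `batch_offset` of an operand."""
    stride = batch_stride_elements(m, n, k, operand)
    return base_ptr + batch_offset * stride * elt_size_bytes


# A is 4x16 with 2-byte elements: each batch spans 4 * 16 * 2 = 128 bytes
print(batch_pointer(0x1000, 4, 8, 16, "a", 2, batch_offset=0))  # 4096
print(batch_pointer(0x1000, 4, 8, 16, "a", 2, batch_offset=3))  # 4480
```

This assumes the batches are packed back to back in one flat buffer, which is how the test launcher allocates `tensor_A`, `tensor_B`, `tensor_C`, and `tensor_D` when running in batched mode.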
def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str):
tensor_ref = getTensorRef(tensor, problem_size, operand, layout)
def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str, batch_offset: int = 0):
tensor_ref = getTensorRef(tensor, problem_size, operand, layout, batch_offset)
if operand == "a":
tensor_coord = problem_size.mk()
@ -106,7 +113,7 @@ def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, oper
elif operand in ["c", "d"]:
tensor_coord = problem_size.mn()
else:
raise ValueError("unknonw operand: " + operand)
raise ValueError("Unknown operand: " + operand)
if layout == cutlass.RowMajor:
layout_tag = "RowMajor"
@ -168,7 +175,12 @@ class GemmUniversalLauncher:
# Compile the operator
#
pycutlass.compiler.add_module([operation, self.reduction_operation])
op_list = [operation]
if operation.arch < 90:
# Split K via Python is currently only supported for pre-SM90 kernels
op_list.append(self.reduction_operation)
pycutlass.compiler.add_module(op_list)
self.operation = operation
@ -206,8 +218,10 @@ class GemmUniversalLauncher:
def print_problem_size(self, p, mode, batch_count):
if mode == cutlass.gemm.Mode.Gemm:
mode = "Gemm"
elif mode == cutlass.gemm.Mode.Batched:
mode = "GemmBatched"
elif mode == cutlass.gemm.Mode.GemmSplitKParallel:
mode = "GemmSplitKParalel"
mode = "GemmSplitKParallel"
problem_size = "problem: %d, %d, %d\n batch_count: %d\n mode: %s" % (
p.m(), p.n(), p.k(), batch_count, mode)
print(problem_size)
@ -251,8 +265,7 @@ class GemmUniversalLauncher:
tensor_ref_B, reordered_tensor_ref_B, problem_size)
return reordered_tensor_B
def host_reference(self, problem_size, tensor_A, tensor_B, tensor_C, alpha, beta):
# TODO
def host_reference(self, problem_size, batch_count, tensor_A, tensor_B, tensor_C, alpha, beta):
tensor_D_ref = np.ones_like(tensor_C)
alpha = self.numpy_type(self.compute_type)(alpha)
beta = self.numpy_type(self.compute_type)(beta)
@ -262,42 +275,46 @@ class GemmUniversalLauncher:
beta = self.compute_type(beta).value()
init_acc = self.accumulator_type(init_acc).value()
if self.operation.switched:
tensor_ref_A = getTensorRef(
tensor_A, problem_size, "a", transpose(self.operation.B.layout))
tensor_ref_B = getTensorRef(
tensor_B, problem_size, "b", transpose(self.operation.A.layout))
tensor_ref_C = getTensorRef(
tensor_C, problem_size, "c", transpose(self.operation.C.layout))
tensor_ref_D_ref = getTensorRef(
tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout))
else:
tensor_ref_A = getTensorRef(
tensor_A, problem_size, "a", self.operation.A.layout)
tensor_ref_B = getTensorRef(
tensor_B, problem_size, "b", self.operation.B.layout)
tensor_ref_C = getTensorRef(
tensor_C, problem_size, "c", self.operation.C.layout)
tensor_ref_D_ref = getTensorRef(
tensor_D_ref, problem_size, "d", self.operation.C.layout)
for i in range(batch_count):
if self.operation.switched:
tensor_ref_A = getTensorRef(
tensor_A, problem_size, "a", transpose(self.operation.B.layout), batch_offset=i)
tensor_ref_B = getTensorRef(
tensor_B, problem_size, "b", transpose(self.operation.A.layout), batch_offset=i)
tensor_ref_C = getTensorRef(
tensor_C, problem_size, "c", transpose(self.operation.C.layout), batch_offset=i)
tensor_ref_D_ref = getTensorRef(
tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout), batch_offset=i)
else:
tensor_ref_A = getTensorRef(
tensor_A, problem_size, "a", self.operation.A.layout, batch_offset=i)
tensor_ref_B = getTensorRef(
tensor_B, problem_size, "b", self.operation.B.layout, batch_offset=i)
tensor_ref_C = getTensorRef(
tensor_C, problem_size, "c", self.operation.C.layout, batch_offset=i)
tensor_ref_D_ref = getTensorRef(
tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)
if self.math_operation in [MathOperation.multiply_add_saturate]:
cutlass.test.gemm.host.gemm_saturate(
problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
else:
cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
if self.math_operation in [MathOperation.multiply_add_saturate]:
cutlass.test.gemm.host.gemm_saturate(
problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
else:
cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
return tensor_D_ref
def equal(self, tensor_D, tensor_D_ref, problem_size):
def equal(self, tensor_D, tensor_D_ref, problem_size, batch_count):
for i in range(batch_count):
tensor_view_D = getTensorView(
tensor_D, problem_size, "d", self.operation.C.layout, batch_offset=i)
tensor_view_D_ref = getTensorView(
tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)
tensor_view_D = getTensorView(
tensor_D, problem_size, "d", self.operation.C.layout)
tensor_view_D_ref = getTensorView(
tensor_D_ref, problem_size, "d", self.operation.C.layout)
if not cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref):
return False
return cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref)
return True
def bytes(self, problem_size, batch_count=1, alpha=1.0, beta=0.0):
m = problem_size.m()
@ -321,9 +338,8 @@ class GemmUniversalLauncher:
n = problem_size.n()
k = problem_size.k()
flops_ = (m * n * k + m * n) * 2 * batch_count
flops_ = (m * n * k) * 2 * batch_count
# TODO: complex
return flops_
def run_cutlass_profiler(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):
@ -368,21 +384,25 @@ class GemmUniversalLauncher:
return runtime
def run(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):
def run(self, mode, problem_size, batch_count=1, split_k_slices=1, alpha=1.0, beta=0.0):
assert get_allocated_size(
) == 0, "%d bytes of pool memory were not released in the previous run" % get_allocated_size()
np.random.seed(self.seed)
# Assign an actual batch count in cases where we are not running in batched mode.
# This is to differentiate between the number of split K slices and the batch count,
# which are overloaded within the single `batch_count` variable.
true_batch_count = batch_count if mode == cutlass.gemm.Mode.Batched else 1
tensor_A = self.uniform_init(
size=(problem_size.m() * problem_size.k(),), dtype=self.dtype_A)
size=(problem_size.m() * problem_size.k() * true_batch_count,), dtype=self.dtype_A)
tensor_B = self.uniform_init(
size=(problem_size.n() * problem_size.k(),), dtype=self.dtype_B)
size=(problem_size.n() * problem_size.k() * true_batch_count,), dtype=self.dtype_B)
tensor_C = self.uniform_init(
size=(problem_size.m() * problem_size.n(),), dtype=self.dtype_C)
size=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_C)
tensor_D = np.zeros(
shape=(problem_size.m() * problem_size.n(),), dtype=self.dtype_D)
shape=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_D)
#
# Launch kernel
@ -392,14 +412,14 @@ class GemmUniversalLauncher:
operation=self.operation, problem_size=problem_size,
A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
output_op=self.operation.epilogue_type(alpha, beta),
gemm_mode=mode, split_k_slices=batch_count
gemm_mode=mode, split_k_slices=split_k_slices, batch=batch_count
)
if mode == cutlass.gemm.Mode.GemmSplitKParallel:
reduction_arguments = ReductionArguments(
self.reduction_operation, problem_size=[
problem_size.m(), problem_size.n()],
partitions=batch_count,
partitions=split_k_slices,
workspace=arguments.ptr_D,
destination=tensor_D,
source=tensor_C,
@ -419,8 +439,8 @@ class GemmUniversalLauncher:
else:
arguments.sync()
tensor_D_ref = self.host_reference(
problem_size, tensor_A, tensor_B, tensor_C, alpha, beta)
passed = self.equal(tensor_D, tensor_D_ref, problem_size)
problem_size, true_batch_count, tensor_A, tensor_B, tensor_C, alpha, beta)
passed = self.equal(tensor_D, tensor_D_ref, problem_size, true_batch_count)
try:
assert passed
@ -494,7 +514,7 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
if operation.A.layout in [cutlass.ColumnMajorInterleaved32, cutlass.RowMajorInterleaved32]:
interleavedk = 32
else:
raise ValueError("unknonw layout")
raise ValueError("Unknown layout")
if testcase == "interleaved":
modes = [cutlass.gemm.Mode.Gemm, ]
@ -515,14 +535,22 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
problem_beta = [0.0]
batch_counts = [1, ]
else: # universal
modes = [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.GemmSplitKParallel]
modes = [cutlass.gemm.Mode.Gemm]
batch_counts = [1, 2, 3, 5, 7]
if operation.arch < 90:
# Split K kernels via Python are currently only supported pre-SM90
modes.append(cutlass.gemm.Mode.GemmSplitKParallel)
problem_size_m = [alignment_m, 512 - 3 * alignment_m]
problem_size_n = [alignment_n, 512 - 2 * alignment_n]
if operation.tile_description.stages is None:
stages_for_k_calc = 7
else:
stages_for_k_calc = operation.tile_description.stages
problem_size_k = [
alignment_k,
threadblock_k * operation.tile_description.stages - alignment_k,
threadblock_k * operation.tile_description.stages * 3 - alignment_k]
batch_counts = [1, 2, 3, 5, 7]
threadblock_k * stages_for_k_calc - alignment_k,
threadblock_k * stages_for_k_calc * 3 - alignment_k]
problem_alpha = [1.0]
problem_beta = [2.0]
@ -543,8 +571,17 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
problem_size = cutlass.gemm.GemmCoord(m, n, k)
if operation.arch < 90:
split_k_slices = batch_count
else:
split_k_slices = 1
overridden_mode = mode
if mode == cutlass.gemm.Mode.Gemm and batch_count > 1:
overridden_mode = cutlass.gemm.Mode.Batched
passed = testbed.run(
mode, problem_size, batch_count, alpha, beta)
overridden_mode, problem_size, batch_count, split_k_slices, alpha, beta)
err, = cudart.cudaDeviceSynchronize()
if err != cuda.CUresult.CUDA_SUCCESS:
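
The hunks above separate two meanings that used to share the single `batch_count` variable: the number of batches and the number of split-K slices. The following is a minimal, self-contained sketch of that resolution logic, not the testbed itself. The `Mode` enum is a stand-in for `cutlass.gemm.Mode`, and `resolve_launch` and `operand_elements` are hypothetical helper names introduced only for illustration.

```python
from enum import Enum, auto


class Mode(Enum):
    """Stand-in for cutlass.gemm.Mode (assumption: only the members used here)."""
    Gemm = auto()
    GemmSplitKParallel = auto()
    Batched = auto()


def resolve_launch(mode, batch_count, arch):
    """Mirror how the test loop and run() interpret `batch_count` (hypothetical helper)."""
    # Pre-SM90, the loop value doubles as the number of split-K slices.
    split_k_slices = batch_count if arch < 90 else 1
    # A plain GEMM with more than one batch is launched in batched mode.
    overridden_mode = mode
    if mode == Mode.Gemm and batch_count > 1:
        overridden_mode = Mode.Batched
    # Only batched launches allocate one problem per batch.
    true_batch_count = batch_count if overridden_mode == Mode.Batched else 1
    return overridden_mode, split_k_slices, true_batch_count


def operand_elements(m, n, k, true_batch_count):
    """Element counts of the flat A, B, and C/D buffers (hypothetical helper)."""
    return (m * k * true_batch_count,
            n * k * true_batch_count,
            m * n * true_batch_count)
```

Under these assumptions, a split-K launch keeps `true_batch_count` at 1, so its operand buffers stay the size of a single problem even when several slices are requested.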


@ -0,0 +1,109 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import cutlass
from pycutlass import library, SubstituteTemplate
class Layout:
"""
Utility class to map transpose and non-transpose terminology to row- and column-major terminology
"""
T = cutlass.RowMajor
N = cutlass.ColumnMajor
class LayoutCombination:
"""
Utility class defining all combinations of row- and column-major layouts for operands to a GEMM
"""
NNN = (Layout.N, Layout.N, Layout.N)
NNT = (Layout.N, Layout.N, Layout.T)
NTN = (Layout.N, Layout.T, Layout.N)
NTT = (Layout.N, Layout.T, Layout.T)
TNN = (Layout.T, Layout.N, Layout.N)
TNT = (Layout.T, Layout.N, Layout.T)
TTN = (Layout.T, Layout.T, Layout.N)
TTT = (Layout.T, Layout.T, Layout.T)
def get_name(layouts, alignments, element_output,
element_accumulator, element_epilogue, cluster_shape,
threadblock_shape, stages, element_a, element_b, arch, opclass, suffix=""):
"""
Generates a procedural name for a test case.
:param layouts: indexable container of layouts of A, B, and C operands
:param alignments: indexable container of alignments of A, B, and C operands
:param element_output: data type of the output element
:param element_accumulator: data type used in accumulation
:param element_epilogue: data type used in computing the epilogue
:param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
:param threadblock_shape: indexable container of dimensions of threadblock tiles
:param stages: number of pipeline stages to use in the kernel
:type stages: int
:param element_a: data type of operand A
:param element_b: data type of operand B
:param arch: compute capability of kernel being generated
:type arch: int
:param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
:type opclass: cutlass.OpClass
:param suffix: additional string to append to the end of the name
:type suffix: str
:return: str
"""
name_format = 'test_SM${arch}_Device_Gemm_${eA}${lA}_${eB}${lB}_${eC}${lC}_${opclass}_${acc}_${tbM}x${tbN}x${tbK}_${cM}x${cN}x${cK}_${stages}_align${aA}-${aB}-${aC}${suffix}'
return SubstituteTemplate(name_format,
{
'arch': str(arch),
'eA': library.DataTypeNames[element_a],
'eB': library.DataTypeNames[element_b],
'eC': library.DataTypeNames[element_output],
'lA': library.ShortLayoutTypeNames[layouts[0]],
'lB': library.ShortLayoutTypeNames[layouts[1]],
'lC': library.ShortLayoutTypeNames[layouts[2]],
'opclass': library.OpcodeClassNames[opclass],
'acc': library.DataTypeNames[element_accumulator],
'cM': str(cluster_shape[0]),
'cN': str(cluster_shape[1]),
'cK': str(cluster_shape[2]),
'tbM': str(threadblock_shape[0]),
'tbN': str(threadblock_shape[1]),
'tbK': str(threadblock_shape[2]),
'stages': str(stages) if stages is not None else 'auto',
'aA' : str(alignments[0]),
'aB' : str(alignments[1]),
'aC' : str(alignments[2]),
'suffix': '' if suffix is None else suffix
}
)
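
To make the naming scheme in `get_name` concrete, here is a rough sketch that fills the same format string using Python's standard `string.Template` as a stand-in for `SubstituteTemplate`. The short names passed in (`'f16'`, `'t'`, `'n'`, `'tensorop'`) are illustrative placeholders for what the `library` lookup tables would return, and `sketch_get_name` is a hypothetical function, not part of PyCUTLASS.

```python
from string import Template

# Same format string as in get_name above, split across lines for readability.
NAME_FORMAT = (
    'test_SM${arch}_Device_Gemm_${eA}${lA}_${eB}${lB}_${eC}${lC}_'
    '${opclass}_${acc}_${tbM}x${tbN}x${tbK}_${cM}x${cN}x${cK}_'
    '${stages}_align${aA}-${aB}-${aC}${suffix}'
)


def sketch_get_name(layouts, alignments, element_output, element_accumulator,
                    cluster_shape, threadblock_shape, stages,
                    element_a, element_b, arch, opclass, suffix=""):
    """Hypothetical stand-in for get_name that takes already-resolved short names."""
    values = {
        'arch': str(arch),
        'eA': element_a, 'eB': element_b, 'eC': element_output,
        'lA': layouts[0], 'lB': layouts[1], 'lC': layouts[2],
        'opclass': opclass,
        'acc': element_accumulator,
        'cM': str(cluster_shape[0]), 'cN': str(cluster_shape[1]), 'cK': str(cluster_shape[2]),
        'tbM': str(threadblock_shape[0]), 'tbN': str(threadblock_shape[1]),
        'tbK': str(threadblock_shape[2]),
        # A stage count of None is rendered as 'auto', as in get_name.
        'stages': str(stages) if stages is not None else 'auto',
        'aA': str(alignments[0]), 'aB': str(alignments[1]), 'aC': str(alignments[2]),
        'suffix': '' if suffix is None else suffix,
    }
    return Template(NAME_FORMAT).substitute(values)


name = sketch_get_name(
    layouts=('t', 'n', 'n'), alignments=(8, 8, 4),
    element_output='f32', element_accumulator='f32',
    cluster_shape=(2, 1, 1), threadblock_shape=(128, 128, 32), stages=None,
    element_a='f16', element_b='f16', arch=90, opclass='tensorop')
print(name)
# test_SM90_Device_Gemm_f16t_f16n_f32n_tensorop_f32_128x128x32_2x1x1_auto_align8-8-4
```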


@ -0,0 +1,121 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
"""
Utility functions for converting between frontend datatypes and CUTLASS datatypes
"""
from typing import Union, Tuple
import cutlass
import pycutlass.library as library
try:
import numpy as np
numpy_available = True
except ImportError:
numpy_available = False
def numpy_to_cutlass(inp):
if numpy_available:
if inp == np.float16:
return cutlass.float16
elif inp == np.float32:
return cutlass.float32
elif inp == np.float64:
return cutlass.float64
elif inp == np.int8:
return cutlass.int8
elif inp == np.int32:
return cutlass.int32
return None
try:
import cupy as cp
cupy_available = True
cupy_to_cutlass_dict = {
cp.float16: cutlass.float16,
cp.float32: cutlass.float32,
cp.float64: cutlass.float64
}
except ImportError:
cupy_available = False
def cupy_to_cutlass(inp):
if cupy_available:
if inp == cp.float16:
return cutlass.float16
elif inp == cp.float32:
return cutlass.float32
elif inp == cp.float64:
return cutlass.float64
return None
try:
import torch
torch_available = True
torch_to_cutlass_dict = {
torch.half: cutlass.float16,
torch.float16: cutlass.float16,
torch.float: cutlass.float32,
torch.float32: cutlass.float32,
torch.double: cutlass.float64,
torch.float64: cutlass.float64
}
except ImportError:
torch_available = False
def torch_to_cutlass(inp):
if torch_available:
return torch_to_cutlass_dict.get(inp, None)
try:
import bfloat16
bfloat16_available = True
except ImportError:
bfloat16_available = False
def bfloat16_to_cutlass(inp):
if bfloat16_available:
if inp == bfloat16.bfloat16:
return cutlass.bfloat16
def to_cutlass(inp):
for cvt_fn in [bfloat16_to_cutlass, cupy_to_cutlass, numpy_to_cutlass, torch_to_cutlass]:
out = cvt_fn(inp)
if out is not None:
return out
raise Exception('No available conversion from type {} to a CUTLASS type.'.format(inp))
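
The conversion helpers above follow a simple pattern: each frontend is imported optionally, each converter returns `None` when it cannot handle the input, and `to_cutlass` takes the first non-`None` result. A minimal sketch of that pattern follows, under two stated assumptions: plain strings stand in for CUTLASS data types, and a second converter keyed on `struct` format characters is invented purely so the chain has a fallback that needs no third-party package. The function names are hypothetical.

```python
try:
    import numpy as np
    numpy_available = True
except ImportError:
    numpy_available = False


def numpy_to_name(inp):
    """Map a NumPy float type to a short name; None if NumPy is absent or the type is unknown."""
    if numpy_available:
        if inp == np.float16:
            return "f16"
        elif inp == np.float32:
            return "f32"
        elif inp == np.float64:
            return "f64"
    return None


def struct_char_to_name(inp):
    """Map a struct-module format character to a short name (illustrative fallback only)."""
    return {"e": "f16", "f": "f32", "d": "f64"}.get(inp, None)


def to_name(inp):
    """Try each converter in turn and return the first result that is not None."""
    for cvt_fn in [numpy_to_name, struct_char_to_name]:
        out = cvt_fn(inp)
        if out is not None:
            return out
    # The sketch raises TypeError; the original raises a plain Exception.
    raise TypeError("No available conversion from type {}.".format(inp))
```

Because every converter guards on its own availability flag, a missing optional package only removes one link from the chain rather than breaking the import of the module.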


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
from pycutlass.conv2d_operation import *
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_dgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
import pycutlass
from pycutlass.conv2d_operation import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass.test import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass.test import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
import pycutlass
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
import pycutlass
from pycutlass.conv2d_operation import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *


@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
import pycutlass
from pycutlass import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
import pycutlass
from pycutlass import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
import pycutlass
from pycutlass.conv2d_operation import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# test/unit/conv/device/conv2d_wgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
import pycutlass
from pycutlass import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
import unittest
from pycutlass.memory_manager import *

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
pushd $CUTLASS_PATH/examples/40_cutlass_py/customizable
python gemm.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 32 32 16 -s 4 -w 2 2 1 -cc 80 -la ColumnMajor -aa 1 -lb RowMajor -ab 1 -lc RowMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1

@ -1 +1,33 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
CUPY_CACHE_DIR=./ python test_frontend.py

@ -29,13 +29,15 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
## Test case for Pytorch
"""
Test cases for frontends
"""
import pycutlass
import unittest
from pycutlass import *
from pycutlass.utils.device import device_cc
import torch
import cupy as cp
class Test_Frontend(unittest.TestCase):
@ -49,9 +51,7 @@ class Test_Frontend(unittest.TestCase):
            cutlass.OpClass.Simt, MathOperation.multiply_add
        )
        # Stages > 2 is supported only for compute capability 80 and beyond
        stages = 4 if cc >= 80 else 2
        stages = 2
        tile_description = TileDescription(
            [128, 128, 8], stages, [2, 4, 1],
            math_inst
@ -84,6 +84,11 @@ class Test_Frontend(unittest.TestCase):
    def test_torch_frontend(self):
        try:
            import torch
        except:
            self.assertTrue(False, "Unable to import torch")
        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)
        tensor_A = torch.ceil(torch.empty(size=(problem_size.m(), problem_size.k()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
@ -111,6 +116,11 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(torch.equal(tensor_D, tensor_D_ref))
    def test_cupy_frontend(self):
        try:
            import cupy as cp
        except:
            self.assertTrue(False, "Unable to import cupy")
        cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)
@ -139,7 +149,6 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(cp.array_equal(tensor_D, tensor_D_ref))
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**32, 2**32)
    unittest.main()
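
The frontend tests above import `torch` and `cupy` inside each test and fail the test when the import does not succeed. As a point of comparison, here is a minimal sketch of a related pattern that skips the test instead of failing it. It is not what this commit does; `optional_import`, `FrontendSketch`, and the module name `no_such_frontend_pkg` are hypothetical names used only for illustration.

```python
import importlib
import unittest


def optional_import(name):
    """Return the imported module, or None if it is not installed (hypothetical helper)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


class FrontendSketch(unittest.TestCase):
    def test_missing_frontend(self):
        # "no_such_frontend_pkg" is a made-up module name that exercises the skip path.
        mod = optional_import("no_such_frontend_pkg")
        if mod is None:
            self.skipTest("no_such_frontend_pkg is not installed")
        self.fail("unreachable in this sketch")
```

Catching `ImportError` rather than using a bare `except:` keeps unrelated errors (for example a `KeyboardInterrupt`) from being reported as a missing package.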

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -92,5 +124,5 @@ class GemmBF16TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "multistage"))
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
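
For scale, the pool-size arguments that appear in these hunks can be converted to more familiar units. The sketch below assumes the two arguments to `get_memory_pool` are byte counts (initial and maximum pool size); that reading is an assumption and is not stated in this diff. `describe_pool_size` is a hypothetical helper.

```python
def describe_pool_size(exponent):
    """Return (bytes, MiB) for a pool size of 2**exponent bytes."""
    num_bytes = 2 ** exponent
    return num_bytes, num_bytes // 2 ** 20


# Exponents that appear in the test files touched by this commit.
for exponent in (24, 30, 32):
    num_bytes, mib = describe_pool_size(exponent)
    print(f"2**{exponent} = {num_bytes} bytes = {mib} MiB")
```

Under that assumption, the change from `2**24` to `2**30` raises the pool from 16 MiB to 1 GiB.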

@ -0,0 +1,138 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest
from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc
name_fn = partial(get_name, element_a=cutlass.bfloat16, element_b=cutlass.bfloat16, arch=90)
def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.
    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """
    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.bfloat16
        element_B = cutlass.bfloat16
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )
        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )
        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
        swizzling_functor = cutlass.IdentitySwizzle1
        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
        self.assertTrue(test_all_gemm(operation, "universal"))
    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""
    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)
    return run
@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmBF16Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass
add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [4, 4, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 5)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None, persistent=True)
add_test_simt(GemmBF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
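
The file above builds its test methods at import time: `add_test` defines a closure, gives it a generated name, and attaches it to an empty `unittest.TestCase` subclass with `setattr`, while `functools.partial` pre-binds `opclass`. The following is a minimal, dependency-free sketch of the same mechanism. The arithmetic operations, the `ArithmeticSketch` class, and the naming scheme are stand-ins chosen for illustration; the sketch assumes generated names must start with `test` so that `unittest` discovers them.

```python
import unittest
from functools import partial


def add_test(cls, a, b, expected, op):
    """Attach a generated test method to `cls` (toy stand-in for the GEMM add_test)."""

    def run(self):
        # The closure captures a, b, expected and op from the enclosing call.
        self.assertEqual(op(a, b), expected)

    # unittest only collects methods whose names start with "test".
    name = f"test_{op.__name__}_{a}_{b}"
    setattr(cls, name, run)
    return run


class ArithmeticSketch(unittest.TestCase):
    """Wrapper class; test methods are attached below."""


def add(x, y):
    return x + y


def mul(x, y):
    return x * y


# Pre-bind the keyword argument, mirroring add_test_tensorop / add_test_simt.
add_test_add = partial(add_test, op=add)
add_test_mul = partial(add_test, op=mul)

add_test_add(ArithmeticSketch, 2, 3, 5)
add_test_mul(ArithmeticSketch, 2, 3, 6)
```

Running `unittest.main()` on a module laid out this way would execute both generated methods as ordinary test cases.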

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -443,5 +475,5 @@ class GemmF16Sm80(unittest.TestCase):
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**24, 2**24)
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

@ -0,0 +1,182 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest
from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc
# Partial specialization for naming tests
name_fn = partial(get_name, element_a=cutlass.float16, element_b=cutlass.float16, arch=90)
def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.
    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """
    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.float16
        element_B = cutlass.float16
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )
        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )
        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
        swizzling_functor = cutlass.IdentitySwizzle1
        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
        self.assertTrue(test_all_gemm(operation, "universal"))
    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""
    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)
    return run
@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmF16Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass
add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
# Tests with 1x1x1 clusters
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], 5)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [2, 2, 2], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
# Tests with different cluster shapes
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 2, 1], [64, 128, 64], None)
# Tests for persistent warp-specialized threadblocks
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 2, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None, persistent=True)
add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 4, 1], [64, 128, 64], None, persistent=True)
# Tests using SIMT
add_test_simt(GemmF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)
add_test_simt(GemmF16Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 8], 2)
add_test_simt(GemmF16Sm90, LayoutCombination.NTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 8], 2)
add_test_simt(GemmF16Sm90, LayoutCombination.TTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 8], 2)
add_test_simt(GemmF16Sm90, LayoutCombination.NNT, [1, 1, 1], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 8], 2)
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
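
The registration pattern used in this file (a factory that builds a `run` closure, derives a name for it, and attaches it to a `unittest.TestCase` subclass with `setattr`, specialized through `functools.partial`) can be sketched without PyCUTLASS. The following is a minimal, self-contained illustration only: `ToyGemmTests`, the string op-class, and the naming scheme are hypothetical stand-ins, and the body of `run` is a placeholder for the real GEMM construction and verification.

```python
import unittest
from functools import partial


def add_test(cls, shape, stages, opclass, persistent=False):
    """Attach a generated test method to `cls` and return it."""

    def run(self):
        # Placeholder checks; the real tests build and verify a GEMM here.
        self.assertEqual(len(shape), 3)
        self.assertTrue(stages is None or stages > 0)

    suffix = "_persistent" if persistent else ""
    # Hypothetical naming scheme, used only for this illustration.
    # The name must start with "test" so that unittest collects it.
    name = "test_{}_{}_stages{}{}".format(
        opclass, "x".join(str(d) for d in shape), stages, suffix)
    setattr(cls, name, run)
    return run


class ToyGemmTests(unittest.TestCase):
    """Empty container; test methods are attached when the module is imported."""


# Fix the op-class once, then register individual configurations.
add_test_tensorop = partial(add_test, opclass="tensorop")
add_test_tensorop(ToyGemmTests, [128, 128, 32], 3)
add_test_tensorop(ToyGemmTests, [64, 128, 64], None, persistent=True)

generated = sorted(n for n in vars(ToyGemmTests) if n.startswith("test_"))

# Run the generated methods and collect the outcome without printing.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyGemmTests)
result = unittest.TestResult()
suite.run(result)
```

Because `partial` binds `opclass` by keyword, the remaining positional arguments keep the same order as in the `add_test` signature, which is why each registration line above stays short.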

View File

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.memory_manager import get_allocated_size

View File

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -98,5 +130,5 @@ class GemmF64TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "universal"))
if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

View File

@ -0,0 +1,124 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest
from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc
name_fn = partial(get_name, element_a=cutlass.float64, element_b=cutlass.float64, arch=90)
def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.float64
        element_B = cutlass.float64
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )
        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst
        )
        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
        swizzling_functor = cutlass.IdentitySwizzle1
        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
        self.assertTrue(test_all_gemm(operation, "universal"))

    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass)
    setattr(cls, name, run)
    return run

@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmF64Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass

add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
add_test_simt(GemmF64Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float64, cutlass.float64, cutlass.float64, [1, 1, 1], [64, 64, 32], 2)

if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
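
The `@unittest.skipIf(device_cc() < 90, ...)` decorator on the wrapper class means every dynamically attached test is reported as skipped, rather than failed, on older GPUs. A minimal sketch of that behaviour follows; `fake_device_cc` is a hypothetical stand-in for the real device query and simply pretends the GPU is SM80.

```python
import unittest


def fake_device_cc():
    # Hypothetical stand-in for a device query; pretends the GPU is SM80.
    return 80


@unittest.skipIf(fake_device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class NeedsSm90(unittest.TestCase):
    def test_placeholder(self):
        # Never executed here, because the whole class is skipped.
        self.fail("should not run on an SM80 device")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(NeedsSm90)
result = unittest.TestResult()
suite.run(result)

# Each entry of result.skipped is a (test, reason) pair.
skip_reasons = [reason for _test, reason in result.skipped]
```

Note that the condition is evaluated once, when the class is defined, so the device query runs at import time rather than per test.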

View File

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.test import *
@ -199,5 +231,5 @@ class GemmGroupedSm80(unittest.TestCase):
if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**26, 2**26)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

View File

@ -1,3 +1,35 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
from pycutlass import *
from pycutlass.epilogue import LinearCombinationClamp
@ -225,5 +257,5 @@ class GemmS8TensorOpF32Sm80(unittest.TestCase):
if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()

View File

@ -0,0 +1,154 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
from functools import partial
import pycutlass
from pycutlass import *
from pycutlass import library
from pycutlass.test import *
import unittest
from pycutlass.test.utils import LayoutCombination, get_name
from pycutlass.test.gemm_testbed import test_all_gemm
from pycutlass.utils.device import device_cc
name_fn = partial(get_name, element_a=cutlass.float16, element_b=cutlass.float16, arch=90)
def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
    """
    Create a test-running function with the given specification and set it as a method of `cls`.

    :param cls: class to which the generated method will be added
    :type cls: type
    :param layouts: indexable container of layouts of A, B, and C operands
    :param alignments: indexable container of alignments of A, B, and C operands
    :param element_output: data type of the output element
    :param element_accumulator: data type used in accumulation
    :param element_epilogue: data type used in computing the epilogue
    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
    :param threadblock_shape: indexable container of dimensions of threadblock tiles
    :param stages: number of pipeline stages to use in the kernel
    :type stages: int
    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
    :type opclass: cutlass.OpClass
    :param persistent: whether this is a persistent warp-specialized kernel
    :type persistent: bool
    """

    def run(self):
        """
        Dynamically-generated function that constructs a GEMM operation and verifies it against
        multiple test cases.
        """
        element_A = cutlass.int8
        element_B = cutlass.int8
        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
        math_inst = MathInstruction(
            instruction_shape=inst_shape,
            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
            opcode_class=opclass, math_operation=MathOperation.multiply_add
        )
        tile_description = TileDescription(
            threadblock_shape=threadblock_shape,
            cluster_shape=cluster_shape,
            stages=stages, warp_count=warp_count,
            math_instruction=math_inst,
            persistent=persistent
        )
        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
        if opclass == cutlass.OpClass.Simt:
            epilogue_functor_cls = LinearCombinationClamp
        else:
            epilogue_functor_cls = LinearCombination
        epilogue_functor = epilogue_functor_cls(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
        swizzling_functor = cutlass.IdentitySwizzle1
        operation = GemmOperationUniversal(
            arch=90, tile_description=tile_description, A=A, B=B, C=C,
            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
        self.assertTrue(test_all_gemm(operation, "universal"))

    if persistent:
        suffix = "_persistent"
    else:
        suffix = ""
    name = name_fn(layouts, alignments, element_output, element_accumulator,
                   element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
    setattr(cls, name, run)
    return run

@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class GemmS8Sm90(unittest.TestCase):
    """
    Wrapper class to which tests will be added dynamically in __main__
    """
    pass

add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
# Tests with 1x1x1 clusters
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNN, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], 3)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 8], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 64, 32], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [4, 4, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
# Tests with different cluster shapes
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 2, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 4, 1], [128, 128, 128], None)
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [4, 4, 1], [128, 128, 128], None)
# Tests with persistent warp-specialized threadblocks
add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 1, 1], [128, 128, 128], None, persistent=True)
# Tests for SIMT
add_test_simt(GemmS8Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 32, 8], 2)
if __name__ == '__main__':
    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
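
For the SIMT case, `add_test` above selects `LinearCombinationClamp` instead of `LinearCombination`. As a rough reference for the arithmetic involved, the sketch below assumes the usual definition of such an epilogue: compute `alpha * accumulator + beta * source` per element, then saturate to the range of the output type (here int8). This is a plain-Python illustration under that assumption, not the PyCUTLASS implementation; the function name and the default bounds are hypothetical, and rounding details of the real kernel are ignored.

```python
def linear_combination_clamp(acc, src, alpha, beta, lo=-128, hi=127):
    """Element-wise alpha * acc + beta * src, saturated to [lo, hi].

    Illustrative reference only; defaults correspond to the int8 range.
    """
    return [max(lo, min(hi, alpha * a + beta * c)) for a, c in zip(acc, src)]


# int32 accumulators combined with an int8 source tensor.
example = linear_combination_clamp([100, -200, 5], [50, 0, 1], alpha=1, beta=1)
# 150 saturates to 127, -200 saturates to -128, 6 is unchanged.
```

Without the clamp, writing an int32 accumulator such as 150 into an int8 output would wrap around instead of saturating, which is the situation this kind of epilogue is meant to avoid.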

View File

@ -1,8 +1,40 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import pycutlass
import unittest
if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**26, 2**26)
+    pycutlass.get_memory_pool(2**30, 2**30)
    loader = unittest.TestLoader()
    tests = loader.discover('./', 'gemm_*.py')
    testRunner = unittest.runner.TextTestRunner()