CUTLASS 3.3.0 (#1167)

* Release 3.3.0

Adds support for mixed-precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to the Python interface
Enhancements to sub-byte type handling in CuTe
Several other bug fixes and performance improvements.

* minor doc update
Pradeep Ramani
2023-11-02 08:09:05 -07:00
committed by GitHub
parent 922fb5108b
commit c008b4aea8
263 changed files with 16214 additions and 5008 deletions

python/LICENSE.txt Normal file

@ -0,0 +1,27 @@
Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@ -67,14 +67,13 @@ The CUTLASS Python interface currently supports the following operations:
* Grouped GEMM (for pre-SM90 kernels)
### Getting started
We recommend using the CUTLASS Python interface via one of the Docker images located in the [docker](/python/docker) directory.
We recommend using the CUTLASS Python interface via an [NGC PyTorch Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch):
```bash
docker build -t cutlass-cuda12.1:latest -f docker/Dockerfile-cuda12.1-pytorch .
docker run --gpus all -it --rm cutlass-cuda12.1:latest
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.08-py3
```
The CUTLASS Python interface has been tested with CUDA 11.8, 12.0, and 12.1 on Python 3.8.10 and 3.9.7.
The CUTLASS Python interface has been tested with CUDA 11.8, 12.0, and 12.1 on Python 3.8 and 3.9.
#### Optional environment variables
Prior to installing the CUTLASS Python interface, one may optionally set the following environment variables:
@ -82,19 +81,21 @@ Prior to installing the CUTLASS Python interface, one may optionally set the fol
* `CUDA_INSTALL_PATH`: the path to the installation of CUDA
If these environment variables are not set, the installation process will infer them to be the following:
* `CUTLASS_PATH`: one directory level above the current directory (i.e., `$(pwd)/..`)
* `CUTLASS_PATH`: either one directory level above the current directory (i.e., `$(pwd)/..`) if installed locally or in the `source` directory of the location in which `cutlass_library` was installed
* `CUDA_INSTALL_PATH`: the directory holding `/bin/nvcc` for the first version of `nvcc` on `$PATH` (i.e., `which nvcc | awk -F'/bin/nvcc' '{print $1}'`)
**NOTE:** The version of `cuda-python` installed must match the CUDA version in `CUDA_INSTALL_PATH`.
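As a rough sketch of that inference, the awk one-liner above amounts to stripping the `/bin/nvcc` suffix from the first `nvcc` found on `$PATH`. The helper name below is illustrative, not part of the package:

```python
def infer_cuda_install_path(nvcc_path: str) -> str:
    # Mirrors `which nvcc | awk -F'/bin/nvcc' '{print $1}'`:
    # everything before the trailing "/bin/nvcc" is the install root.
    suffix = "/bin/nvcc"
    if not nvcc_path.endswith(suffix):
        raise ValueError(f"unexpected nvcc location: {nvcc_path}")
    return nvcc_path[: -len(suffix)]

# In practice the input would come from shutil.which("nvcc");
# a fixed path is used here for illustration.
print(infer_cuda_install_path("/usr/local/cuda-12.1/bin/nvcc"))  # /usr/local/cuda-12.1
```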
#### Installation
The CUTLASS Python interface can currently be installed via:
The CUTLASS Python interface can currently be installed by navigating to the root of the CUTLASS directory and performing
```bash
python setup.py develop --user
pip install .
```
This will allow changes to the Python interface source to be reflected when using the Python interface.
We plan to add support for installing via `python setup.py install` in a future release.
If you would like to make changes to the CUTLASS Python interface and have them reflected when using the interface, instead run:
```bash
pip install -e .
```
### Examples
Jupyter notebook examples of using the CUTLASS Python interface are located in [examples/python](/examples/python).
@ -135,10 +136,7 @@ python setup_library.py develop --user
Alternatively, `cutlass_library` will automatically be installed if you install the CUTLASS Python interface package.
You can also use the [generator.py](/python/cutlass_library/generator.py) script directly without installing the module via:
```bash
python -m cutlass_library.generator
```
You can also use the [generator.py](/python/cutlass_library/generator.py) script directly without installing the module.
# Copyright


@ -37,14 +37,6 @@ import sys
import cutlass_library
def _cutlass_path_from_dir() -> str:
cutlass_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../../')
if not os.path.isdir(cutlass_path):
raise Exception(f'Environment variable "CUTLASS_PATH" is not defined, '
f'and default path of {cutlass_path} does not exist.')
return cutlass_path
def _cuda_install_path_from_nvcc() -> str:
import subprocess
# Attempt to detect CUDA_INSTALL_PATH based on location of NVCC
@ -60,66 +52,41 @@ def _cuda_install_path_from_nvcc() -> str:
return cuda_install_path
CUTLASS_PATH = os.getenv("CUTLASS_PATH", _cutlass_path_from_dir())
CUDA_INSTALL_PATH = os.getenv("CUDA_INSTALL_PATH", _cuda_install_path_from_nvcc())
CUTLASS_PATH = os.getenv("CUTLASS_PATH", cutlass_library.source_path)
# Alias CUTLASS_PATH as source_path
source_path = CUTLASS_PATH
_CUDA_INSTALL_PATH = None
def cuda_install_path():
"""
Helper method for on-demand fetching of the CUDA installation path. This allows
the import of CUTLASS to proceed even if NVCC is not available, preferring to
raise this error only when an operation that needs NVCC is being performed.
"""
global _CUDA_INSTALL_PATH
if _CUDA_INSTALL_PATH is None:
_CUDA_INSTALL_PATH = os.getenv("CUDA_INSTALL_PATH", _cuda_install_path_from_nvcc())
return _CUDA_INSTALL_PATH
CACHE_FILE = "compiled_cache.db"
# Import types/methods from the CUTLASS utility libraries for profiler generation/emission
from cutlass_library.library import (
ArchitectureNames,
ComplexTransform,
ComplexTransformTag,
ConvKind,
ConvKindNames,
ConvKindTag,
ConvMode,
from cutlass_library import (
DataType,
DataTypeNames,
DataTypeSize,
DataTypeTag,
EpilogueFunctor,
EpilogueScheduleSuffixes,
EpilogueScheduleTag,
EpilogueScheduleType,
GemmKind,
GemmKindNames,
GemmUniversalMode,
IteratorAlgorithm,
IteratorAlgorithmNames,
IteratorAlgorithmTag,
LayoutTag,
LayoutType,
KernelScheduleSuffixes,
KernelScheduleTag,
KernelScheduleType,
MathInstruction,
MathOperation,
MathOperationTag,
OpcodeClass,
OpcodeClassNames,
OpcodeClassTag,
OperationKind,
SharedMemPerCC,
ShortComplexLayoutNames,
ShortDataTypeNames,
ShortLayoutTypeNames,
SplitKMode,
StrideSupport,
StrideSupportNames,
StrideSupportTag,
SwizzlingFunctor,
SwizzlingFunctorTag,
TensorDescription,
TileDescription,
TileSchedulerSuffixes,
TileSchedulerTag,
TileSchedulerType,
get_complex_from_real,
)
this = sys.modules[__name__]
this.logger = logging.getLogger(__name__)
# RMM is only supported for Python 3.9+
this.use_rmm = (sys.version_info.major == 3 and sys.version_info.minor > 8) or sys.version_info.major > 3
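A note on this version check: comparing `sys.version_info` against a tuple expresses "Python 3.9 or newer" in one step and sidesteps the easy major/minor mix-up that the boolean form invites. A sketch, not the interface's code:

```python
import sys

# sys.version_info is a tuple-like (major, minor, micro, ...), so ordinary
# tuple comparison handles both "3.9+" and any future major version.
use_rmm = sys.version_info >= (3, 9)
```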
def set_log_level(level: int):
"""
Sets the log level
@ -134,11 +101,20 @@ set_log_level(logging.ERROR)
from cutlass.library_defaults import OptionRegistry
from cutlass.backend.utils.device import device_cc
this.option_registry = OptionRegistry(device_cc())
this._option_registry = None
def get_option_registry():
"""
Helper method for on-demand initialization of the options registry. This avoids building
the registry when CUTLASS is imported.
"""
if this._option_registry is None:
this.logger.info("Initializing option registry")
this._option_registry = OptionRegistry(device_cc())
return this._option_registry
this.__version__ = '3.2.1'
this.__version__ = '3.3.0'
from cutlass.backend import get_memory_pool
from cutlass.backend import create_memory_pool
from cutlass.emit.pytorch import pytorch
from cutlass.op.gemm import Gemm
from cutlass.op.conv import Conv2d, Conv2dFprop, Conv2dDgrad, Conv2dWgrad
@ -146,4 +122,58 @@ from cutlass.op.gemm_grouped import GroupedGemm
from cutlass.op.op import OperationBase
from cutlass.backend.evt.ir.tensor import Tensor
get_memory_pool(init_pool_size=2 ** 30, max_pool_size=2 ** 32)
this.memory_pool = None
def get_memory_pool():
""""
Helper method for on-demand memory pool. This avoids allocating the memory pool unnecessarily
whe CUTLASS is imported.
"""
if this.use_rmm and this.memory_pool is None:
this.memory_pool = create_memory_pool(init_pool_size=2 ** 30, max_pool_size=2 ** 32)
return this.memory_pool
from cuda import cuda
this._context = None
this._device_id = None
def initialize_cuda_context():
if this._device_id is not None:
return
if this.use_rmm:
# This also covers initializing the CUDA context
get_memory_pool()
device_id = os.getenv("CUTLASS_CUDA_DEVICE_ID")
if device_id is None:
if not this.use_rmm:
# We must manually call cuInit in the absence of RMM
err, = cuda.cuInit(0)
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(f"cuInit failed with error {err}")
err, device_count = cuda.cuDeviceGetCount()
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(f"cuDeviceGetCount failed with error {err}")
if device_count <= 0:
raise Exception("No CUDA devices found")
device_id = 0
this._device_id = device_id
if not this.use_rmm and this._context is None:
# We must manually initialize the context in the absence of RMM
err, device = cuda.cuDeviceGet(this._device_id)
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(f"cuDeviceGet failed with error {err}")
err, this._context = cuda.cuCtxCreate(0, device)
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(f"cuCtxCreate failed with error {err}")
def device_id() -> int:
initialize_cuda_context()
return this._device_id


@ -6,17 +6,11 @@ from cutlass.backend.epilogue import *
from cutlass.backend.frontend import *
from cutlass.backend.gemm_operation import *
from cutlass.backend.library import *
from cutlass.backend.memory_manager import PoolMemoryManager
from cutlass.backend.memory_manager import PoolMemoryManager, create_memory_pool
from cutlass.backend.operation import *
from cutlass.backend.reduction_operation import *
from cutlass.backend.type_hint import *
from cutlass.backend.utils import *
from cutlass.backend.utils.device import device_cc
from cutlass.backend.utils.software import (
CheckPackages,
SubstituteTemplate,
device_sm_count,
get_memory_pool,
)
compiler = ArtifactManager()


@ -36,16 +36,10 @@ from typing import Union
from cuda import cuda, cudart
import numpy as np
import cutlass
from cutlass.backend.frontend import CupyFrontend, NumpyFrontend, TorchFrontend
from cutlass.backend.utils.software import CheckPackages
torch_available = CheckPackages().check_torch()
if torch_available:
import torch
cupy_available = CheckPackages().check_cupy()
if cupy_available:
import cupy as cp
from cutlass.backend.memory_manager import DevicePtrWrapper
from cutlass.utils.datatypes import is_cupy_tensor, is_numpy_tensor, is_torch_tensor
class ArgumentBase:
@ -76,7 +70,7 @@ class ArgumentBase:
self.ptr_A = self.tensor_to_ptr(A, "A")
self.ptr_B = self.tensor_to_ptr(B, "B")
self.ptr_C = self.tensor_to_ptr(C, "C")
self.ptr_D = self.tensor_to_ptr(D, "D", True)
self.ptr_D = self.tensor_to_ptr(D, "D", is_output=True)
if C is not None:
if not isinstance(C, cuda.CUdeviceptr):
self.tensor_c_numel = prod(C.shape)
@ -88,18 +82,18 @@ class ArgumentBase:
"""
if tensor is None:
return cuda.CUdeviceptr(0)
if isinstance(tensor, np.ndarray):
if is_numpy_tensor(tensor):
if is_output:
assert name
self.buffers[name] = NumpyFrontend.argument(tensor, is_output)
if is_output:
self.host_tensors[name] = tensor
return self.buffers[name].ptr
elif torch_available and isinstance(tensor, torch.Tensor):
elif is_torch_tensor(tensor):
return TorchFrontend.argument(tensor)
elif isinstance(tensor, cuda.CUdeviceptr):
return tensor
elif cupy_available and isinstance(tensor, cp.ndarray):
elif is_cupy_tensor(tensor):
return CupyFrontend.argument(tensor)
else:
raise TypeError("Unsupported frontend; supported tensor types are numpy.ndarray, torch.Tensor, cupy.ndarray, and cuda.CUdeviceptr")
@ -119,3 +113,23 @@ class ArgumentBase:
)
if err != cuda.CUresult.CUDA_SUCCESS:
raise RuntimeError("CUDA Error %s" % str(err))
self.free()
def free(self):
"""
Frees allocated device-side memory
"""
# Free any device memory allocated manually
if not cutlass.use_rmm:
for name, buf in self.buffers.items():
if isinstance(buf, DevicePtrWrapper):
err, = cudart.cudaFree(buf.ptr)
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"cudaFree failed with error {err}")
if hasattr(self, "workspace_buffer") and isinstance(self.workspace_buffer, DevicePtrWrapper):
err, = cudart.cudaFree(self.workspace_buffer.ptr)
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"cudaFree failed with error {err}")
del self.workspace_buffer


@ -32,7 +32,7 @@
import ctypes
from cutlass import (
from cutlass_library import (
DataType,
KernelScheduleType
)
@ -125,7 +125,7 @@ def get_mainloop_arguments_3x(
Returns the ctypes structure to be used for the 3.x kernel's mainloop parameters.
:param kernel_schedule: type of kernel schedule to be used in the mainloop
:type kerel_schedule: cutlass.KernelScheduleType
:type kernel_schedule: cutlass_library.KernelScheduleType
:param element_A: data type of operand A
:param element_B: data type of operand B
:param alignment_A: alignment of operand A
@ -166,25 +166,10 @@ def get_mainloop_arguments_3x(
args.ptr_A, args.stride_A, args.ptr_B, args.stride_B,
)
tma_alignment_bytes = 16
is_tma_aligned_A = ((DataTypeSizeBytes[element_A] * alignment_A) % tma_alignment_bytes) == 0
is_tma_aligned_B = ((DataTypeSizeBytes[element_B] * alignment_B) % tma_alignment_bytes) == 0
is_tma_aligned = is_tma_aligned_A and is_tma_aligned_B
if kernel_schedule == KernelScheduleType.Multistage:
return _MainloopArgumentsMultistage
elif kernel_schedule == KernelScheduleType.ScheduleAuto:
if is_tma_aligned:
return _MainloopArgumentsTma
else:
return _MainloopArgumentsMultistage
else:
if is_tma_aligned:
return _MainloopArgumentsTma
else:
raise Exception(f"Specified a kernel schedule using TMA ({kernel_schedule}), but "
"the provided data types and alignments are not properly aligned for "
"using TMA.")
# Currently all 3.x kernels (CpAsync and Tma) have the same argument structure.
# Should that become not the case, this is the place to return custom ctypes
# structures based on selected kernel schedule.
return _MainloopArgumentsTma
def get_gemm_arguments_3x(mainloop_arguments, epilogue_functor):


@ -38,12 +38,13 @@ import subprocess
import tempfile
from cuda import cuda, nvrtc
from cutlass_library import SubstituteTemplate
from cutlass import CACHE_FILE, CUDA_INSTALL_PATH, CUTLASS_PATH, logger
import cutlass
from cutlass import CACHE_FILE, CUTLASS_PATH, cuda_install_path, logger
from cutlass.backend.gemm_operation import GemmOperationUniversal
from cutlass.backend.library import ApiVersion
from cutlass.backend.utils.device import device_cc
from cutlass.backend.utils.software import SubstituteTemplate
IncludeTemplate = r"""#include "${include}"
"""
@ -316,7 +317,7 @@ class ArtifactManager:
# compile with nvcc
cmd_template = "${cuda_install_path}/bin/nvcc ${options} -cubin ${srcfile} -o ${tarfile}"
values = {
"cuda_install_path": CUDA_INSTALL_PATH,
"cuda_install_path": cuda_install_path(),
"options": compilation_options.get_str(),
"srcfile": temp_cu.name,
"tarfile": temp_cubin.name,
@ -336,7 +337,7 @@ class ArtifactManager:
cmd = SubstituteTemplate(
cmd_template,
{
"cuda_install_path": CUDA_INSTALL_PATH,
"cuda_install_path": cuda_install_path(),
"options": host_compilation_options.get_str(),
},
)
@ -356,18 +357,15 @@ class ArtifactManager:
Insert a new compiled device module
"""
include_paths = [
CUDA_INSTALL_PATH + "/include",
cuda_install_path() + "/include",
CUTLASS_PATH + "/include",
CUTLASS_PATH + "/tools/util/include",
CUTLASS_PATH + "/python/cutlass/cpp/include",
]
if device_cc() is not None:
arch = device_cc()
else:
# Find the maximum arch tag among the provided operations and compile for that target.
# Since we are compiling to .cubin files, only one architecture may be specified.
arch = max([op.arch for op in operations])
cutlass.initialize_cuda_context()
arch = device_cc()
host_compile_options = CompilationOptions(
self._nvcc_compile_options, arch, include_paths)
if compile_options is None:


@ -34,9 +34,10 @@ import ctypes
from typing import Union
from cuda import cuda
from cutlass_library import SubstituteTemplate
import numpy as np
from cutlass import (
from cutlass_library import (
ConvKindNames,
ConvKindTag,
DataTypeNames,
@ -71,13 +72,9 @@ from cutlass.backend.library import (
)
from cutlass.backend.memory_manager import device_mem_alloc
from cutlass.backend.operation import ExecutableOperation, LaunchConfiguration
from cutlass.backend.utils.datatypes import to_device_ptr
from cutlass.backend.utils.software import CheckPackages, SubstituteTemplate
from cutlass.backend.utils.device import to_device_ptr
from cutlass.shape import GemmCoord
if CheckPackages().check_torch():
import torch
class Conv2dArguments(ArgumentBase):
"""


@ -32,14 +32,15 @@
import ctypes
from cutlass_library import SubstituteTemplate
import numpy as np
from scipy.special import erf
from cutlass import DataType, DataTypeTag
from cutlass_library import DataType, DataTypeTag
from cutlass.backend.c_types import MatrixCoord_
from cutlass.backend.frontend import NumpyFrontend
from cutlass.backend.library import ActivationOp, ActivationOpTag
from cutlass.backend.utils.software import CheckPackages, SubstituteTemplate
from cutlass.utils.datatypes import is_numpy_tensor, is_torch_available, is_torch_tensor
dtype2ctype = {
DataType.f16: ctypes.c_uint16,
@ -49,8 +50,7 @@ dtype2ctype = {
DataType.s32: ctypes.c_int32
}
torch_available = CheckPackages().check_torch()
if torch_available:
if is_torch_available():
import torch
import torch.nn.functional as F
@ -59,11 +59,11 @@ def get_scalar(value):
"""
Returns a scalar value from a container (e.g., np.ndarray)
"""
if isinstance(value, np.ndarray):
if is_numpy_tensor(value):
if value.size != 1:
raise Exception("Scalars used in epilogue must be of size 1")
return value.reshape(-1)[0]
elif CheckPackages().check_torch() and isinstance(value, torch.Tensor):
elif is_torch_tensor(value):
if value.numel() != 1:
raise Exception("Scalars used in epilogue must be of size 1")
return value.reshape(-1)[0]
@ -353,9 +353,9 @@ class ActivationFunctor:
class ActivationMeta(type):
@classmethod
def __call__(cls, x, *args):
if isinstance(x, np.ndarray):
if is_numpy_tensor(x):
return cls.numpy(x, *args)
elif torch_available and isinstance(x, torch.Tensor):
elif is_torch_tensor(x):
return cls.torch(x, *args)
else:
raise NotImplementedError("Unsupported tensor type")


@ -34,7 +34,7 @@
Base class for Epilogue Visitor Emitter
"""
from cutlass import DataTypeTag
from cutlass_library import DataTypeTag
from cutlass.backend.evt.ir import TopoVisitorNode, DAGIR


@ -30,7 +30,7 @@
#
#################################################################################################
from cutlass import DataTypeTag
from cutlass_library import DataTypeSize, DataTypeTag
from cutlass.backend.evt.ir import (
# Load Node


@ -34,7 +34,7 @@
Emitter for Sm90 Epilogue Visitor
"""
from cutlass import DataTypeTag, EpilogueScheduleTag
from cutlass_library import DataTypeTag, EpilogueScheduleTag
from cutlass.backend import GemmOperationUniversal
from cutlass.backend.evt.backend.emitter_base import FusionCallbacks


@ -32,7 +32,7 @@
from pycute import product
from cutlass import DataTypeSize, DataTypeTag
from cutlass_library import DataTypeSize, DataTypeTag
from cutlass.backend.evt.ir import (
# Load Node
AccumulatorImpl,


@ -37,12 +37,13 @@ Epilogue Visitor interface for compiling, and running visitor-based epilogue.
import ctypes
from cuda import cuda
from cutlass_library import DataType
import numpy as np
from cutlass import DataType
from cutlass.backend.epilogue import EpilogueFunctorBase
import cutlass.backend.evt.backend
from cutlass.backend.frontend import TensorFrontend
from cutlass.utils.datatypes import is_numpy_tensor
class EpilogueFunctorVisitor(EpilogueFunctorBase):
@ -125,7 +126,7 @@ class EpilogueFunctorVisitor(EpilogueFunctorBase):
# The tensor frontend returns a device buffer for np.ndarray
# and device ptr for other frontends
buffer_or_ptr = TensorFrontend.argument(tensor, is_output)
if isinstance(tensor, np.ndarray):
if is_numpy_tensor(tensor):
# Remember the host tensor for later synchronization
setattr(self, f"{tensor_name}_buffer", buffer_or_ptr)
setattr(self, f"{tensor_name}_host", tensor)


@ -36,7 +36,7 @@ Base class for Python EVT Frontend
from typing import Union
from cutlass import DataType
from cutlass_library import DataType
from cutlass.backend.evt.ir import (
ComputeNode,
DAGIR,


@ -38,8 +38,9 @@ import ast
import inspect
import textwrap
from cutlass_library import DataType
import cutlass
from cutlass import DataType
from cutlass.backend.evt.frontend.frontend_base import EVTFrontendBase
from cutlass.backend.epilogue import relu
from cutlass.backend.library import FunctionalOp


@ -36,7 +36,8 @@ DAG IR used by Python EVT
import networkx as nx
from cutlass import DataType
from cutlass_library import DataType
from cutlass.backend.evt.ir.node import NodeBase
from cutlass.backend.utils import device_cc


@ -38,10 +38,10 @@ The layout Nodes change the layout of intermediate nodes in epilogue visitor gra
from copy import deepcopy
from cutlass_library import LayoutType
from pycute import product, flatten
import cutlass
from cutlass import LayoutType
from cutlass.backend.evt.ir.layout_algorithm import _list_to_tuple, _tuple_to_list
from cutlass.backend.evt.ir.node import NodeBase
from cutlass.backend.evt.ir.tensor import Tensor


@ -37,7 +37,8 @@ Base & visitor classes of DAGIR Nodes
import ctypes
from re import sub
from cutlass import LayoutType
from cutlass_library import LayoutType
from cutlass.backend.evt.ir.layout_algorithm import _list_to_tuple, _reverse_tuple
from cutlass.backend.evt.ir.tensor import Tensor


@ -36,7 +36,8 @@ Store node and implementations
import ctypes
from cutlass import DataType
from cutlass_library import DataType
from cutlass.backend.c_types import tuple_factory
from cutlass.backend.epilogue import dtype2ctype, to_ctype_value
from cutlass.backend.evt.ir.node import NodeBase, ImplBase, NoOpImpl


@ -34,7 +34,7 @@
High-level class for tensor
"""
from cutlass import LayoutType
from cutlass_library import LayoutType
from cutlass.backend.evt.ir.layout_algorithm import (
Layout,


@ -32,9 +32,9 @@
import subprocess
from cutlass_library import DataTypeTag
import pydot
from cutlass import DataTypeTag
from cutlass.backend.evt.ir.dag_ir import DAGIR


@ -42,7 +42,6 @@ from cutlass.backend.evt.ir import ComputeNode, StoreNode
from cutlass.backend.evt.passes.pass_manager import EVTPassBase
class PassPreprocessRed(EVTPassBase):
"""
Preprocess red nodes


@ -34,6 +34,7 @@
Compute the shared memory size in bytes
"""
import cutlass_library
from pycute import shape_div, product
import cutlass
@ -56,10 +57,13 @@ class GetSmemSize:
def sm90_epilogue_tile(self, tile_description):
# Get the epilogue tile size
schedule = tile_description.epilogue_schedule
if schedule == cutlass.EpilogueScheduleType.TmaWarpSpecialized:
if schedule == cutlass_library.EpilogueScheduleType.TmaWarpSpecialized:
epilogue_tile_mn = (64, 32)
elif schedule == cutlass.EpilogueScheduleType.TmaWarpSpecializedCooperative:
epilogue_tile_mn = (128, 32)
elif schedule == cutlass_library.EpilogueScheduleType.TmaWarpSpecializedCooperative:
if tile_description.threadblock_shape[0] >= 128:
epilogue_tile_mn = (128, 32)
else:
epilogue_tile_mn = (64, 32)
else:
raise NotImplementedError(f"Unsupported schedule: {schedule}")


@ -34,15 +34,7 @@ from cuda import cuda
import numpy as np
from cutlass.backend.memory_manager import device_mem_alloc, todevice
from cutlass.backend.utils.software import CheckPackages
torch_available = CheckPackages().check_torch()
if torch_available:
import torch
cupy_available = CheckPackages().check_cupy()
if cupy_available:
import cupy as cp
from cutlass.utils.datatypes import is_cupy_tensor, is_numpy_tensor, is_torch_tensor
class NumpyFrontend:
@ -97,6 +89,7 @@ class CupyFrontend:
def argument(cupy_ndarray: "cp.ndarray"):
return cuda.CUdeviceptr(int(cupy_ndarray.data.ptr))
class TensorFrontend:
"""
Universal frontend for client-provided tensors
@ -104,11 +97,11 @@ class TensorFrontend:
@staticmethod
def argument(tensor, is_output=False):
if isinstance(tensor, np.ndarray):
if is_numpy_tensor(tensor):
return NumpyFrontend.argument(tensor, is_output)
elif torch_available and isinstance(tensor, torch.Tensor):
elif is_torch_tensor(tensor):
return TorchFrontend.argument(tensor)
elif cupy_available and isinstance(tensor, cp.ndarray):
elif is_cupy_tensor(tensor):
return CupyFrontend.argument(tensor)
else:
raise NotImplementedError("Unknown Tensor Type")


@ -35,10 +35,10 @@ import ctypes
import enum
from cuda import cuda, cudart
from cutlass_library import SubstituteTemplate
import numpy as np
import rmm
from cutlass import (
from cutlass_library import (
ComplexTransformTag,
DataType,
DataTypeNames,
@ -96,11 +96,7 @@ from cutlass.backend.library import (
from cutlass.backend.memory_manager import device_mem_alloc, todevice
from cutlass.backend.operation import ExecutableOperation, LaunchConfiguration
from cutlass.backend.type_hint import GemmOperation, Tensor
from cutlass.backend.utils.software import (
CheckPackages,
SubstituteTemplate,
device_sm_count,
)
from cutlass.backend.utils.device import device_sm_count
from cutlass.shape import GemmCoord, MatrixCoord
@ -163,7 +159,7 @@ class GemmArguments2x(ArgumentBase):
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param gemm_mode: GEMM mode
:type gemm_mode: :class:`cutlass.GemmUniversalMode`
:type gemm_mode: :class:`cutlass_library.GemmUniversalMode`
:param output_op: output operator, optional
:type output_op: :class:`cutlass.backend.LinearCombinationFunctorArguments`
@ -387,7 +383,7 @@ class GemmArguments2xStreamK(GemmArguments2x):
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param gemm_mode: GEMM mode
:type gemm_mode: :class:`cutlass.GemmUniversalMode`
:type gemm_mode: :class:`cutlass_library.GemmUniversalMode`
:param output_op: output operator, optional
:type output_op: :class:`cutlass.backend.LinearCombinationFunctorArguments`
@ -426,9 +422,12 @@ class GemmArguments2xStreamK(GemmArguments2x):
def initialize(self):
# Get the host and device workspace
device_workspace_size = self.operation.rt_module.get_device_workspace_size(self)
device_workspace_size = self.operation.rt_module.get_device_workspace_size(
self,
device_sm_count(),
self.operation.rt_module.occupancy
)
if device_workspace_size > 0:
self.workspace_buffer = device_mem_alloc(device_workspace_size)
workspace_ptr = self.workspace_buffer.ptr
@ -626,7 +625,7 @@ def GemmArguments(operation, problem_size, A, B, C, D, gemm_mode=GemmUniversalMo
:type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
:param gemm_mode: GEMM mode
:type gemm_mode: :class:`cutlass.GemmUniversalMode`
:type gemm_mode: :class:`cutlass_library.GemmUniversalMode`
:param output_op: output operator, optional
:type output_op: :class:`cutlass.backend.LinearCombinationFunctorArguments`
@ -1038,6 +1037,11 @@ extern "C" {
typename GemmType::Params params(*args, device_sms, sm_occupancy);
return params.get_grid_dims();
}
uint64_t ${operation_name}_get_kernel_workspace_size(GemmType::Arguments* args, int device_sms, int sm_occupancy) {
typename GemmType::Params params(*args, device_sms, sm_occupancy);
return params.get_workspace_size();
}
}
"""
@ -1045,6 +1049,7 @@ extern "C" {
super(GemmRTUniversalStreamK, self).__init__(operation)
self.extra_funcs = {
"get_grid_shape": GemmCoord_,
"get_kernel_workspace_size": ctypes.c_uint64,
}
self._occupancy = None
self.argument_type, self.epilogue_type = get_gemm_arguments_streamk(operation.epilogue_functor)
@ -1062,6 +1067,9 @@ extern "C" {
f"{cuda.cuGetErrorString(err)[1]}")
return self._occupancy
def get_device_workspace_size(self, arguments: GemmArguments2xStreamK, device_sms: int, sm_occupancy: int):
return self.get_kernel_workspace_size(ctypes.byref(arguments.get_arguments()), device_sms, sm_occupancy)
################################################################################
# Runtime module for GEMM Universal within CUTLASS 3
@ -1431,7 +1439,7 @@ ${operation_name}(${operation_name}${operation_suffix}::Params params) {
problem_info_array = bytearray(problem_info.contents)
# copy to device memory
return rmm.DeviceBuffer.to_device(problem_info_array).ptr
return todevice(problem_info_array).ptr
def plan(self, arguments):
return LaunchConfiguration(
@ -1537,10 +1545,6 @@ class GemmOperationBase:
return err
def free(self):
if hasattr(self, "workspace_buffer"):
del self.workspace_buffer
def is_complex(self):
complex_operators = [
MathOperation.multiply_add_complex,
@ -1627,7 +1631,7 @@ class GemmOperationBase:
element_b=DataTypeNames[self.B.element],
element_acc=DataTypeNames[self.tile_description.math_instruction.element_accumulator],
element_c=DataTypeNames[self.C.element],
element_d=DataTypeNames[self.C.element],
element_d=DataTypeNames[self.epilogue_functor.element_output],
core_name=self.core_name())
return extended_name


@ -36,7 +36,7 @@ Common data types and string names/tags for them
import enum
from cutlass import (
from cutlass_library import (
ComplexTransform,
DataType,
DataTypeSize,
@ -94,18 +94,6 @@ class DataTypeSizeBytes:
return bits // 8
SharedMemPerCC = {
70: 96 << 10, # 96KB of SMEM
72: 96 << 10, # 96KB of SMEM
75: 64 << 10, # 64KB of SMEM
80: 160 << 10, # 164KB of SMEM - 4KB reserved for the driver
86: 100 << 10, # 100KB of SMEM
87: 160 << 10, # 164KB of SMEM - 4KB reserved for the driver
89: 100 << 10, # 100KB of SMEM
90: 227 << 10, # 228KB of SMEM - 1KB reserved for the driver
}
class SchedulerMode(enum.Enum):
Device = enum_auto()
Host = enum_auto()
@ -277,11 +265,11 @@ class TileDescription:
:type math_instruction: MathInstruction
:param cluster_shape: number of threadblocks in the [X, Y, Z] dimensions of a threadblock cluster
:param kernel_schedule: type of kernel schedule to use (only available for SM90+)
:type kernel_schedule: cutlass.KernelScheduleType
:type kernel_schedule: cutlass_library.KernelScheduleType
:param epilogue_schedule: type of epilogue schedule to use (only available for SM90+)
:type epilogue_schedule: cutlass.EpilogueScheduleType
:type epilogue_schedule: cutlass_library.EpilogueScheduleType
:param tile_scheduler: type of tile scheduler to use (only available for SM90+)
:type tile_scheduler: cutlass.TileSchedulerType
:type tile_scheduler: cutlass_library.TileSchedulerType
"""
if ((kernel_schedule is None and epilogue_schedule is not None) or
(kernel_schedule is not None and epilogue_schedule is None)):
@ -413,7 +401,10 @@ class TensorDescription:
def __init__(self, element, layout, alignment=1, complex_transform=ComplexTransform.none):
self.element = element
self.layout = layout
self.alignment = min(128 // DataTypeSize[self.element], alignment)
if element != DataType.void:
self.alignment = min(128 // DataTypeSize[self.element], alignment)
else:
self.alignment = alignment
self.complex_transform = complex_transform
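The change above caps the requested alignment at one 128-bit access worth of elements, except when the element type is void, where the requested alignment is kept as-is. A minimal sketch of that cap (the helper name `capped_alignment` is hypothetical; element sizes are in bits, mirroring `DataTypeSize`):

```python
# Hypothetical standalone version of the alignment cap used by TensorDescription:
# at most 128 bits can be moved per vectorized access, so the element count
# per access bounds the usable alignment.
def capped_alignment(element_bits: int, requested: int) -> int:
    return min(128 // element_bits, requested)

print(capped_alignment(16, 8))  # fp16: 8 elements fill a 128-bit load
print(capped_alignment(32, 8))  # fp32: capped down to 4
```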
@@ -473,9 +464,9 @@ def api_version(arch, opclass, dtype):
:param arch: compute capability of device on which to run
:type arch: int
:param opclass: class of the operation being performed
:type opclass: cutlass.OpcodeClass
:type opclass: cutlass_library.OpcodeClass
:param dtype: data type to be used in operation (assumes that ElementA and ElementB are the same)
:type dtype: cutlass.DataType
:type dtype: cutlass_library.DataType
:return: API version to be used in code emission
:rtype: ApiVersion


@@ -31,7 +31,14 @@
#################################################################################################
import numpy as np
import rmm
import cutlass
from cutlass.utils.datatypes import is_numpy_tensor
if cutlass.use_rmm:
import rmm
else:
from cuda import cudart
class PoolMemoryManager:
@@ -44,31 +51,70 @@ class PoolMemoryManager:
self.mr = rmm.mr.TrackingResourceAdaptor(self.pool)
rmm.mr.set_current_device_resource(self.mr)
def get_allocated_size(self):
return self.mr.get_allocated_bytes()
def pool_size(self):
return self.pool.pool_size()
class DevicePtrWrapper:
"""
Wrapper around a pointer to device memory to provide a uniform interface with the RMM DeviceBuffer
(at least in terms of the interface used by the CUTLASS Python interface)
"""
def __init__(self, dev_ptr):
self.dev_ptr = dev_ptr
@property
def ptr(self):
return self.dev_ptr
def _todevice(host_data):
"""
Helper for transferring host data to device memory
"""
if cutlass.use_rmm:
return rmm.DeviceBuffer.to_device(host_data.tobytes())
else:
nbytes = len(host_data.tobytes())
dev_ptr_wrapper = device_mem_alloc(nbytes)
err, = cudart.cudaMemcpy(
dev_ptr_wrapper.ptr,
host_data.__array_interface__['data'][0],
nbytes,
cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
)
if err != cudart.cudaError_t.cudaSuccess:
raise Exception(f"cudaMemcpy failed with error {err}")
return dev_ptr_wrapper
def todevice(host_data, dtype=np.float32):
"""
Pass the host_data to device memory
"""
if isinstance(host_data, list):
return rmm.DeviceBuffer.to_device(np.array(host_data, dtype=dtype).tobytes())
elif isinstance(host_data, np.ndarray):
return rmm.DeviceBuffer.to_device(host_data.tobytes())
return _todevice(np.array(host_data, dtype=dtype))
elif is_numpy_tensor(host_data):
return _todevice(host_data)
def device_mem_alloc(size):
return rmm.DeviceBuffer(size=size)
if cutlass.use_rmm:
return rmm.DeviceBuffer(size=size)
else:
err, ptr = cudart.cudaMalloc(size)
if err != cudart.cudaError_t.cudaSuccess:
raise Exception(f"cudaMalloc failed with error {err}")
return DevicePtrWrapper(ptr)
def align_size(size, alignment=256):
return ((size + alignment - 1) // alignment) * alignment
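`align_size` above rounds a byte count up to the next multiple of `alignment` (256 B by default). A quick standalone check of the round-up formula:

```python
def align_size(size, alignment=256):
    # Round `size` up to the nearest multiple of `alignment`.
    return ((size + alignment - 1) // alignment) * alignment

print(align_size(1))    # -> 256
print(align_size(256))  # -> 256
print(align_size(257))  # -> 512
```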
def get_allocated_size():
device_resource = rmm.mr.get_current_device_resource()
return device_resource.get_allocated_bytes()
def create_memory_pool(init_pool_size=0, max_pool_size=2 ** 34):
if cutlass.use_rmm:
memory_pool = PoolMemoryManager(init_pool_size=init_pool_size, max_pool_size=max_pool_size)
return memory_pool
else:
return None


@@ -37,9 +37,15 @@ from cuda import __version__, cuda
from cutlass.backend.utils.device import device_cc
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
supports_cluster_launch = device_cc() >= 90 and (
_version_splits[0] > 11 or (_version_splits[0] == 11 and _version_splits[1] >= 8)
)
_supports_cluster_launch = None
def supports_cluster_launch():
global _supports_cluster_launch
if _supports_cluster_launch is None:
major, minor = _version_splits[0], _version_splits[1]
_supports_cluster_launch = device_cc() >= 90 and (major > 11 or (major == 11 and minor >= 8))
return _supports_cluster_launch
class LaunchConfiguration:
@@ -121,7 +127,7 @@ class ExecutableOperation:
packed = (ctypes.c_void_p * 1)()
packed[0] = ctypes.addressof(cArg)
if supports_cluster_launch:
if supports_cluster_launch():
return self.run_with_clusters(launch_config, packed, stream)
else:
return self.run_without_clusters(launch_config, packed, stream)


@@ -36,21 +36,22 @@ from typing import Union
from cuda import cuda, cudart
import numpy as np
from cutlass import (
from cutlass_library import (
DataTypeNames,
DataTypeSize,
DataTypeTag,
LayoutType
LayoutType,
SubstituteTemplate
)
import cutlass
from cutlass.backend.c_types import MatrixCoord_, TensorRef2D_, get_reduction_params
from cutlass.backend.frontend import NumpyFrontend, TorchFrontend
from cutlass.backend.library import TensorDescription
from cutlass.backend.memory_manager import DevicePtrWrapper
from cutlass.backend.operation import ExecutableOperation, LaunchConfiguration
from cutlass.backend.utils.software import CheckPackages, SubstituteTemplate
from cutlass.shape import MatrixCoord
if CheckPackages().check_torch():
import torch
from cutlass.utils.datatypes import is_numpy_tensor, is_torch_tensor
class ReductionOperation:
@@ -85,13 +86,13 @@ class ReductionArguments:
# number of split-k partitions
self.partitions = partitions
if isinstance(destination, np.ndarray):
if is_numpy_tensor(destination):
self.host_D = destination
self.destination_buffer = NumpyFrontend.argument(destination, True)
self.source_buffer = NumpyFrontend.argument(source, False)
self.ptr_destination = cuda.CUdeviceptr(self.destination_buffer.ptr)
self.ptr_source = cuda.CUdeviceptr(self.source_buffer.ptr)
elif CheckPackages().check_torch() and isinstance(destination, torch.Tensor):
elif is_torch_tensor(destination):
self.ptr_destination = TorchFrontend.argument(destination)
self.ptr_source = TorchFrontend.argument(source)
elif isinstance(destination, cuda.CUdeviceptr):
@@ -185,11 +186,22 @@ class ReductionArguments:
if err != cuda.CUresult.CUDA_SUCCESS:
raise RuntimeError("CUDA Error %s" % str(err))
self.free()
def free(self):
if hasattr(self, "destination_buffer"):
del self.destination_buffer
if hasattr(self, "source_buffer"):
del self.source_buffer
"""
Frees allocated device-side memory
"""
# Free any device memory allocated manually
if not cutlass.use_rmm:
for attr in ["destination_buffer", "source_buffer"]:
if hasattr(self, attr):
buf = getattr(self, attr)
if isinstance(buf, DevicePtrWrapper):
err, = cudart.cudaFree(buf.ptr)
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"cudaFree failed with error {err}")
del buf
class ReductionRT(ExecutableOperation):


@@ -30,11 +30,4 @@
#
################################################################################
from cutlass.backend.utils.datatypes import *
from cutlass.backend.utils.device import check_cuda_errors, device_cc
from cutlass.backend.utils.software import (
CheckPackages,
SubstituteTemplate,
device_sm_count,
get_memory_pool,
)


@@ -1,156 +0,0 @@
#################################################################################################
#
# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
"""
Utility functions for converting between frontend datatypes and CUTLASS datatypes
"""
from cuda import cuda
from cutlass import DataType
from cutlass.backend.utils.software import CheckPackages
numpy_available = CheckPackages().check_numpy()
if numpy_available:
import numpy as np
numpy_to_cutlass_dict = {
np.float16: DataType.f16,
np.float32: DataType.f32,
np.float64: DataType.f64,
np.int8: DataType.s8,
np.int32: DataType.s32,
np.dtype('float16'): DataType.f16,
np.dtype('float32'): DataType.f32,
np.dtype('float64'): DataType.f64,
np.dtype('int8'): DataType.s8,
np.dtype('int32'): DataType.s32,
}
def numpy_to_cutlass(inp):
numpy_available = CheckPackages().check_numpy()
if numpy_available:
return numpy_to_cutlass_dict.get(inp, None)
cupy_available = CheckPackages().check_cupy()
if cupy_available:
import cupy as cp
cupy_to_cutlass_dict = {
cp.float16: DataType.f16,
cp.float32: DataType.f32,
cp.float64: DataType.f64,
}
def cupy_to_cutlass(inp):
cupy_available = CheckPackages().check_cupy()
if cupy_available:
return cupy_to_cutlass_dict.get(inp, None)
torch_available = CheckPackages().check_torch()
if torch_available:
import torch
torch_to_cutlass_dict = {
torch.half: DataType.f16,
torch.float16: DataType.f16,
torch.float: DataType.f32,
torch.float32: DataType.f32,
torch.double: DataType.f64,
torch.float64: DataType.f64,
}
def torch_to_cutlass(inp):
if torch_available:
return torch_to_cutlass_dict.get(inp, None)
try:
import bfloat16
bfloat16_available = True
numpy_to_cutlass_dict[np.dtype(bfloat16.bfloat16)] = DataType.bf16
except ImportError:
bfloat16_available = False
def bfloat16_to_cutlass(inp):
if bfloat16_available:
if inp == bfloat16.bfloat16:
return DataType.bf16
def to_cutlass(inp):
for cvt_fn in [
bfloat16_to_cutlass,
cupy_to_cutlass,
numpy_to_cutlass,
torch_to_cutlass,
]:
out = cvt_fn(inp)
if out is not None:
return out
raise Exception(
"No available conversion from type {} to a CUTLASS type.".format(inp)
)
def to_device_ptr(tensor) -> cuda.CUdeviceptr:
"""
Converts a tensor to a CUdeviceptr
:param tensor: tensor to convert
:type tensor: np.ndarray | torch.Tensor | cp.ndarray | int
:return: device pointer
:rtype: cuda.CUdeviceptr
"""
if isinstance(tensor, np.ndarray):
ptr = cuda.CUdeviceptr(tensor.__array_interface__["data"][0])
elif torch_available and isinstance(tensor, torch.Tensor):
ptr = cuda.CUdeviceptr(tensor.data_ptr())
elif cupy_available and isinstance(tensor, cp.ndarray):
ptr = cuda.CUdeviceptr(int(tensor.data.ptr))
elif isinstance(tensor, cuda.CUdeviceptr):
ptr = tensor
elif isinstance(tensor, int):
ptr = cuda.CUdeviceptr(tensor)
else:
raise NotImplementedError(tensor)
return ptr


@@ -34,7 +34,10 @@
Utility functions for interacting with the device
"""
from cuda import cudart
from cuda import cuda, cudart
import cutlass
from cutlass.utils.datatypes import is_cupy_tensor, is_numpy_tensor, is_torch_tensor
def check_cuda_errors(result: list):
@@ -60,7 +63,7 @@ def check_cuda_errors(result: list):
return result[1:]
def device_cc(device: int = 0) -> int:
def device_cc(device: int = -1) -> int:
"""
Returns the compute capability of the device with ID `device`.
@@ -70,7 +73,51 @@ def device_cc(device: int = 0) -> int:
:return: compute capability of the queried device (e.g., 80 for SM80)
:rtype: int
"""
if device == -1:
device = cutlass.device_id()
deviceProp = check_cuda_errors(cudart.cudaGetDeviceProperties(device))
major = str(deviceProp.major)
minor = str(deviceProp.minor)
return int(major + minor)
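`device_cc` composes the compute-capability integer by string-concatenating the major and minor versions. A standalone sketch (the helper name `cc_from_major_minor` is hypothetical):

```python
def cc_from_major_minor(major: int, minor: int) -> int:
    # (8, 6) -> "8" + "6" -> 86, matching the SM naming convention.
    return int(str(major) + str(minor))

print(cc_from_major_minor(8, 6))  # -> 86
print(cc_from_major_minor(9, 0))  # -> 90
```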
def device_sm_count(device: int = -1):
if device == -1:
device = cutlass.device_id()
err, device_sm_count = cuda.cuDeviceGetAttribute(
cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, device
)
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(
"Failed to retrieve SM count. "
f"cuDeviceGetAttribute() failed with error: {cuda.cuGetErrorString(err)[1]}"
)
return device_sm_count
def to_device_ptr(tensor) -> cuda.CUdeviceptr:
"""
Converts a tensor to a CUdeviceptr
:param tensor: tensor to convert
:type tensor: np.ndarray | torch.Tensor | cp.ndarray | int
:return: device pointer
:rtype: cuda.CUdeviceptr
"""
if is_numpy_tensor(tensor):
ptr = cuda.CUdeviceptr(tensor.__array_interface__["data"][0])
elif is_torch_tensor(tensor):
ptr = cuda.CUdeviceptr(tensor.data_ptr())
elif is_cupy_tensor(tensor):
ptr = cuda.CUdeviceptr(int(tensor.data.ptr))
elif isinstance(tensor, cuda.CUdeviceptr):
ptr = tensor
elif isinstance(tensor, int):
ptr = cuda.CUdeviceptr(tensor)
else:
raise NotImplementedError(tensor)
return ptr


@@ -1,111 +0,0 @@
#################################################################################################
#
# Copyright (c) 2023 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
import re
import sys
from cutlass.backend.memory_manager import PoolMemoryManager
class CheckPackages:
def __init__(self) -> None:
pass
def check_cupy(self):
if "cupy" in sys.modules:
return True
else:
try:
import cupy
cupy_available = True
except ImportError:
print("cupy is not loaded.")
def check_numpy(self):
if "numpy" in sys.modules:
return True
else:
try:
import numpy
numpy_available = True
except ImportError:
print("numpy is not loaded.")
def check_torch(self):
if "torch" in sys.modules:
return True
else:
try:
import torch
torch_available = True
except ImportError:
print("torch is not loaded.")
def SubstituteTemplate(template, values):
text = template
changed = True
while changed:
changed = False
for key, value in values.items():
regex = "\\$\\{%s\\}" % key
newtext = re.sub(regex, value, text)
if newtext != text:
changed = True
text = newtext
return text
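`SubstituteTemplate` above loops until the text stops changing, so a substituted value that itself contains `${...}` placeholders gets expanded on a later pass. A self-contained check of that behavior:

```python
import re

def SubstituteTemplate(template, values):
    # Repeatedly replace ${key} with values[key] until a full pass changes nothing.
    text = template
    changed = True
    while changed:
        changed = False
        for key, value in values.items():
            regex = "\\$\\{%s\\}" % key
            newtext = re.sub(regex, value, text)
            if newtext != text:
                changed = True
                text = newtext
    return text

# The substituted value for "name" itself contains a ${dtype} placeholder.
print(SubstituteTemplate("${name}_sm${cc}",
                         {"name": "gemm_${dtype}", "dtype": "f16", "cc": "80"}))
# -> gemm_f16_sm80
```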
def device_sm_count():
from cuda import cuda
_device = 0
err, _device_sm_count = cuda.cuDeviceGetAttribute(
cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, _device
)
if err != cuda.CUresult.CUDA_SUCCESS:
raise Exception(
"Failed to retrieve SM count. "
f"cuDeviceGetAttribute() failed with error: {cuda.cuGetErrorString(err)[1]}"
)
return _device_sm_count
def get_memory_pool(init_pool_size=0, max_pool_size=2 ** 34):
memory_pool = PoolMemoryManager(
init_pool_size=init_pool_size, max_pool_size=max_pool_size
)
return memory_pool


@@ -39,7 +39,7 @@ Example usage with JIT compilation:
.. highlight:: python
.. code-block:: python
plan = cutlass.op.Gemm(element=torch.float32, layout=cutlass.LayoutType.RowMajor)
plan = cutlass.op.Gemm(element=torch.float32, layout=cutlass_library.LayoutType.RowMajor)
op = plan.construct()
mod = cutlass.emit.pytorch(op, 'cutlass_gemm', 80, jit=True)
@@ -81,15 +81,16 @@ The module can later be used in Python via:
import logging
import os
from cutlass import CUTLASS_PATH, logger, swizzle, ConvKind, ConvKindNames, DataType
from cutlass_library import ConvKind, ConvKindNames, DataType, SubstituteTemplate
from cutlass import CUTLASS_PATH, logger, swizzle
from cutlass.backend.gemm_operation import GemmOperationGrouped, GemmOperationUniversal
from cutlass.backend.conv2d_operation import Conv2dOperation
from cutlass.backend.library import ApiVersion
from cutlass.backend.utils.software import CheckPackages, SubstituteTemplate
from cutlass.emit import common
from cutlass.utils.datatypes import is_torch_available
torch_available = CheckPackages().check_torch()
if torch_available:
if is_torch_available():
import torch


@@ -36,10 +36,9 @@ Collection of builtin functions used for host reference in EVT
import numpy as np
from cutlass.backend.utils.software import CheckPackages
from cutlass.utils.datatypes import is_cupy_tensor, is_numpy_tensor, is_torch_available, is_torch_tensor
torch_available = CheckPackages().check_torch()
if torch_available:
if is_torch_available():
import torch
@@ -48,16 +47,16 @@ def multiply_add(x, y, z):
def sum(x, dim):
if isinstance(x, np.ndarray):
if is_numpy_tensor(x):
return x.sum(axis=tuple(dim))
elif torch_available and isinstance(x, torch.Tensor):
elif is_torch_tensor(x):
return torch.sum(x, dim)
def max(x, dim):
if isinstance(x, np.ndarray):
if is_numpy_tensor(x):
return x.max(axis=tuple(dim))
elif torch_available and isinstance(x, torch.Tensor):
elif is_torch_tensor(x):
return torch.amax(x, dim)
@@ -66,14 +65,14 @@ def max(x, dim):
##############################################################################
def permute(x, indices: tuple):
if isinstance(x, np.ndarray):
if is_numpy_tensor(x):
return np.transpose(x, axes=indices)
elif torch_available and isinstance(x, torch.Tensor):
elif is_torch_tensor(x):
return x.permute(*indices)
def reshape(x, new_shape: tuple):
if isinstance(x, np.ndarray):
if is_numpy_tensor(x):
return np.reshape(x, newshape=new_shape)
elif torch_available and isinstance(x, torch.Tensor):
elif is_torch_tensor(x):
return x.view(new_shape)


@@ -69,20 +69,23 @@ class KernelsForDataType:
"""
Add an operation to the list of supported kernels
"""
alignment = operation.A.alignment
if alignment not in self.kernels_by_alignment:
self.kernels_by_alignment[alignment] = []
self.kernels_by_alignment[alignment].append(operation)
alignment_key = f"{operation.A.alignment} {operation.B.alignment} {operation.C.alignment}"
if alignment_key not in self.kernels_by_alignment:
self.kernels_by_alignment[alignment_key] = []
self.kernels_by_alignment[alignment_key].append(operation)
@property
def alignments(self):
def alignments(self, operand: str):
"""
Returns an unsorted list of alignments supported by this data type combination
:param operand: identifier of operand in question (e.g., A, B, C)
:type operand: str
:return: unsorted list of alignments supported by this data type combination
:rtype: list
"""
return list(self.kernels_by_alignment.keys())
operand_idx = self._operand_idx(operand)
return [int(key.split(" ")[operand_idx]) for key in self.kernels_by_alignment.keys()]
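The reworked bookkeeping above keys kernels by a space-separated `"A B C"` alignment string and recovers per-operand values by splitting on the operand index. A toy illustration (the dictionary contents and function form are hypothetical):

```python
# Hypothetical registry: each key encodes "alignA alignB alignC".
kernels_by_alignment = {"8 8 8": ["op0"], "4 4 8": ["op1"]}

def alignments(operand_idx: int):
    # Extract the alignment of one operand from every registered key.
    return [int(key.split(" ")[operand_idx]) for key in kernels_by_alignment]

print(sorted(alignments(0)))  # operand A -> [4, 8]
print(sorted(alignments(2)))  # operand C -> [8, 8]
```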
@property
def all_operations(self):
@@ -97,24 +100,48 @@ class KernelsForDataType:
ops.extend(alignment_ops)
return ops
def operations(self, alignment: int):
"""
Returns operations satisfying the alignment constraint indicated by `alignment`
def default_operation(self):
key = sorted(list(self.kernels_by_alignment.keys()))[0]
return self.kernels_by_alignment[key][0]
:param alignment: alignment constraint of operations to return
:type alignment: int
def operations(self, alignment_A: int, alignment_B: int, alignment_C: int):
"""
Returns operations satisfying the alignment constraints
:param alignment_A: alignment constraint on operand A of the operations to return
:type alignment_A: int
:param alignment_B: alignment constraint on operand B of the operations to return
:type alignment_B: int
:param alignment_C: alignment constraint on operand C of the operations to return
:type alignment_C: int
:return: list of operations
:rtype: list
"""
if alignment not in self.kernels_by_alignment:
raise Exception(
f"No operations of alignment {alignment} found for data type and layout "
f"combination {self.datatype_comb} {self.layout_comb}"
)
return self.kernels_by_alignment[alignment]
key = f"{alignment_A} {alignment_B} {alignment_C}"
def find_alignment(self, shape: tuple, layout: cutlass.LayoutType) -> int:
if key not in self.kernels_by_alignment:
og_key = key
# Reconcile A, B, and C alignments by trying to align to the minimum
min_alignment = min(alignment_A, alignment_B, alignment_C)
key = f"{min_alignment} {min_alignment} {min_alignment}"
if key not in self.kernels_by_alignment:
raise Exception(
f"No operations of alignment {og_key} found for data type and layout "
f"combination {self.datatype_comb} {self.layout_comb}. Tried to fall back "
f"to alignment {key}, but that was also not compatible. Compatible alignments "
f"are {self.kernels_by_alignment.keys()}"
)
return self.kernels_by_alignment[key]
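When the exact `(A, B, C)` alignment key is missing, the code above falls back to the minimum of the three alignments applied to every operand before giving up. A toy version of that lookup (hypothetical data and function names):

```python
kernels = {"8 8 8": ["op_align8"], "4 4 4": ["op_align4"]}

def lookup(a: int, b: int, c: int):
    key = f"{a} {b} {c}"
    if key not in kernels:
        # Reconcile mismatched alignments by dropping all operands to the minimum.
        m = min(a, b, c)
        key = f"{m} {m} {m}"
    if key not in kernels:
        raise Exception(f"No operations of alignment {key}")
    return kernels[key]

print(lookup(8, 8, 8))  # exact hit
print(lookup(8, 4, 8))  # falls back to "4 4 4"
```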
def _operand_idx(self, key: str) -> int:
operand_list = ["A", "B", "C"]
if key not in operand_list:
raise Exception(f"Unexpected operand {key}")
return operand_list.index(key)
def find_alignment(self, shape: tuple, layout: cutlass.LayoutType, operand: str) -> int:
"""
Returns the most preferable alignment for a given shape and layout
@@ -122,10 +149,14 @@ class KernelsForDataType:
:type shape: tuple
:param layout: layout of the tensor
:type layout: cutlass.LayoutType
:param operand: descriptor of the operand in question
:type operand: str
:return: maximum alignment supported by the data type combination and tensor size
:rtype: int
"""
operand_idx = self._operand_idx(operand)
# Determine the leading dimension of the shape
if layout == cutlass.LayoutType.ColumnMajor:
ld = shape[-2]
@@ -136,7 +167,8 @@ class KernelsForDataType:
else:
raise Exception(f"Unexpected or unsupported layout {layout}")
for alignment in sorted(list(self.kernels_by_alignment.keys()), reverse=True):
for alignments in sorted(list(self.kernels_by_alignment.keys()), reverse=True):
alignment = int(alignments.split(" ")[operand_idx])
if ld % alignment == 0:
return alignment
@@ -165,7 +197,7 @@ class ArchOptions:
:param kernel_cc: compute capability of the kernels to generate
:type kernel_cc: int
:param operation_kind: type of operation to register
:type operation_kind: cutlass.OperationKind
:type operation_kind: cutlass_library.OperationKind
:param gemm_kinds: types of GEMM operations that can be included
:type gemm_kinds: list
:param allowed_math_operations: types of primitive math operations allowed
@@ -176,11 +208,12 @@ class ArchOptions:
self,
target_cc: int,
kernel_cc: int,
operation_kind: cutlass.OperationKind,
operation_kind: cutlass_library.OperationKind,
gemm_kinds: list,
allowed_math_operations: list = [
cutlass.MathOperation.multiply_add,
cutlass.MathOperation.multiply_add_saturate,
cutlass_library.MathOperation.multiply_add,
cutlass_library.MathOperation.multiply_add_saturate,
cutlass_library.MathOperation.multiply_add_mixed_input_upcast
]
):
self.cc = kernel_cc
@@ -229,7 +262,7 @@ class ArchOptions:
# find available opclasses and data types
for name, op_list in manifest.operations[operation_kind][kernel_cc].items():
for op in op_list:
if operation_kind == cutlass.OperationKind.Gemm:
if operation_kind == cutlass_library.OperationKind.Gemm:
if op.gemm_kind not in gemm_kinds:
continue
@@ -237,15 +270,11 @@ class ArchOptions:
if mi.math_operation not in self.allowed_math_operations:
continue
if op.C.element == cutlass.DataType.void:
# The CUTLASS Python interface currently does not support void-C kernels
continue
datatype_comb = (mi.element_a, mi.element_b, mi.element_accumulator)
# Prune operations that don't fit in shared memory
td = td_from_profiler_op(op)
if not valid_stage_count(target_cc, kernel_cc, td)[0]:
if not valid_stage_count(target_cc, kernel_cc, td, verbose=False)[0]:
continue
if mi.opcode_class not in self.operations_by_opclass:
@@ -255,17 +284,17 @@ class ArchOptions:
layout_comb = (op.A.layout, op.B.layout)
# Register TF32 kernels as F32 to enable F32 -> TF32 conversion + TF32 Tensor Core operations
if datatype_comb == (cutlass.DataType.tf32, cutlass.DataType.tf32, cutlass.DataType.f32):
if datatype_comb == (cutlass_library.DataType.tf32, cutlass_library.DataType.tf32, cutlass_library.DataType.f32):
# TF32 kernels only supported on SM80 and beyond
if self.cc < 80:
continue
elif self.cc == 90:
if (op.A.element != cutlass.DataType.f32
or op.B.element != cutlass.DataType.f32
or op.C.element != cutlass.DataType.f32):
if (op.A.element != cutlass_library.DataType.f32
or op.B.element != cutlass_library.DataType.f32
or op.C.element != cutlass_library.DataType.f32):
continue
datatype_comb = (cutlass.DataType.f32, cutlass.DataType.f32, cutlass.DataType.f32)
datatype_comb = (cutlass_library.DataType.f32, cutlass_library.DataType.f32, cutlass_library.DataType.f32)
opclass_dict = self.operations_by_opclass[mi.opcode_class]
key = (datatype_comb, layout_comb)
@@ -274,82 +303,82 @@ class ArchOptions:
opclass_dict[key].add(op)
# Set the default opclass to TensorOp, if available. Otherwise default to SIMT
if cutlass.OpcodeClass.TensorOp in self.operations_by_opclass:
self.op_class = cutlass.OpcodeClass.TensorOp
if cutlass_library.OpcodeClass.TensorOp in self.operations_by_opclass:
self.op_class = cutlass_library.OpcodeClass.TensorOp
else:
self.op_class = cutlass.OpcodeClass.Simt
self.op_class = cutlass_library.OpcodeClass.Simt
# The profiler's generator may generate only a limited set of combinations of operands for SIMT kernels.
# Here, we generate additional versions via a generic TileDescription.
if cutlass.OpcodeClass.Simt not in self.operations_by_opclass:
self.operations_by_opclass[cutlass.OpcodeClass.Simt] = {}
if cutlass_library.OpcodeClass.Simt not in self.operations_by_opclass:
self.operations_by_opclass[cutlass_library.OpcodeClass.Simt] = {}
if operation_kind == cutlass.OperationKind.Gemm:
if operation_kind == cutlass_library.OperationKind.Gemm:
types = [
(cutlass.DataType.s8, cutlass.DataType.s8, cutlass.DataType.s8),
(cutlass.DataType.s8, cutlass.DataType.s8, cutlass.DataType.s32),
(cutlass.DataType.f16, cutlass.DataType.f16, cutlass.DataType.f16),
(cutlass.DataType.f16, cutlass.DataType.f16, cutlass.DataType.f32),
(cutlass.DataType.f32, cutlass.DataType.f32, cutlass.DataType.f32),
(cutlass.DataType.f64, cutlass.DataType.f64, cutlass.DataType.f64),
(cutlass_library.DataType.s8, cutlass_library.DataType.s8, cutlass_library.DataType.s8),
(cutlass_library.DataType.s8, cutlass_library.DataType.s8, cutlass_library.DataType.s32),
(cutlass_library.DataType.f16, cutlass_library.DataType.f16, cutlass_library.DataType.f16),
(cutlass_library.DataType.f16, cutlass_library.DataType.f16, cutlass_library.DataType.f32),
(cutlass_library.DataType.f32, cutlass_library.DataType.f32, cutlass_library.DataType.f32),
(cutlass_library.DataType.f64, cutlass_library.DataType.f64, cutlass_library.DataType.f64),
]
layouts = [
(cutlass.LayoutType.RowMajor, cutlass.LayoutType.RowMajor),
(cutlass.LayoutType.RowMajor, cutlass.LayoutType.ColumnMajor),
(cutlass.LayoutType.ColumnMajor, cutlass.LayoutType.RowMajor),
(cutlass.LayoutType.ColumnMajor, cutlass.LayoutType.ColumnMajor),
(cutlass_library.LayoutType.RowMajor, cutlass_library.LayoutType.RowMajor),
(cutlass_library.LayoutType.RowMajor, cutlass_library.LayoutType.ColumnMajor),
(cutlass_library.LayoutType.ColumnMajor, cutlass_library.LayoutType.RowMajor),
(cutlass_library.LayoutType.ColumnMajor, cutlass_library.LayoutType.ColumnMajor),
]
elif operation_kind == cutlass.OperationKind.Conv2d:
elif operation_kind == cutlass_library.OperationKind.Conv2d:
types = [
(cutlass.DataType.f16, cutlass.DataType.f16, cutlass.DataType.f16),
(cutlass.DataType.f16, cutlass.DataType.f16, cutlass.DataType.f32),
(cutlass.DataType.f32, cutlass.DataType.f32, cutlass.DataType.f32),
(cutlass.DataType.f64, cutlass.DataType.f64, cutlass.DataType.f64),
(cutlass_library.DataType.f16, cutlass_library.DataType.f16, cutlass_library.DataType.f16),
(cutlass_library.DataType.f16, cutlass_library.DataType.f16, cutlass_library.DataType.f32),
(cutlass_library.DataType.f32, cutlass_library.DataType.f32, cutlass_library.DataType.f32),
(cutlass_library.DataType.f64, cutlass_library.DataType.f64, cutlass_library.DataType.f64),
]
layouts = [
(cutlass.LayoutType.TensorNHWC, cutlass.LayoutType.TensorNHWC),
(cutlass_library.LayoutType.TensorNHWC, cutlass_library.LayoutType.TensorNHWC),
]
else:
raise NotImplementedError(f"Operation kind {operation_kind} is currently unsupported.")
alignment = 1
epilogue_functor = cutlass.EpilogueFunctor.LinearCombination
swizzling_functor = cutlass.SwizzlingFunctor.Identity8
epilogue_functor = cutlass_library.EpilogueFunctor.LinearCombination
swizzling_functor = cutlass_library.SwizzlingFunctor.Identity8
for type_comb in types:
for layout_comb in layouts:
comb = (type_comb, layout_comb)
if comb in self.operations_by_opclass[cutlass.OpcodeClass.Simt]:
if comb in self.operations_by_opclass[cutlass_library.OpcodeClass.Simt]:
continue
A = cutlass.TensorDescription(type_comb[0], layout_comb[0], alignment)
B = cutlass.TensorDescription(type_comb[1], layout_comb[1], alignment)
C = cutlass.TensorDescription(type_comb[2], cutlass.LayoutType.ColumnMajor, alignment)
math_inst = cutlass.MathInstruction(
A = cutlass_library.TensorDescription(type_comb[0], layout_comb[0], alignment)
B = cutlass_library.TensorDescription(type_comb[1], layout_comb[1], alignment)
C = cutlass_library.TensorDescription(type_comb[2], cutlass_library.LayoutType.ColumnMajor, alignment)
math_inst = cutlass_library.MathInstruction(
[1, 1, 1],
type_comb[0],
type_comb[1],
type_comb[2],
cutlass.OpcodeClass.Simt,
cutlass.MathOperation.multiply_add
cutlass_library.OpcodeClass.Simt,
cutlass_library.MathOperation.multiply_add
)
td = cutlass.TileDescription(
td = cutlass_library.TileDescription(
[128, 128, 8], 2, [4, 2, 1], math_inst, 50, 1024)
# Prune operations that don't fit in shared memory
if not valid_stage_count(target_cc, kernel_cc, td_from_profiler_td(td))[0]:
if not valid_stage_count(target_cc, kernel_cc, td_from_profiler_td(td), verbose=False)[0]:
continue
new_kernels = KernelsForDataType(type_comb, layout_comb)
if operation_kind == cutlass.OperationKind.Gemm:
if operation_kind == cutlass_library.OperationKind.Gemm:
new_operation = cutlass_library.manifest.GemmOperation(
cutlass.GemmKind.Universal, td.minimum_compute_capability,
cutlass_library.GemmKind.Universal, td.minimum_compute_capability,
td, A, B, C, type_comb[2], epilogue_functor, swizzling_functor)
new_kernels.add(new_operation)
elif operation_kind == cutlass.OperationKind.Conv2d:
elif operation_kind == cutlass_library.OperationKind.Conv2d:
for conv_kind in [ConvKind.Fprop, ConvKind.Dgrad, ConvKind.Wgrad]:
new_operation = cutlass_library.manifest.Conv2dOperation(
conv_kind, IteratorAlgorithm.Analytic, td.minimum_compute_capability, td,
@@ -358,7 +387,7 @@ class ArchOptions:
)
new_kernels.add(new_operation)
self.operations_by_opclass[cutlass.OpcodeClass.Simt][comb] = new_kernels
self.operations_by_opclass[cutlass_library.OpcodeClass.Simt][comb] = new_kernels
# Sort all operations
for oc in self.operations_by_opclass.keys():
@@ -366,17 +395,17 @@ class ArchOptions:
self.operations_by_opclass[oc][comb].sort()
def opclass_supports_combination(
self, op_class: cutlass.OpcodeClass, datatype_comb: tuple, layout_comb: tuple
self, op_class: cutlass_library.OpcodeClass, datatype_comb: tuple, layout_comb: tuple
) -> bool:
"""
Returns whether the provided operation class supports the provided data type and layout combination
:param op_class: operation class to consider
:type op_class: cutlass.OpcodeClass
:type op_class: cutlass_library.OpcodeClass
:param datatype_comb: tuple of data types for (element_A, element_B, element_accumulator)
:type datatype_comb: tuple[cutlass.DataType]
:type datatype_comb: tuple[cutlass_library.DataType]
:param layout_comb: tuple of data types for (layout_A, layout_B)
:type layout_comb: tuple[cutlass.LayoutType]
:type layout_comb: tuple[cutlass_library.LayoutType]
:return: whether the provided operation class supports the provided data type and layout combination
:rtype: bool
@@ -388,25 +417,25 @@ class ArchOptions:
def supporting_opclasses(
self,
element_a: cutlass.DataType,
element_b: cutlass.DataType,
element_accumulator: cutlass.DataType,
layout_a: cutlass.LayoutType,
layout_b: cutlass.LayoutType,
element_a: cutlass_library.DataType,
element_b: cutlass_library.DataType,
element_accumulator: cutlass_library.DataType,
layout_a: cutlass_library.LayoutType,
layout_b: cutlass_library.LayoutType,
) -> set:
"""
Returns a set of operation classes that support the provided data type and layout combination
:param element_a: data type of operand A
:type element_a: cutlass.DataType
:type element_a: cutlass_library.DataType
:param element_b: data type of operand B
:type element_b: cutlass.DataType
:type element_b: cutlass_library.DataType
:param element_accumulator: data type of accumulator
:type element_accumulator: cutlass.DataType
:type element_accumulator: cutlass_library.DataType
:param layout_a: layout of operand A
:type layout_a: cutlass.LayoutType
:type layout_a: cutlass_library.LayoutType
:param layout_b: layout of operand B
:type layout_b: cutlass.LayoutType
:type layout_b: cutlass_library.LayoutType
:return: set of operation classes that support the provided data type and layout combination
:rtype: set
@@ -422,28 +451,28 @@ class ArchOptions:
def operations(
self,
op_class: cutlass.OpcodeClass,
element_a: cutlass.DataType,
element_b: cutlass.DataType,
element_accumulator: cutlass.DataType,
layout_a: cutlass.LayoutType,
layout_b: cutlass.LayoutType,
op_class: cutlass_library.OpcodeClass,
element_a: cutlass_library.DataType,
element_b: cutlass_library.DataType,
element_accumulator: cutlass_library.DataType,
layout_a: cutlass_library.LayoutType,
layout_b: cutlass_library.LayoutType,
) -> KernelsForDataType:
"""
Returns the kernels supporting the provided combination of operation class, data types, and layouts
:param op_class: operation class to consider
:type op_class: cutlass.OpcodeClass
:type op_class: cutlass_library.OpcodeClass
:param element_a: data type of operand A
:type element_a: cutlass.DataType
:type element_a: cutlass_library.DataType
:param element_b: data type of operand B
:type element_b: cutlass.DataType
:type element_b: cutlass_library.DataType
:param element_accumulator: data type of accumulator
:type element_accumulator: cutlass.DataType
:type element_accumulator: cutlass_library.DataType
:param layout_a: layout of operand A
:type layout_a: cutlass.LayoutType
:type layout_a: cutlass_library.LayoutType
:param layout_b: layout of operand B
:type layout_b: cutlass.LayoutType
:type layout_b: cutlass_library.LayoutType
:return: container of kernels by alignment supported by the provided combination of parameters
:rtype: KernelsForDataType
@@ -469,13 +498,13 @@ class OptionRegistry:
def __init__(self, target_cc: int):
self.registry = {}
gemm_kinds = [cutlass.GemmKind.Universal, cutlass.GemmKind.Universal3x]
operation_kinds = [cutlass.OperationKind.Gemm, cutlass.OperationKind.Conv2d]
gemm_kinds = [cutlass_library.GemmKind.Universal, cutlass_library.GemmKind.Universal3x]
operation_kinds = [cutlass_library.OperationKind.Gemm, cutlass_library.OperationKind.Conv2d]
# Construct options for each CC
for kernel_cc in _generator_ccs:
self.registry[kernel_cc] = {}
for opkind in operation_kinds:
self.registry[kernel_cc][opkind] = ArchOptions(target_cc, kernel_cc, opkind, gemm_kinds)
def options_for_cc(self, cc: int, op_kind=cutlass.OperationKind.Gemm) -> ArchOptions:
def options_for_cc(self, cc: int, op_kind=cutlass_library.OperationKind.Gemm) -> ArchOptions:
return self.registry.get(cc, None)[op_kind]
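The nested registry lookup above is keyed first by compute capability, then by operation kind. Note that `self.registry.get(cc, None)[op_kind]` raises a `TypeError` when `cc` is absent; a hedged standalone sketch (class and option names here are illustrative, not the CUTLASS API) might guard both lookup levels:

```python
# Minimal sketch of a per-CC, per-operation-kind options registry.
# The real OptionRegistry stores ArchOptions objects; strings stand in here.
class SimpleOptionRegistry:
    def __init__(self, ccs, op_kinds):
        # registry[cc][op_kind] -> options object
        self.registry = {
            cc: {ok: f"options-{cc}-{ok}" for ok in op_kinds} for cc in ccs
        }

    def options_for_cc(self, cc, op_kind="gemm"):
        # Guard both levels instead of registry.get(cc, None)[op_kind],
        # which raises TypeError when cc is missing.
        by_kind = self.registry.get(cc)
        if by_kind is None:
            return None
        return by_kind.get(op_kind)

reg = SimpleOptionRegistry([70, 80, 90], ["gemm", "conv2d"])
print(reg.options_for_cc(80, "gemm"))  # options-80-gemm
print(reg.options_for_cc(75, "gemm"))  # None
```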


@@ -112,15 +112,18 @@
args.sync()
"""
import cutlass
from cutlass import epilogue
from cutlass import (
from cutlass_library import (
ConvKind,
ConvMode,
DataTypeSize,
IteratorAlgorithm,
OperationKind,
SplitKMode,
StrideSupport,
)
import cutlass
from cutlass import epilogue
from cutlass.backend import compiler
from cutlass.backend.conv2d_operation import Conv2dArguments, Conv2dOperation
from cutlass.backend.reduction_operation import ReductionOperation, ReductionArguments
@@ -202,7 +205,7 @@ class Conv2d(OperationBase):
element_accumulator=None,
cc: int = None, kernel_cc: int = None
):
super().__init__(cc=cc, kernel_cc=kernel_cc, operation_kind=cutlass.OperationKind.Conv2d)
super().__init__(cc=cc, kernel_cc=kernel_cc, operation_kind=OperationKind.Conv2d)
# Verify the kernel cc
if self.current_cc == 90:
# The Conv2d kernel on Hopper (SM90) is currently unsupported
@@ -305,11 +308,11 @@ class Conv2d(OperationBase):
self._reset_epilogue_functor_activation(epilogue.identity)
self.alignment_pref_A = min(
128 // cutlass.DataTypeSize[self._element_a], max(self.possible_operations.alignments))
128 // DataTypeSize[self._element_a], max(self.possible_operations.alignments("A")))
self.alignment_pref_B = min(
128 // cutlass.DataTypeSize[self._element_b], max(self.possible_operations.alignments))
128 // DataTypeSize[self._element_b], max(self.possible_operations.alignments("B")))
self.alignment_pref_C = min(
128 // cutlass.DataTypeSize[self._element_c], max(self.possible_operations.alignments))
128 // DataTypeSize[self._element_c], max(self.possible_operations.alignments("C")))
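The alignment preferences computed above cap each operand at one 128-bit access (128 bits divided by the element width) and clamp that to the widest alignment any generated kernel supports. A self-contained sketch, using standard bit widths rather than `cutlass_library.DataTypeSize`:

```python
# Bit widths for a few element types (mirrors cutlass_library.DataTypeSize).
DATA_TYPE_SIZE = {"f16": 16, "f32": 32, "f64": 64, "s8": 8}

def alignment_preference(element, available_alignments):
    """Largest usable alignment: at most a full 128-bit access, and no
    wider than the best alignment among the generated kernels."""
    max_by_width = 128 // DATA_TYPE_SIZE[element]
    return min(max_by_width, max(available_alignments))

# f16 permits 128 // 16 = 8 elements per 128-bit access.
print(alignment_preference("f16", [1, 2, 4, 8]))  # 8
# If only alignment-4 kernels were generated, prefer 4.
print(alignment_preference("f16", [1, 2, 4]))     # 4
```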
#
# Tile description Related
@@ -342,8 +345,7 @@ class Conv2d(OperationBase):
return
if isinstance(td, dict):
if self._tile_description is None:
alignment = list(self.possible_operations.kernels_by_alignment.keys())[0]
op = self.possible_operations.operations(alignment)[0]
op = self.possible_operations.default_operation()
self._tile_description = datatypes.td_from_profiler_op(op)
if "cluster_shape" in td.keys():
if td["cluster_shape"] != [1, 1, 1]:
@@ -567,8 +569,7 @@ class Conv2d(OperationBase):
if self.tile_description is not None:
tile_description = self.tile_description
else:
min_alignment = min([alignment_A, alignment_B, alignment_C])
op = self.possible_operations.operations(min_alignment)[0]
op = self.possible_operations.operations(alignment_A, alignment_B, alignment_C)[0]
tile_description = datatypes.td_from_profiler_op(op)
else:
valid, err_str = self._valid_tile_description(tile_description)
@@ -753,6 +754,8 @@ class Conv2d(OperationBase):
:return: arguments passed in to the kernel
:rtype: cutlass.backend.Conv2dArguments
"""
super().run_setup()
A = self._verify_tensor(A, self.A, self._element_a, self._layout_a, "A")
B = self._verify_tensor(B, self.B, self._element_b, self._layout_b, "B")
C = self._verify_tensor(C, self.C, self._element_c, self._layout_c, "C")
@@ -782,9 +785,9 @@ class Conv2d(OperationBase):
shape_c = datatypes.get_tensor_shape(C, op="CONV")
# Get the alignment
alignment_a = self.possible_operations.find_alignment(shape_a, self._layout_a)
alignment_b = self.possible_operations.find_alignment(shape_b, self._layout_b)
alignment_c = self.possible_operations.find_alignment(shape_c, self._layout_c)
alignment_a = self.possible_operations.find_alignment(shape_a, self._layout_a, operand="A")
alignment_b = self.possible_operations.find_alignment(shape_b, self._layout_b, operand="B")
alignment_c = self.possible_operations.find_alignment(shape_c, self._layout_c, operand="C")
alignment_a = check.update_alignment(alignment_a, self.alignment_pref_A)
alignment_b = check.update_alignment(alignment_b, self.alignment_pref_B)
@@ -858,6 +861,10 @@ class Conv2d(OperationBase):
if sync:
if split_k[0] == "parallel" and split_k[1] > 1:
reduction_arguments.sync()
# Free memory allocated by args because we are not
# calling `arguments.sync()` in this case (which will free memory)
arguments.free()
else:
arguments.sync()


@@ -116,12 +116,14 @@
from math import prod
import cutlass
from cutlass import (
epilogue,
swizzle,
from cutlass_library import (
DataType,
DataTypeSize,
GemmUniversalMode,
)
import cutlass
from cutlass import epilogue, swizzle
from cutlass.backend import compiler
from cutlass.backend.evt import EpilogueFunctorVisitor
from cutlass.backend.gemm_operation import GemmArguments, GemmOperationUniversal
@@ -292,7 +294,7 @@ class Gemm(OperationBase):
f'combination {datatype_comb}x{layout_comb}')
if reset_epilogue:
self._reset_epilogue_functor_activation(epilogue.identity)
self._reset_epilogue_functor_activation(cutlass.epilogue.identity)
@property
def swizzling_functor(self):
@@ -308,7 +310,7 @@ class Gemm(OperationBase):
"""
Sets the swizzling functor to the type specified by `swizzling_functor`
"""
if swizzling_functor == swizzle.ThreadblockSwizzleStreamK:
if swizzling_functor == cutlass.swizzle.ThreadblockSwizzleStreamK:
if self.op_class == cutlass.OpcodeClass.Simt:
raise Exception('ThreadblockSwizzleStreamK is currently only supported with opcode class TensorOp')
@@ -347,8 +349,7 @@ class Gemm(OperationBase):
return
if isinstance(td, dict):
if self._tile_description is None:
alignment = list(self.possible_operations.kernels_by_alignment.keys())[0]
op = self.possible_operations.operations(alignment)[0]
op = self.possible_operations.default_operation()
self._tile_description = datatypes.td_from_profiler_op(op)
td = self._tile_description.clone_and_update(td)
@@ -414,22 +415,25 @@ class Gemm(OperationBase):
:return: operation that was constructed
:rtype: cutlass.backend.GemmOperationUniversal
"""
alignment_pref_A = min(128 // cutlass.DataTypeSize[self._element_a], max(self.possible_operations.alignments))
alignment_pref_B = min(128 // cutlass.DataTypeSize[self._element_b], max(self.possible_operations.alignments))
alignment_pref_C = min(128 // cutlass.DataTypeSize[self._element_c], max(self.possible_operations.alignments))
alignment_pref_A = min(128 // DataTypeSize[self._element_a], max(self.possible_operations.alignments("A")))
alignment_pref_B = min(128 // DataTypeSize[self._element_b], max(self.possible_operations.alignments("B")))
alignment_A = check.alignment_or_default(alignment_A, alignment_pref_A)
alignment_B = check.alignment_or_default(alignment_B, alignment_pref_B)
alignment_C = check.alignment_or_default(alignment_C, alignment_pref_C)
self.epilogue_functor = self._reset_epilogue_functor_alignment(alignment_C, self.epilogue_functor)
tensor_A = TensorDescription(self._element_a, self._layout_a, alignment_A)
tensor_B = TensorDescription(self._element_b, self._layout_b, alignment_B)
alignment_pref_C = max(self.possible_operations.alignments("C"))
if self._element_c != DataType.void:
alignment_pref_C = min(128 // DataTypeSize[self._element_c], alignment_pref_C)
alignment_C = check.alignment_or_default(alignment_C, alignment_pref_C)
tensor_C = TensorDescription(self._element_c, self._layout_c, alignment_C)
self.epilogue_functor = self._reset_epilogue_functor_alignment(alignment_C, self.epilogue_functor)
if tile_description is None:
if self._tile_description is None:
op = self.possible_operations.operations(alignment_A)[0]
op = self.possible_operations.operations(alignment_A, alignment_B, alignment_C)[0]
tile_description = datatypes.td_from_profiler_op(op)
else:
tile_description = self._tile_description
@@ -527,7 +531,7 @@ class Gemm(OperationBase):
:return: stride between each matrix in the batch
:rtype: int
"""
if len(tensor.shape) > 2:
if tensor is not None and len(tensor.shape) > 2:
return tensor.shape[-2] * tensor.shape[-1]
else:
return 0
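The batch-stride rule above can be sketched as a plain function: the stride between consecutive matrices is the size of one trailing M-by-N matrix when the tensor carries batch modes, and 0 otherwise (shapes here are plain tuples, not framework tensors):

```python
def batch_stride(shape):
    """Stride (in elements) between consecutive matrices of a batched
    tensor: the size of one trailing matrix if the rank exceeds 2,
    else 0 (the same tensor is reused across the batch)."""
    if shape is not None and len(shape) > 2:
        return shape[-2] * shape[-1]
    return 0

print(batch_stride((3, 4, 5)))  # 20: each 4x5 matrix occupies 20 elements
print(batch_stride((4, 5)))     # 0: no batch dimension
```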
@@ -566,12 +570,14 @@ class Gemm(OperationBase):
B_row = self._layout_b == cutlass.LayoutType.RowMajor
C_row = self._layout_c == cutlass.LayoutType.RowMajor
batched = lambda x : len(x.shape) > 2 and prod(x.shape[:-2]) == batch_count
# Consider a Tensor to be batched if its rank is > 2 and
# the product of the modes beyond rank 2 equals our pre-determined batch size.
batched = lambda x : x is None or (len(x.shape) > 2 and prod(x.shape[:-2]) == batch_count)
if batched(A) and not batched(B) and batched(C) and A_row and C_row:
if batched(A) and not batched(B) and (C is None or batched(C)) and A_row and C_row:
M *= batch_count
returned_batch_count = 1
elif not batched(A) and batched(B) and batched(C) and not B_row and not C_row:
elif not batched(A) and batched(B) and (C is None or batched(C)) and not B_row and not C_row:
N *= batch_count
returned_batch_count = 1
else:
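The folding heuristic above can be sketched outside CUTLASS: when only A is batched and row-major, the batch is stacked along M; when only B is batched and column-major, along N; otherwise an explicit batch count is kept. This sketch simplifies the C-operand and layout checks of the source, and its names are illustrative:

```python
from math import prod

def fold_batch(M, N, batch_count, a_shape, b_shape, a_row, b_row):
    """Return (M, N, batch_count) after folding the batch into a GEMM
    mode where possible. A tensor counts as batched if its rank exceeds
    2 and its leading modes multiply to batch_count."""
    batched = lambda s: len(s) > 2 and prod(s[:-2]) == batch_count
    if batched(a_shape) and not batched(b_shape) and a_row:
        return M * batch_count, N, 1           # stack batch along M
    if not batched(a_shape) and batched(b_shape) and not b_row:
        return M, N * batch_count, 1           # stack batch along N
    return M, N, batch_count                   # keep explicit batching

# 4 batched row-major A matrices against a single B: fold batch into M.
print(fold_batch(128, 64, 4, (4, 128, 32), (32, 64), True, False))  # (512, 64, 1)
```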
@@ -625,6 +631,7 @@ class Gemm(OperationBase):
:return: arguments passed in to the kernel
:rtype: cutlass.backend.GemmArguments
"""
super().run_setup()
A = self._verify_tensor(A, self.A, self._element_a, self._layout_a, "A")
B = self._verify_tensor(B, self.B, self._element_b, self._layout_b, "B")
C = self._verify_tensor(C, self.C, self._element_c, self._layout_c, "C")
@@ -632,14 +639,20 @@ class Gemm(OperationBase):
alpha = self._verify_scalar(alpha, self.alpha, self._element_c, "alpha")
beta = self._verify_scalar(beta, self.beta, self._element_c, "beta")
is_void_c = self._element_c == DataType.void
self._verify_rank(A)
self._verify_rank(B)
self._verify_rank(C)
if not is_void_c:
self._verify_rank(C)
self._verify_rank(D)
alignment_a = self.possible_operations.find_alignment(A.shape, self._layout_a)
alignment_b = self.possible_operations.find_alignment(B.shape, self._layout_b)
alignment_c = self.possible_operations.find_alignment(C.shape, self._layout_c)
alignment_a = self.possible_operations.find_alignment(A.shape, self._layout_a, operand="A")
alignment_b = self.possible_operations.find_alignment(B.shape, self._layout_b, operand="B")
# Set C alignment based on D.shape so as to correctly get an alignment with void-C
# kernels, for which `C` is None.
alignment_c = self.possible_operations.find_alignment(D.shape, self._layout_c, operand="C")
self.compile(self._tile_description, alignment_A=alignment_a, alignment_B=alignment_b,
alignment_C=alignment_c, print_module=print_module)


@@ -51,7 +51,8 @@
plan.run([A0, A1], [B0, B1], [C0, C1], [D0, D1])
"""
from cutlass import DataTypeSize
from cutlass_library import DataTypeSize
from cutlass.backend.gemm_operation import (
GemmGroupedArguments,
GemmOperationGrouped,
@@ -162,10 +163,9 @@ class GroupedGemm(Gemm):
:return: operation that was constructed
:rtype: cutlass.backend.GemmOperationGrouped
"""
alignment_preference = max(self.possible_operations.alignments)
alignment_A = check.alignment_or_default(alignment_A, alignment_preference)
alignment_B = check.alignment_or_default(alignment_B, alignment_preference)
alignment_C = check.alignment_or_default(alignment_C, alignment_preference)
alignment_A = check.alignment_or_default(alignment_A, max(self.possible_operations.alignments("A")))
alignment_B = check.alignment_or_default(alignment_B, max(self.possible_operations.alignments("B")))
alignment_C = check.alignment_or_default(alignment_C, max(self.possible_operations.alignments("C")))
self.epilogue_functor = self._reset_epilogue_functor_alignment(alignment_C, self.epilogue_functor)
@@ -174,7 +174,7 @@ class GroupedGemm(Gemm):
tensor_C = TensorDescription(self._element_c, self._layout_c, alignment_C)
if tile_description is None:
op = self.possible_operations.operations(alignment_A)[0]
op = self.possible_operations.operations(alignment_A, alignment_B, alignment_C)[0]
tile_description = datatypes.td_from_profiler_op(op)
else:
valid, err_str = self._valid_tile_description(tile_description)
@@ -221,6 +221,8 @@ class GroupedGemm(Gemm):
:return: arguments passed in to the kernel
:rtype: cutlass.backend.GemmGroupedArguments
"""
super().run_setup()
if len(A) != len(B) or len(A) != len(C) or len(A) != len(D):
raise Exception("Lengths of A, B, C, and D lists must be equal")
@@ -236,9 +238,9 @@ class GroupedGemm(Gemm):
alpha = self._verify_scalar(alpha, self.alpha, self._element_c, "alpha")
beta = self._verify_scalar(beta, self.beta, self._element_c, "beta")
alignment_a = min((self.possible_operations.find_alignment(A.shape, self._layout_a) for A in As))
alignment_b = min((self.possible_operations.find_alignment(B.shape, self._layout_b) for B in Bs))
alignment_c = min((self.possible_operations.find_alignment(C.shape, self._layout_c) for C in Cs))
alignment_a = min((self.possible_operations.find_alignment(A.shape, self._layout_a, operand="A") for A in As))
alignment_b = min((self.possible_operations.find_alignment(B.shape, self._layout_b, operand="B") for B in Bs))
alignment_c = min((self.possible_operations.find_alignment(C.shape, self._layout_c, operand="C") for C in Cs))
self.compile(self.tile_description, alignment_A=alignment_a, alignment_B=alignment_b,
alignment_C=alignment_c, print_module=print_module)


@@ -36,11 +36,13 @@ Base operation used for defining high-level CUTLASS operations (e.g., GEMM, Conv
from bisect import bisect_left
from cutlass_library import DataType, DataTypeSize, OperationKind, SharedMemPerCC
import cutlass
from cutlass import option_registry, epilogue
from cutlass import get_option_registry
from cutlass.backend.evt import EpilogueFunctorVisitor
from cutlass.backend.utils.device import device_cc
from cutlass.epilogue import get_activations
from cutlass.epilogue import get_activations, get_activation_epilogue, identity
from cutlass.library_defaults import KernelsForDataType, _generator_ccs
from cutlass.swizzle import get_swizzling_functors
from cutlass.utils import datatypes, check
@@ -51,12 +53,14 @@ class OperationBase:
Base operation used for defining high-level CUTLASS operations (e.g., GEMM, Conv2d)
"""
def __init__(self, cc: int = None, kernel_cc: int = None, operation_kind = cutlass.OperationKind.Gemm):
def __init__(self, cc: int = None, kernel_cc: int = None, operation_kind = OperationKind.Gemm):
"""
:param cc: compute capability of device for which kernels should be compiled. For example, if running on H100, this should be set to 90
:type cc: int
:param kernel_cc: compute capability of kernels to generate. For example, if running on SM90, but desiring to use a CUTLASS 2.x-style Ampere kernel, this should be set to 80
:type kernel_cc: int
:param operation_kind: class of operation that will be performed (e.g., GEMM, Conv)
:type operation_kind: cutlass_library.OperationKind
"""
self.operation_kind = operation_kind
self.cc = cc if cc is not None else device_cc()
@@ -64,13 +68,13 @@ class OperationBase:
self.current_cc = kernel_cc if kernel_cc is not None else self._find_closest_cc(self.cc)
self.tile_description = None
self.options = option_registry.options_for_cc(self.current_cc, operation_kind)
self.options = get_option_registry().options_for_cc(self.current_cc, operation_kind)
if self.options is None:
raise Exception(f"Invalid or unsupported compute capability: {self.current_cc}")
# Default activation function: identity
self._activation = epilogue.identity
self._activation = identity
def _find_closest_cc(self, cc: int) -> int:
"""
@@ -120,7 +124,7 @@ class OperationBase:
if cc not in _generator_ccs:
raise Exception(f'Invalid CC for CUTLASS kernels: {cc}.')
self.current_cc = cc
self.options = option_registry.options_for_cc(self.current_cc, self.operation_kind)
self.options = get_option_registry().options_for_cc(self.current_cc, self.operation_kind)
def _verify_scalar(self, scalar, ref_scalar, ref_dtype, name):
"""
@@ -158,9 +162,12 @@ class OperationBase:
def _verify_tensor(self, tensor, ref_tensor, ref_dtype, ref_layout, name):
"""
Verifies the following properties:
1) Either ``tensor`` or ``ref_tensor`` must be set (i.e., not ``None``)
2) If ``tensor`` is not ``None``, its datatype and layout must match the current versions
set by the plan (i.e., those in ``ref_dtype`` and ``ref_layout``)
If ref_dtype is not void:
1) Either ``tensor`` or ``ref_tensor`` must be set (i.e., not ``None``)
2) If ``tensor`` is not ``None``, its datatype and layout must match the current versions
set by the plan (i.e., those in ``ref_dtype`` and ``ref_layout``)
If ref_dtype is void:
Neither ``tensor`` nor ``ref_tensor`` may be set
If either of these properties does not hold, an exception is raised. If these properties hold and
``tensor`` is not ``None``, ``tensor`` is returned. Otherwise, ``ref_tensor`` is returned.
@@ -177,6 +184,11 @@ class OperationBase:
:return: valid tensor object to use
:rtype: numpy/cupy/torch array/tensor object
"""
if ref_dtype == DataType.void:
if tensor is not None or ref_tensor is not None:
raise Exception("Operands with element DataType.void must not be provided a tensor")
return None
if tensor is None:
if ref_tensor is None:
raise Exception(f"Tensor {name} must be set.")
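The void-C verification rule added above can be sketched as a standalone function (dtype/layout matching is omitted, and `ValueError` stands in for the generic `Exception` in the source):

```python
def verify_tensor(tensor, ref_tensor, ref_dtype, name, void="void"):
    """A void operand must not be given a tensor; otherwise either the
    per-call tensor or the plan-level reference tensor is required."""
    if ref_dtype == void:
        if tensor is not None or ref_tensor is not None:
            raise ValueError(
                f"Operand {name} with element {void} must not be provided a tensor")
        return None
    if tensor is None:
        if ref_tensor is None:
            raise ValueError(f"Tensor {name} must be set.")
        return ref_tensor
    return tensor

print(verify_tensor(None, None, "void", "C"))    # None
print(verify_tensor(None, "planC", "f16", "C"))  # planC
```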
@@ -211,58 +223,60 @@ class OperationBase:
f'({self._element_a}, {self._element_b}, {self._element_accumulator}) and '
f'layout combination ({self._layout_a}, {self._layout_b}).')
# Changing the op class changes the elements per access in the epilogue. Reset this.
if self.op_class == cutlass.OpcodeClass.Simt:
elements_per_access = 1
else:
elements_per_access = 128 // cutlass.DataTypeSize[self._element_c]
if self.epilogue_functor is not None:
self.epilogue_functor = self._reset_epilogue_functor_alignment(elements_per_access, self.epilogue_functor)
# Changing the op class also changes the possible operations available. Reset these.
self.possible_operations = self.options.operations(
self.op_class, self._element_a, self._element_b,
self._element_accumulator, self._layout_a, self._layout_b)
# Changing the op class changes the elements per access in the epilogue. Reset this.
if self.epilogue_functor is not None:
self.epilogue_functor = self._reset_epilogue_functor_alignment(self._elements_per_access(), self.epilogue_functor)
#
# Epilogue
#
def _elements_per_access(self):
if self.op_class == cutlass.OpcodeClass.Simt:
return 1
elif self._element_c != DataType.void:
return 128 // DataTypeSize[self._element_c]
else:
return 128 // max(self.possible_operations.alignments("C"))
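The refactored `_elements_per_access` above picks the epilogue vector length: one element for SIMT kernels, a full 128-bit access sized by element C for tensor-op kernels, or, when C is void, 128 bits divided by the widest available C alignment. A hedged sketch with strings standing in for the CUTLASS enums:

```python
# Bit widths standing in for cutlass_library.DataTypeSize.
DATA_TYPE_SIZE = {"f16": 16, "f32": 32}

def elements_per_access(op_class, element_c, c_alignments):
    """Epilogue vector length: SIMT accesses one element at a time;
    tensor-op kernels use a 128-bit access sized by element C, falling
    back to the widest available C alignment when C is void."""
    if op_class == "simt":
        return 1
    if element_c != "void":
        return 128 // DATA_TYPE_SIZE[element_c]
    return 128 // max(c_alignments)

print(elements_per_access("simt", "f32", [1]))       # 1
print(elements_per_access("tensor_op", "f16", [8]))  # 8
```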
def _create_epilogue_functor_activation(self, activation):
"""
Returns the epilogue functor with given activation function
"""
if self.epilogue_functor is None:
if self.op_class == cutlass.OpcodeClass.Simt:
elements_per_access = 1
else:
elements_per_access = 128 // cutlass.DataTypeSize[self._element_c]
elements_per_access = self._elements_per_access()
else:
elements_per_access = self.epilogue_functor.epilogue_vector_length
if not self.specified_kernel_cc:
if self.current_cc == 90 and activation != epilogue.identity:
# CUTLASS 3.0 kernels currently only support identity activation. If one requests a non-identity activation,
if self.current_cc == 90 and activation != identity:
# CUTLASS 3.0 kernels in Python currently only support identity activation. If one requests a non-identity activation,
# revert to using a CUTLASS 2.x kernel by using SM80-tagged kernels.
cutlass.logger.warning("Reverting to using SM80-tagged kernel. Opclass may change.")
if self._element_c != self._element_d:
raise Exception("CUTLASS 2.x kernels require element C to be the same as element D")
self._reset_options(80)
self._reset_operations(reset_epilogue=False)
elif (self.cc == 90 and self.current_cc != 90 and activation == epilogue.identity):
elif (self.cc == 90 and self.current_cc != 90 and activation == identity):
# SM80 fallback kernels are currently used. Since an identity activation is requested,
# we can switch back to using SM90 kernels.
self._reset_options(90)
self._reset_operations(reset_epilogue=False)
else:
if self.current_cc == 90 and activation != epilogue.identity:
if self.current_cc == 90 and activation != identity:
raise Exception("Epilogues with elementwise fusion are not currently supported "
"in the Python interface for 3.x kernels. To use 2.x kernels "
"with fused elementwise epilogues, do not set the `kernel_cc` "
"parameter when constructing the Gemm object.")
return epilogue.get_activation_epilogue(
return get_activation_epilogue(
activation,
self._element_c,
self._element_d,
elements_per_access,
self._element_accumulator,
self._element_accumulator,
@@ -283,13 +297,13 @@ class OperationBase:
if epilogue_functor is None or not hasattr(epilogue_functor, 'activation_functor'):
# Identity epilogue does not have 'activation_functor'
activation = epilogue.identity
activation = identity
else:
activation = epilogue_functor.activation_functor
epilogue_functor = epilogue.get_activation_epilogue(
epilogue_functor = get_activation_epilogue(
activation,
self._element_c,
self._element_d,
alignment,
self._element_accumulator,
self._element_accumulator,
@@ -304,7 +318,7 @@ class OperationBase:
if hasattr(self.epilogue_functor, "activation_functor"):
return self.epilogue_functor.activation_functor
else:
return epilogue.identity
return identity
@activation.setter
def activation(self, act):
@@ -363,8 +377,8 @@ class OperationBase:
epilogue_smem_bytes = self.epilogue_functor.get_smem_size(td)
# Verify the maximum number of mainloop stages
mainloop_smem_per_stage = check.calculate_smem_usage_per_stage(td, cutlass.OperationKind.Gemm)
smem_capacity_bytes = cutlass.SharedMemPerCC[self.cc] << 10
mainloop_smem_per_stage = check.calculate_smem_usage_per_stage(td, OperationKind.Gemm)
smem_capacity_bytes = SharedMemPerCC[self.cc] << 10
mainloop_stages = (smem_capacity_bytes - epilogue_smem_bytes) // mainloop_smem_per_stage
if mainloop_stages < 2:
# Mainloop stages must >= 2
@@ -376,3 +390,11 @@ class OperationBase:
"The epilogue consumes too much shared memory. "
"No valid tile description is found in the generator.")
self.possible_operations = new_possible_operations
def run_setup(self):
"""
Steps that must be taken before calling `plan.run()`
"""
# Initialize the memory pool, if not already done
cutlass.get_memory_pool()


@@ -1,37 +0,0 @@
#################################################################################################
#
# Copyright (c) 2023 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
"""
Profilers for Python Interface
"""
from cutlass.profiler.event_profiler import CUDAEventProfiler


@@ -34,7 +34,7 @@
Utilities for expressing shapes
"""
from cutlass import (
from cutlass_library import (
ConvMode,
ConvKind,
LayoutType
@@ -64,7 +64,7 @@ class MatrixCoord:
Returns the leading dimension for a matrix with layout ``layout`` and shape provided by the MatrixCoord.
:param layout: layout of matrix
:type layout: cutlass.LayoutType
:type layout: cutlass_library.LayoutType
:returns: leading dimension
:rtype: int


@@ -34,7 +34,7 @@
Registry of swizzling functions
"""
from cutlass import SwizzlingFunctor
from cutlass_library import SwizzlingFunctor
IdentitySwizzle1 = SwizzlingFunctor.Identity1


@@ -36,26 +36,27 @@ Utility functions for checking constraints on kernels and calculating kernel att
import ctypes
from cutlass_library import DataTypeSize, OperationKind, SharedMemPerCC
import cutlass
from cutlass import DataTypeSize
from cutlass.backend.library import TileDescription
def calculate_smem_usage_per_stage(td: TileDescription, operation_kind: cutlass.OperationKind) -> int:
def calculate_smem_usage_per_stage(td: TileDescription, operation_kind: OperationKind) -> int:
"""
Returns the amount of shared memory in bytes consumed in a single stage of a kernel.
:param td: tile description to compute shared memory of
:type td: TileDescription
:param operation_kind: identifier for the type of operation being performed
:type operation_kind: cutlass.OperationKind
:type operation_kind: cutlass_library.OperationKind
:return: number of bytes of shared memory consumed by a single stage
:rtype: int
"""
m, n, k = td.threadblock_shape
if operation_kind == cutlass.OperationKind.Gemm:
if operation_kind == OperationKind.Gemm:
stage_barrier_bytes = 32
return (
(DataTypeSize[td.math_instruction.element_a] * m * k // 8)
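The per-stage shared-memory formula is truncated in this hunk; only the A-tile term and the 32-byte stage barrier are visible. A sketch consistent with those terms follows, where the B-tile term is an assumption extrapolated from the A-tile term, not confirmed by the excerpt:

```python
# Bit widths standing in for cutlass_library.DataTypeSize.
DATA_TYPE_SIZE = {"f16": 16, "f32": 32, "f64": 64}

def smem_per_stage_gemm(threadblock_shape, element_a, element_b,
                        stage_barrier_bytes=32):
    """Approximate shared memory (bytes) per GEMM mainloop stage:
    one m-by-k tile of A plus one k-by-n tile of B, plus a small
    barrier allocation. The B term is assumed by symmetry with the
    visible A term."""
    m, n, k = threadblock_shape
    bytes_a = DATA_TYPE_SIZE[element_a] * m * k // 8
    bytes_b = DATA_TYPE_SIZE[element_b] * k * n // 8
    return bytes_a + bytes_b + stage_barrier_bytes

# 128x128x8 f32 SIMT tile: 4 KB of A + 4 KB of B + 32 barrier bytes.
print(smem_per_stage_gemm((128, 128, 8), "f32", "f32"))  # 8224
```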
@@ -82,7 +83,8 @@ def valid_stage_count(
kernel_cc: int,
td: TileDescription,
element_C: cutlass.DataType = None,
element_D: cutlass.DataType = None) -> tuple:
element_D: cutlass.DataType = None,
verbose: bool = True) -> tuple:
"""
Checks whether a device with `cc` supports the number of stages within `tile_description`, both
based on raw limits on the number of stages and based on shared memory capacity
@@ -97,6 +99,8 @@
:type element_C: cutlass.DataType
:param element_D: data type of operand D
:type element_D: cutlass.DataType
:param verbose: whether to log warnings
:type verbose: bool
:return: tuple with the first element indicating whether the provided tile description is
valid for the provided device and the second element being an error message
@@ -107,7 +111,7 @@
# Stage count of None or 0 for SM90 indicates that the CollectiveBuilder automatically
# determines the stage count to use. Thus, all settings are valid in these scenarios.
return (True, "")
else:
elif verbose:
cutlass.logger.warning(
"Setting an explicit stage count for SM90 kernels currently may "
"result in compilation errors if the combination of tile shape, "
@@ -125,9 +129,9 @@
# only catches cases in which the mainloop exceeds the device's shared memory capacity.
# This is not a concern for CUTLASS 2.x kernels, for which the shared memory of the
# mainloop and epilogue is shared.
smem_per_stage = calculate_smem_usage_per_stage(td, cutlass.OperationKind.Gemm)
smem_per_stage = calculate_smem_usage_per_stage(td, OperationKind.Gemm)
smem_usage_mainloop = (smem_per_stage * td.stages)
smem_arch = cutlass.SharedMemPerCC[cc] << 10
smem_arch = SharedMemPerCC[cc] << 10
if smem_usage_mainloop > smem_arch:
return ( False,
"Configuration uses too much shared memory. Consider reducing stage count or tile shape.\n"
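The capacity check above multiplies the per-stage footprint by the stage count and compares against the device limit, which `SharedMemPerCC` records in KB (hence the `<< 10`). A standalone sketch with a hypothetical capacity:

```python
def valid_mainloop_stages(stages, smem_per_stage_bytes, smem_capacity_kb):
    """Check that `stages` mainloop stages fit in shared memory.
    Capacity is given in KB, as in SharedMemPerCC; returns a
    (valid, message) pair like valid_stage_count."""
    capacity = smem_capacity_kb << 10          # KB -> bytes
    usage = smem_per_stage_bytes * stages
    if usage > capacity:
        return (False,
                "Configuration uses too much shared memory. "
                "Consider reducing stage count or tile shape.")
    return (True, "")

# With a hypothetical 192 KB capacity, four 48 KB stages just fit.
print(valid_mainloop_stages(4, 48 << 10, 192))     # (True, '')
print(valid_mainloop_stages(5, 48 << 10, 192)[0])  # False
```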
@@ -214,7 +218,9 @@ def valid_schedule(
return (False, "Kernel and epilogue schedules must either both be auto or neither be auto")
if not tile_scheduler_default:
if (tile_scheduler == cutlass.TileSchedulerType.StreamK) and (kernel_schedule != cutlass.KernelScheduleType.TmaWarpSpecializedCooperative):
cooperative_kernels = [cutlass.KernelScheduleType.TmaWarpSpecializedCooperative,
cutlass.KernelScheduleType.CpAsyncWarpSpecializedCooperative]
if (tile_scheduler == cutlass.TileSchedulerType.StreamK) and (kernel_schedule not in cooperative_kernels):
return (False, "Stream-K tile scheduler is currently only supported with the cooperative kernel schedule")
return (True, "")
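The shared-memory capacity check in `valid_stage_count` above reduces to simple arithmetic: the per-compute-capability capacity (stored in KB, hence the `<< 10`) must cover `smem_per_stage * stages`. A minimal sketch with a hypothetical capacity table (the real values live in `cutlass.SharedMemPerCC`):

```python
# Hypothetical per-compute-capability shared memory capacities in KB;
# the real table is cutlass.SharedMemPerCC.
SHARED_MEM_PER_CC_KB = {80: 163, 90: 227}

def mainloop_fits(cc: int, smem_per_stage: int, stages: int) -> bool:
    """Return True if the mainloop's staged shared memory fits on `cc`."""
    smem_arch = SHARED_MEM_PER_CC_KB[cc] << 10  # KB -> bytes
    return smem_per_stage * stages <= smem_arch
```

For example, 5 stages of 32 KB fit within SM90's capacity, while 8 stages do not.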


@ -35,33 +35,55 @@ Utility functions for converting between frontend datatypes and CUTLASS datatype
"""
import cutlass
from cutlass import (
from cutlass_library import (
DataTypeSize,
MathOperation,
MathInstruction
)
from cutlass.backend.library import (
MathInstruction,
MathOperation,
TileDescription,
)
try:
import numpy as np
bfloat16_available = None
cupy_available = None
numpy_available = None
torch_available = None
_library_to_cupy_dict = None
_library_to_numpy_dict = None
_library_to_torch_dict = None
_torch_to_library_dict = None
numpy_available = True
_library_to_numpy_dict = {
cutlass.DataType.f16: np.float16,
cutlass.DataType.f32: np.float32,
cutlass.DataType.f64: np.float64,
cutlass.DataType.s8: np.int8,
cutlass.DataType.s32: np.int32,
}
except ImportError:
numpy_available = False
_library_to_numpy_dict = {}
def is_numpy_available():
global numpy_available, _library_to_numpy_dict
if numpy_available is None:
try:
import numpy as np
numpy_available = True
_library_to_numpy_dict = {
cutlass.DataType.f16: np.float16,
cutlass.DataType.f32: np.float32,
cutlass.DataType.f64: np.float64,
cutlass.DataType.s8: np.int8,
cutlass.DataType.s32: np.int32,
}
except ImportError:
numpy_available = False
_library_to_numpy_dict = {}
return numpy_available
def is_numpy_tensor(inp) -> bool:
if is_numpy_available():
import numpy as np
return isinstance(inp, np.ndarray)
return False
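The `is_numpy_available` helper above replaces the old module-level `try: import numpy` with a lazy, cached check so the import cost is paid only on first use. The pattern generalizes; a minimal sketch using `json` as a stand-in for an optional dependency:

```python
_dep_available = None  # None means "not yet checked"

def is_dep_available() -> bool:
    """Attempt the import on the first call only and cache the result."""
    global _dep_available
    if _dep_available is None:
        try:
            import json  # stand-in for an optional third-party dependency
            _dep_available = True
        except ImportError:
            _dep_available = False
    return _dep_available
```

Subsequent calls return the cached boolean without re-attempting the import.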
def numpy_library_type(inp) -> cutlass.DataType:
if numpy_available:
if is_numpy_available():
import numpy as np
if inp == np.float16:
return cutlass.DataType.f16
elif inp == np.float32:
@ -79,24 +101,36 @@ def numpy_type(inp):
return _library_to_numpy_dict.get(inp, None)
try:
import cupy as cp
def is_cupy_available():
global cupy_available
if cupy_available is None:
try:
import cupy as cp
cupy_available = True
_library_to_cupy_dict = {
cutlass.DataType.f16: cp.float16,
cutlass.DataType.f32: cp.float32,
cutlass.DataType.f64: cp.float64,
cutlass.DataType.s8: cp.int8,
cutlass.DataType.s32: cp.int32,
}
except ImportError:
cupy_available = False
_library_to_cupy_dict = {}
cupy_available = True
_library_to_cupy_dict = {
cutlass.DataType.f16: cp.float16,
cutlass.DataType.f32: cp.float32,
cutlass.DataType.f64: cp.float64,
cutlass.DataType.s8: cp.int8,
cutlass.DataType.s32: cp.int32,
}
except ImportError:
cupy_available = False
_library_to_cupy_dict = {}
return cupy_available
def is_cupy_tensor(inp) -> bool:
if is_cupy_available():
import cupy as cp
return isinstance(inp, cp.ndarray)
return False
def cupy_library_type(inp) -> cutlass.DataType:
if cupy_available:
if is_cupy_available():
import cupy as cp
if inp == cp.float16:
return cutlass.DataType.f16
elif inp == cp.float32:
@ -110,39 +144,50 @@ def cupy_type(inp):
return _library_to_cupy_dict.get(inp, None)
try:
import torch
def is_torch_available():
global torch_available, _library_to_torch_dict, _torch_to_library_dict
if torch_available is None:
try:
import torch
torch_available = True
_torch_to_library_dict = {
torch.half: cutlass.DataType.f16,
torch.float16: cutlass.DataType.f16,
torch.bfloat16: cutlass.DataType.bf16,
torch.float: cutlass.DataType.f32,
torch.float32: cutlass.DataType.f32,
torch.double: cutlass.DataType.f64,
torch.float64: cutlass.DataType.f64,
torch.int8: cutlass.DataType.s8,
torch.int32: cutlass.DataType.s32,
torch.uint8: cutlass.DataType.u8,
}
torch_available = True
_torch_to_library_dict = {
torch.half: cutlass.DataType.f16,
torch.float16: cutlass.DataType.f16,
torch.bfloat16: cutlass.DataType.bf16,
torch.float: cutlass.DataType.f32,
torch.float32: cutlass.DataType.f32,
torch.double: cutlass.DataType.f64,
torch.float64: cutlass.DataType.f64,
torch.int8: cutlass.DataType.s8,
torch.int32: cutlass.DataType.s32,
torch.uint8: cutlass.DataType.u8,
}
_library_to_torch_dict = {
cutlass.DataType.f16: torch.half,
cutlass.DataType.f16: torch.float16,
cutlass.DataType.bf16: torch.bfloat16,
cutlass.DataType.f32: torch.float,
cutlass.DataType.f32: torch.float32,
cutlass.DataType.f64: torch.double,
cutlass.DataType.f64: torch.float64,
cutlass.DataType.s8: torch.int8,
cutlass.DataType.s32: torch.int32,
cutlass.DataType.u8: torch.uint8,
}
except ImportError:
torch_available = False
_torch_to_library_dict = {}
_library_to_torch_dict = {}
_library_to_torch_dict = {
cutlass.DataType.f16: torch.half,
cutlass.DataType.f16: torch.float16,
cutlass.DataType.bf16: torch.bfloat16,
cutlass.DataType.f32: torch.float,
cutlass.DataType.f32: torch.float32,
cutlass.DataType.f64: torch.double,
cutlass.DataType.f64: torch.float64,
cutlass.DataType.s8: torch.int8,
cutlass.DataType.s32: torch.int32,
cutlass.DataType.u8: torch.uint8,
}
except ImportError:
torch_available = False
_torch_to_library_dict = {}
_library_to_torch_dict = {}
return torch_available
def is_torch_tensor(inp) -> bool:
if is_torch_available():
import torch
return isinstance(inp, torch.Tensor)
return False
def torch_library_type(inp) -> cutlass.DataType:
@ -153,28 +198,35 @@ def torch_type(inp):
return _library_to_torch_dict.get(inp, None)
try:
import bfloat16
def is_bfloat16_available():
global bfloat16_available
bfloat16_available = True
except ImportError:
bfloat16_available = False
if bfloat16_available is None:
try:
import bfloat16
bfloat16_available = True
except ImportError:
bfloat16_available = False
return bfloat16_available
def bfloat16_library_type(inp) -> cutlass.DataType:
if bfloat16_available:
if is_bfloat16_available():
import bfloat16
if inp == bfloat16.bfloat16:
return cutlass.DataType.bf16
def bfloat16_type(inp):
if bfloat16_available:
if is_bfloat16_available():
import bfloat16
if inp == cutlass.DataType.bf16:
return bfloat16.bfloat16
def library_type(inp):
if inp in cutlass.DataTypeSize.keys():
if inp in DataTypeSize:
return inp
for cvt_fn in [
@ -205,23 +257,20 @@ def _tensor_from_torch(pt_tensor):
def get_datatype_and_layout(tensor):
if (numpy_available and isinstance(tensor, np.ndarray)) or (
cupy_available and isinstance(tensor, cp.ndarray)
):
if (is_numpy_tensor(tensor) or is_cupy_tensor(tensor)):
return _tensor_from_numpy(tensor)
elif torch_available and isinstance(tensor, torch.Tensor):
elif is_torch_tensor(tensor):
return _tensor_from_torch(tensor)
elif isinstance(tensor, float) or isinstance(tensor, int):
return (cutlass.DataType.f32, cutlass.LayoutType.RowMajor)
else:
raise Exception(f"Unable to convert tensor of type {type(tensor)} to Python-bound CUTLASS datatype and layout.")
def get_tensor_shape(tensor, op="GEMM"):
if (numpy_available and isinstance(tensor, np.ndarray)) or (
cupy_available and isinstance(tensor, cp.ndarray)
):
if (is_numpy_tensor(tensor) or is_cupy_tensor(tensor)):
return tensor.shape
elif torch_available and isinstance(tensor, torch.Tensor):
elif is_torch_tensor(tensor):
size = tensor.size()
if op == "CONV":
# PyTorch Tensors have shape NCHW
@ -237,7 +286,7 @@ def get_tensor_shape(tensor, op="GEMM"):
_math_operation_value_map = {x.value: x for x in MathOperation}
def backend_math_operation(math_op: cutlass.MathOperation):
def backend_math_operation(math_op: MathOperation):
if math_op.value not in _math_operation_value_map.keys():
raise Exception(f"Unable to convert math operation of type {math_op} to backend math operation.")
return _math_operation_value_map[math_op.value]


@ -39,12 +39,12 @@ import subprocess
from cuda import cuda, cudart
import numpy as np
import torch
from cutlass import CUTLASS_PATH
from cutlass.backend.library import DataTypeSize
from cutlass.op.op import OperationBase
from cutlass.shape import GemmCoord
from cutlass.utils.datatypes import is_numpy_tensor
class GpuTimer:


@ -30,6 +30,7 @@
#
#################################################################################################
import os
import sys
from . import conv2d_operation
@ -47,3 +48,16 @@ from . import rank_2k_operation
from . import rank_k_operation
from . import symm_operation
from . import trmm_operation
# Make enum types from library.py accessible via cutlass_library.*
from .library import *
# Set up `source` to point to the path containing the CUTLASS source.
# Check first if the path contains a `source` subdirectory -- this will
# be the case when the package has been installed via pip. Otherwise,
# default to the root of CUTLASS.
install_source_path = os.path.join(__path__[0], 'source')
if os.path.isdir(install_source_path):
source_path = install_source_path
else:
source_path = os.path.join(__path__[0], '../..')
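The path selection above can be factored into a small helper for illustration (`resolve_source_path` is not part of the package; it just restates the logic):

```python
import os

def resolve_source_path(package_path: str) -> str:
    # Prefer a bundled `source` subdirectory (present for pip installs);
    # otherwise fall back to the repository root two levels up.
    install_source_path = os.path.join(package_path, 'source')
    if os.path.isdir(install_source_path):
        return install_source_path
    return os.path.join(package_path, '../..')
```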


@ -38,7 +38,13 @@ import enum
import os.path
import shutil
from cutlass_library.library import *
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@ -62,11 +68,6 @@ class Conv2dOperation:
self.stride_support = stride_support
self.swizzling_functor = swizzling_functor
self.group_mode = group_mode
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def is_complex(self):
complex_operators = [
@ -75,6 +76,10 @@ class Conv2dOperation:
]
return self.tile_description.math_instruction.math_operation in complex_operators
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def accumulator_type(self):
accum = self.tile_description.math_instruction.element_accumulator
@ -262,7 +267,7 @@ class EmitConv2dInstance:
1,
${threadblock_output_shape_n},
${threadblock_output_shape_p},
${threadblock_output_shape_q}>,
${threadblock_output_shape_q}>,
${stages},
${math_operator},
${iterator_algorithm},


@ -38,7 +38,13 @@ import enum
import os.path
import shutil
from cutlass_library.library import *
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@ -60,11 +66,11 @@ class Conv3dOperation:
self.iterator_algorithm = iterator_algorithm
self.stride_support = stride_support
self.swizzling_functor = swizzling_functor
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def core_name(self):
''' The basic operation kind is prefixed with a letter indicating the accumulation type. '''


@ -34,14 +34,20 @@
Utilities for emitting GEMM kernels
"""
import collections
import enum
import os.path
import shutil
import functools
import operator
import collections
import os.path
import shutil
from cutlass_library.library import *
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
#
@ -55,9 +61,14 @@ class GemmOperation:
def __init__(self, gemm_kind, arch, tile_description, A, B, C, element_epilogue, \
epilogue_functor = EpilogueFunctor.LinearCombination, swizzling_functor = SwizzlingFunctor.Identity8, D = None,
kernel_schedule = KernelScheduleType.ScheduleAuto, epilogue_schedule = EpilogueScheduleType.ScheduleAuto,
tile_scheduler = TileSchedulerType.Default):
tile_scheduler = TileSchedulerType.Default, extra_args = None):
self.prefix = "3x" if gemm_kind == GemmKind.Universal3x else ""
kinds_3x = {
GemmKind.Universal3x,
GemmKind.SparseUniversal3x,
}
self.is_3x = gemm_kind in kinds_3x
self.prefix = "3x" if self.is_3x else ""
self.operation_kind = OperationKind.Gemm
self.arch = arch
self.tile_description = tile_description
@ -66,10 +77,11 @@ class GemmOperation:
self.B = B
self.C = C
self.D = D
if self.D == None:
self.D = self.C
if gemm_kind != GemmKind.Universal3x:
if not self.is_3x:
assert(kernel_schedule == KernelScheduleType.ScheduleAuto)
assert(epilogue_schedule == EpilogueScheduleType.ScheduleAuto)
self.kernel_schedule = kernel_schedule
@ -91,7 +103,7 @@ class GemmOperation:
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def is_planar_complex(self):
return self.gemm_kind in (GemmKind.PlanarComplex, GemmKind.PlanarComplexArray)
@ -125,13 +137,20 @@ class GemmOperation:
MathOperation.and_popc: 'and'
}
if self.tile_description.math_instruction.opcode_class == OpcodeClass.TensorOp or \
self.tile_description.math_instruction.opcode_class == OpcodeClass.WmmaTensorOp:
tensor_ops = [
OpcodeClass.TensorOp,
OpcodeClass.WmmaTensorOp,
OpcodeClass.SparseTensorOp,
]
is_tensor_op = self.tile_description.math_instruction.opcode_class in tensor_ops
if is_tensor_op:
math_op = self.tile_description.math_instruction.math_operation
math_op_string = math_operations_map[math_op] if math_op in math_operations_map.keys() else ''
if self.gemm_kind == GemmKind.Universal3x:
if self.is_3x:
inst_shape = "{0}x{1}x{2}".format(*tuple(self.tile_description.math_instruction.instruction_shape))
else:
inst_shape = "{0}{1}{2}".format(*tuple(self.tile_description.math_instruction.instruction_shape))
@ -183,6 +202,16 @@ class GemmOperation:
core_name = self.core_name())
return extended_name
def datatype_name_3x(self):
'''Generates a string representing the MMA atom. Assumes accumulator type is C type.'''
datatype_name = "{element_a}_{element_b}_{element_acc}_{element_c}_{element_d}".format(
element_a = DataTypeNames[self.A.element],
element_b = DataTypeNames[self.B.element],
element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator],
element_c = DataTypeNames[self.C.element],
element_d = DataTypeNames[self.D.element])
return datatype_name
# Generates a short string representing the AB layout tags (e.g. nt or tn)
def layout_name(self):
if self.is_complex() or self.is_planar_complex():
@ -213,6 +242,10 @@ class GemmOperation:
def epilogue_schedule_name_3x(self):
return EpilogueScheduleSuffixes[self.epilogue_schedule]
# Generate a short string representing the operation class
def opcode_class_name(self):
return OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
# Generates the full kernel function name
def procedural_name(self):
''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
@ -661,7 +694,6 @@ ${compile_guard_end}
###################################################################################################
#
class EmitGemmUniversal3xInstance:
''' Responsible for emitting a CUTLASS 3.x template definition'''
@ -687,10 +719,10 @@ class EmitGemmUniversal3xInstance:
using ${operation_name}_epilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
${arch}, ${opcode_class},
${arch}, ${opcode_class_epi},
cute::Shape<cute::_${tile_shape_m}, cute::_${tile_shape_n}, cute::_${tile_shape_k}>,
cute::Shape<cute::_${cluster_m},cute::_${cluster_n},cute::_${cluster_k}>,
cutlass::epilogue::collective::EpilogueTileAuto,
${epi_tile_mn},
${element_accumulator}, ${element_epilogue},
${element_c}, ${layout_c}, ${align_c},
${element_d}, ${layout_d}, ${align_d},
@ -699,7 +731,7 @@ using ${operation_name}_epilogue =
using ${operation_name}_mainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
${arch}, ${opcode_class},
${arch}, ${opcode_class_main},
${element_a}, ${layout_a}, ${align_a},
${element_b}, ${layout_b}, ${align_b},
${element_accumulator},
@ -743,6 +775,10 @@ ${compile_guard_end}
stage_count_string = f"cutlass::gemm::collective::StageCountAutoCarveout<sizeof(typename {str(operation.procedural_name())}_epilogue::SharedStorage)>"
warp_shape = [tile_shape[idx] // warp_count[idx] for idx in range(3)]
epi_tile_mn = "cutlass::epilogue::collective::EpilogueTileAuto"
opcode_class_main = operation.tile_description.math_instruction.opcode_class
opcode_class_epi = opcode_class_main
instance_layout_A, instance_layout_B, instance_layout_C , instance_layout_D = \
(operation.A.layout, operation.B.layout, operation.C.layout, operation.D.layout)
@ -760,20 +796,23 @@ ${compile_guard_end}
else:
epilogue_functor = self.epilogue_functor.emit_declaration()
#
element_a = DataTypeTag[operation.A.element]
element_b = DataTypeTag[operation.B.element]
epilogue_schedule_type = EpilogueScheduleTag[operation.epilogue_schedule]
values = {
'operation_name': operation.procedural_name(),
'operation_suffix': self.operation_suffix,
'element_a': DataTypeTag[operation.A.element],
'element_a': element_a,
'layout_a': LayoutTag[instance_layout_A],
'element_b': DataTypeTag[operation.B.element],
'element_b': element_b,
'layout_b': LayoutTag[instance_layout_B],
'element_c': DataTypeTag[operation.C.element],
'layout_c': LayoutTag[instance_layout_C],
'element_d': DataTypeTag[operation.D.element],
'layout_d': LayoutTag[instance_layout_D],
'element_accumulator': DataTypeTag[operation.accumulator_type()],
'opcode_class': OpcodeClassTag[operation.tile_description.math_instruction.opcode_class],
'opcode_class_main': OpcodeClassTag[opcode_class_main],
'opcode_class_epi': OpcodeClassTag[opcode_class_epi],
'arch': "cutlass::arch::Sm%d" % operation.arch,
'tile_shape_m': str(operation.tile_description.tile_shape[0]),
'tile_shape_n': str(operation.tile_description.tile_shape[1]),
@ -788,7 +827,8 @@ ${compile_guard_end}
'instruction_shape_n': str(operation.tile_description.math_instruction.instruction_shape[1]),
'instruction_shape_k': str(operation.tile_description.math_instruction.instruction_shape[2]),
'kernel_schedule' : str(KernelScheduleTag[operation.kernel_schedule]),
'epilogue_schedule' : str(EpilogueScheduleTag[operation.epilogue_schedule]),
'epilogue_schedule' : str(epilogue_schedule_type),
'epi_tile_mn' : epi_tile_mn,
'epilogue_functor': epilogue_functor,
'stages': stage_count_string,
'align_a': str(operation.A.alignment),
@ -800,7 +840,7 @@ ${compile_guard_end}
'math_operation': MathOperationTag[operation.tile_description.math_instruction.math_operation],
'epilogue_vector_length': str(epilogue_vector_length),
'element_epilogue': str(DataTypeTag[operation.element_epilogue]),
'tile_scheduler': str(TileSchedulerTag[operation.tile_scheduler])
'tile_scheduler': str(TileSchedulerTag[operation.tile_scheduler]),
}
return SubstituteTemplate(self.gemm_template, values)
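`SubstituteTemplate` fills the `${...}` placeholders in the emitted C++ template text with the `values` dictionary built above. A minimal stand-in using Python's `string.Template` (the real helper lives in cutlass_library and may differ in detail):

```python
from string import Template

def substitute_template(template: str, values: dict) -> str:
    # safe_substitute leaves unknown ${...} keys untouched instead of raising,
    # which is convenient when templates are filled incrementally.
    return Template(template).safe_substitute(values)
```

For instance, substituting `operation_name` into `"using ${operation_name}_epilogue;"` yields the concrete type alias line of the emitted instance.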


@ -34,16 +34,52 @@
Utilities for enumerating CUTLASS library kernels
"""
import argparse
import enum
from itertools import product
import logging
import os.path
import shutil
import argparse
import logging
from cutlass_library.library import *
from cutlass_library.manifest import *
from itertools import product
import sys
# Certain use cases of cutlass_library nearly always prefer to run as scripts with
# relative imports, rather than via an installed Python package. An example of this
# is using CUTLASS's CMake system to generate a library of kernels to be profiled.
# To support these use cases even when an installation of cutlass_library
# exists, this global flag can be set to true (via a command-line argument) to ensure
# that the package-based installation is not used.
# Create a temporary argument parser to check only for the availability of the
# --disable-cutlass-package-imports argument, which controls whether package-based
# imports are disabled.
def _add_package_disablement_flag(argparser):
argparser.add_argument("--disable-cutlass-package-imports", action='store_true', required=False,
help="Disable use of cutlass_library from Python package")
_parser = argparse.ArgumentParser()
_add_package_disablement_flag(_parser)
_args, _ = _parser.parse_known_args()
# Add `CUTLASS_IGNORE_PACKAGE` to `builtins` so that it is visible for gating future
# imports without requiring importing another module. Ideally, we would just place this
# as a global variable in a module that could be imported and checked (e.g.,
# utils.CUTLASS_IGNORE_PACKAGE). However, this raises the issue of determining
# where this module should be sourced (from the cutlass_library package or from
# a relative import), which is the problem this variable is being used to solve in the
# first place.
import builtins
builtins.CUTLASS_IGNORE_PACKAGE = _args.disable_cutlass_package_imports
try:
if CUTLASS_IGNORE_PACKAGE:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
from cutlass_library.manifest import *
except ImportError:
from library import *
from manifest import *
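The `builtins`-based gating used above can be demonstrated in isolation; `DEMO_IGNORE_PACKAGE` and the returned strings below are illustrative only, not part of CUTLASS:

```python
import builtins

# Setting an attribute on `builtins` makes the flag visible from every
# module without deciding where a shared config module would live --
# the exact problem described in the comment above.
builtins.DEMO_IGNORE_PACKAGE = True

def choose_import_style() -> str:
    try:
        if getattr(builtins, "DEMO_IGNORE_PACKAGE", False):
            raise ImportError("package imports disabled")
        return "package"   # would be: from cutlass_library.library import *
    except ImportError:
        return "relative"  # would be: from library import *
```

Flipping the flag switches which import path the try/except fallback selects.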
###################################################################################################
#
@ -79,7 +115,7 @@ def EpilogueAlignment(max_alignment, tile, epilogue_steps = 8):
return min(max_alignment, elements_per_thread)
def DefaultSwizzlingFunctor():
return SwizzlingFunctor.Identity8;
return SwizzlingFunctor.Identity8
# To use StreamK decomposition for basic GEMMs, set `swizzling_functor = SwizzlingFunctor.StreamK`
#
@ -103,7 +139,7 @@ def CreateGemmOperator(manifest, layouts, tile_descriptions, data_type, \
for tile_description in tile_descriptions:
for alignment in alignment_constraints:
for complex_transform in complex_transforms:
# If alignment is a tuple or a list, then we have different alignments for A and B
alignment_a = alignment if isinstance(alignment, int) else alignment[0]
alignment_b = alignment if isinstance(alignment, int) else alignment[1]
@ -121,7 +157,6 @@ def CreateGemmOperator(manifest, layouts, tile_descriptions, data_type, \
return operations
# Generates 3.0 API based GemmUniversal API kernels. Alignment constraints are folded in with layouts
def CreateGemmUniversal3xOperator(
manifest, layouts, tile_descriptions, data_types,
@ -157,11 +192,14 @@ def CreateGemmUniversal3xOperator(
C = TensorDescription(data_type["c_type"], layout[2][0], layout[2][1])
D = TensorDescription(data_type["d_type"], layout[2][0], layout[2][1])
extra_args = {}
gemm_kind = GemmKind.Universal3x
element_compute = data_type.get("epi_type", data_type["acc_type"])
operation = GemmOperation(
GemmKind.Universal3x, tile_description.minimum_compute_capability,
gemm_kind, tile_description.minimum_compute_capability,
tile_description, A, B, C, element_compute, epilogue_functor, swizzling_functor, D,
kernel_schedule, epilogue_schedule, tile_scheduler)
kernel_schedule, epilogue_schedule, tile_scheduler, extra_args)
manifest.append(operation)
operations.append(operation)
@ -2153,7 +2191,6 @@ def GenerateSM80_PlanarComplexTensorOp_16816(manifest, cuda_version):
CreateGemmPlanarComplexOperator(manifest, layouts, tile_descriptions, \
data_type_mixed, alignment_constraints, complex_transforms)
#
def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
@ -2225,8 +2262,9 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
math_inst.element_accumulator,
]
# streamk uses more regs which can cause spill for the biggest warp tile size when the accumulators are 32bit.
operations = CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type, alignment_constraints)
data_type, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
# Avoid emitting two kernels if the accumulator type does not differ from the input type (e.g. F16 accumulation)
if math_inst.element_a != math_inst.element_accumulator:
@ -2239,14 +2277,13 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
]
operations += CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type_mixed, alignment_constraints)
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
for op in operations:
if (DataTypeSize[op.C.element] == 16) and \
(op.tile_description.threadblock_shape[1] <= 32):
op.C.alignment = 4
#
def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
@ -2287,8 +2324,7 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
# inner list contains the alignment constraints for operands/matrices
# [[alignA, alignB, alignC],..]
alignment_constraints = [[8, 16, 8],]
for math_inst in math_instructions:
tile_descriptions = [
# 128x128
@ -2321,8 +2357,9 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
math_inst.element_accumulator,
]
# streamk uses more regs which can cause spill for the biggest warp tile size when the accumulators are 32bit.
CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type, alignment_constraints)
data_type, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
# Avoid emitting two kernels if the accumulator type does not differ from the input type (e.g. F16 accumulation)
if math_inst.element_a != math_inst.element_accumulator:
@ -2335,12 +2372,12 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
]
operations = CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type_mixed, alignment_constraints)
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
for op in operations:
if op.tile_description.threadblock_shape[1] <= 32:
op.C.alignment = 4
#
def GenerateSM80_TensorOp_16832_TN(manifest, cuda_version):
@ -2723,6 +2760,7 @@ def GenerateSM80_TensorOp_16864_Interleaved(manifest, cuda_version):
for op in operations:
op.C.alignment = 16
#
#
def GenerateSM80_TensorOp_168256(manifest, cuda_version):
@ -4458,6 +4496,154 @@ def GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version):
[[KernelScheduleType.TmaWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
tile_schedulers=[TileSchedulerType.StreamK])
#
def GenerateSM90_TensorOp_16b_WGMMA_alignx_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
# layouts for ABC and their alignments.
layouts = [
[[LayoutType.RowMajor, 4], [LayoutType.ColumnMajor, 4], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 4], [LayoutType.RowMajor, 4], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 4], [LayoutType.ColumnMajor, 4], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 4], [LayoutType.RowMajor, 4], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 2], [LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 2], [LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 1]],
]
math_instructions = [
MathInstruction(
[64, 128, 16],
DataType.f16, DataType.f16, DataType.f16,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 16],
DataType.f16, DataType.f16, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 16],
DataType.bf16, DataType.bf16, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
]
min_cc = 90
max_cc = 90
for math_inst in math_instructions:
tile_descriptions_small = [
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
]
tile_descriptions_medium = [
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1]*2, math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
]
tile_descriptions = tile_descriptions_small + tile_descriptions_medium
data_type = {
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : math_inst.element_accumulator,
"d_type" : math_inst.element_accumulator,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
}
# Set alignment c based on Destination format.
for layout in layouts:
if data_type["c_type"] in [DataType.s32, DataType.f32]:
layout[2][1] = 4
elif data_type["c_type"] in [DataType.f16, DataType.bf16]:
layout[2][1] = 8
schedules = [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.NoSmemWarpSpecialized]
]
stream_k_schedules = []
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
schedules += [
[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.NoSmemWarpSpecialized]
]
stream_k_schedules += [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]]
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, schedules)
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
# Add stream-K variants
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, stream_k_schedules, tile_schedulers=[TileSchedulerType.StreamK])
# persistent kernels with TMA epilogues
# if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.TmaWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]])
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
# tile_schedulers=[TileSchedulerType.StreamK])
# # Emit instance without C allocation + load
# data_type["c_type"] = DataType.void
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.TmaWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]])
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
# tile_schedulers=[TileSchedulerType.StreamK])
# for mixed precision kernels, also generate kernels that write output matrix in the A/B format
# Avoid emitting two kernels if the accumulator type does not differ from the input type (e.g. F16 accumulation)
if math_inst.element_a != math_inst.element_accumulator:
data_type_mixed = {
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : math_inst.element_a,
"d_type" : math_inst.element_a,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
}
# Set alignment c based on Destination format.
for layout in layouts:
if data_type_mixed["c_type"] in [DataType.s32, DataType.f32]:
layout[2][1] = 4
elif data_type_mixed["c_type"] in [DataType.f16, DataType.bf16]:
layout[2][1] = 8
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed, schedules)
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed, stream_k_schedules, tile_schedulers=[TileSchedulerType.StreamK])
# persistent kernels with TMA epilogues
# if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed,
# [[KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.TmaWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]])
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
# tile_schedulers=[TileSchedulerType.StreamK])
# # Emit instance without C allocation+load
# data_type_mixed["c_type"] = DataType.void
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed,
# [[KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.TmaWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]])
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type_mixed,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
# tile_schedulers=[TileSchedulerType.StreamK])
#
def GenerateSM90_TensorOp_tf32_WGMMA_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
@@ -4582,6 +4768,91 @@ def GenerateSM90_TensorOp_tf32_WGMMA_gemm(manifest, cuda_version):
CreateGemmUniversal3xOperator(manifest, layouts_tf32_tn_nn_nt, tile_descriptions, data_types, schedules_default)
CreateGemmUniversal3xOperator(manifest, layouts_tf32_tt, tile_descriptions, data_types, schedules_transposed_epilogue)
#
def GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
# layouts for ABC and their alignments.
layouts = [
[[LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 2], [LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 2], [LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 1], [LayoutType.ColumnMajor, 1], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 1], [LayoutType.RowMajor, 1], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 1], [LayoutType.ColumnMajor, 1], [LayoutType.ColumnMajor, 1]],
[[LayoutType.ColumnMajor, 1], [LayoutType.RowMajor, 1], [LayoutType.ColumnMajor, 1]],
]
math_inst = MathInstruction(
[64, 128, 8],
DataType.tf32, DataType.tf32, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add)
min_cc = 90
max_cc = 90
tile_descriptions_medium = [
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
]
tile_descriptions_small = [
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
]
tile_descriptions = tile_descriptions_medium + tile_descriptions_small
data_types = [
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : math_inst.element_accumulator,
"d_type" : math_inst.element_accumulator,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : DataType.f32,
"b_type" : DataType.f32,
"c_type" : math_inst.element_accumulator,
"d_type" : math_inst.element_accumulator,
"acc_type" : math_inst.element_accumulator,
"epi_type" : DataType.f32
}
]
is_tt_layout = lambda v: v[0][0] == LayoutType.RowMajor and v[1][0] == LayoutType.RowMajor
# Split kernels into TN/NT, NN or TT layouts
# Materialize as lists: layouts_tn_nn_nt is reused for several schedule sets
# below, and a bare filter iterator would be exhausted after its first use.
layouts_tn_nn_nt = list(filter(lambda v: not is_tt_layout(v), layouts))
layouts_tt = list(filter(is_tt_layout, layouts))
CreateGemmUniversal3xOperator(manifest, layouts_tn_nn_nt, tile_descriptions, data_types, [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.NoSmemWarpSpecialized],
])
# Kernels with TT layout use EpilogueTransposed (NoSmemWarpSpecialized with swapped strides),
# because they run NN kernels underneath, and transposing the epilogue yields the correct output.
CreateGemmUniversal3xOperator(manifest, layouts_tt, tile_descriptions, data_types, [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.EpilogueTransposed],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.EpilogueTransposed]
])
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
CreateGemmUniversal3xOperator(manifest, layouts_tn_nn_nt, tile_descriptions, data_types, [
# [KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.NoSmemWarpSpecialized],
[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]
])
# Stream-K schedules
CreateGemmUniversal3xOperator(manifest, layouts_tn_nn_nt, tile_descriptions, data_types, [
[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]
], tile_schedulers=[TileSchedulerType.StreamK])
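The TT/non-TT split above can be illustrated with a self-contained sketch (a stand-in `LayoutType` enum is assumed; the real one lives in `cutlass_library.library`):

```python
from enum import Enum, auto

class LayoutType(Enum):  # stand-in for cutlass_library's LayoutType
    RowMajor = auto()
    ColumnMajor = auto()

layouts = [
    [[LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 2], [LayoutType.ColumnMajor, 1]],  # TN
    [[LayoutType.RowMajor, 2], [LayoutType.RowMajor, 2], [LayoutType.ColumnMajor, 1]],     # TT
]

# A layout is "TT" when both A and B are row-major.
is_tt_layout = lambda v: v[0][0] == LayoutType.RowMajor and v[1][0] == LayoutType.RowMajor

# List comprehensions (rather than bare filter iterators) so both results can
# be passed to several emitter calls without being exhausted.
layouts_tt = [v for v in layouts if is_tt_layout(v)]
layouts_other = [v for v in layouts if not is_tt_layout(v)]

assert layouts_tt == [layouts[1]]
assert layouts_other == [layouts[0]]
```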
#
def GenerateSM90_TensorOp_int8_WGMMA_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
@@ -4677,6 +4948,81 @@ def GenerateSM90_TensorOp_int8_WGMMA_gemm(manifest, cuda_version):
tile_schedulers=[TileSchedulerType.Persistent, TileSchedulerType.StreamK]
)
#
def GenerateSM90_TensorOp_int8_WGMMA_alignx_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
# layouts for ABC and their alignments
layouts = [
[[LayoutType.RowMajor, 8], [LayoutType.ColumnMajor, 8], [LayoutType.ColumnMajor, 1]],
[[LayoutType.RowMajor, 4], [LayoutType.ColumnMajor, 4], [LayoutType.ColumnMajor, 1]],
]
math_instructions = [
MathInstruction(
[64, 128, 32],
DataType.s8, DataType.s8, DataType.s32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 32],
DataType.u8, DataType.u8, DataType.s32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
]
min_cc = 90
max_cc = 90
for math_inst in math_instructions:
tile_descriptions_small = [
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
]
tile_descriptions_medium = [
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
]
tile_descriptions = tile_descriptions_medium + tile_descriptions_small
data_types = [
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : math_inst.element_accumulator,
"d_type" : math_inst.element_accumulator,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.s8,
"d_type" : math_inst.element_a,
"acc_type" : math_inst.element_accumulator,
"epi_type" : DataType.f32
}
]
for data_type in data_types:
for layout in layouts:
layout[2][1] = 128 // DataTypeSize[data_type["d_type"]]
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.NoSmemWarpSpecialized],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.NoSmemWarpSpecialized]
])
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, [
# [KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.NoSmemWarpSpecialized],
[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]
])
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
[[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]],
tile_schedulers=[TileSchedulerType.StreamK])
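The C-alignment assignment in the loop above (`layout[2][1] = 128 // DataTypeSize[...]`) follows the 128-bit access rule: alignment in elements equals 128 divided by the element width in bits. A small sketch (the `DataTypeSize` subset below is an assumption):

```python
# Assumed subset of the DataTypeSize table (bits per element).
data_type_size = {"s8": 8, "f16": 16, "bf16": 16, "s32": 32, "f32": 32}

def alignment_c(d_type):
    # 128-bit (16-byte) accesses: alignment counted in elements of the D type.
    return 128 // data_type_size[d_type]

assert alignment_c("s8") == 16   # int8 output: 16-element alignment
assert alignment_c("f16") == 8   # matches layout[2][1] = 8 for f16/bf16 above
assert alignment_c("f32") == 4   # matches layout[2][1] = 4 for f32/s32 above
```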
#
def GenerateSM90_TensorOp_fp8_WGMMA_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
@@ -4882,6 +5228,188 @@ def GenerateSM90_TensorOp_fp8_WGMMA_gemm(manifest, cuda_version):
[KernelScheduleType.TmaWarpSpecializedCooperativeFP8FastAccum, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
tile_schedulers=[TileSchedulerType.StreamK])
#
def GenerateSM90_TensorOp_fp8_WGMMA_alignx_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
# layouts for ABC and their alignments
layouts = [
[[LayoutType.RowMajor, 8], [LayoutType.ColumnMajor, 8], [LayoutType.ColumnMajor, 1]], # TN Layout
[[LayoutType.RowMajor, 4], [LayoutType.ColumnMajor, 4], [LayoutType.ColumnMajor, 1]], # TN Layout
]
math_instructions = [
# inst 64x128x32
MathInstruction(
[64, 128, 32],
DataType.e4m3, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 32],
DataType.e4m3, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 32],
DataType.e5m2, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[64, 128, 32],
DataType.e5m2, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
# inst 64x64x32
# MathInstruction(
# [64, 64, 32],
# DataType.e4m3, DataType.e4m3, DataType.f32,
# OpcodeClass.TensorOp,
# MathOperation.multiply_add),
# MathInstruction(
# [64, 64, 32],
# DataType.e4m3, DataType.e5m2, DataType.f32,
# OpcodeClass.TensorOp,
# MathOperation.multiply_add),
# MathInstruction(
# [64, 64, 32],
# DataType.e5m2, DataType.e4m3, DataType.f32,
# OpcodeClass.TensorOp,
# MathOperation.multiply_add),
# MathInstruction(
# [64, 64, 32],
# DataType.e5m2, DataType.e5m2, DataType.f32,
# OpcodeClass.TensorOp,
# MathOperation.multiply_add),
]
min_cc = 90
max_cc = 90
for math_inst in math_instructions:
data_types = [
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f32,
"d_type" : DataType.f32,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f32,
"d_type" : DataType.e4m3,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f32,
"d_type" : DataType.e5m2,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.bf16,
"d_type" : DataType.bf16,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.bf16,
"d_type" : DataType.e4m3,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.bf16,
"d_type" : DataType.e5m2,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f16,
"d_type" : DataType.f16,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f16,
"d_type" : DataType.e4m3,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
{
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : DataType.f16,
"d_type" : DataType.e5m2,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
},
]
if math_inst.instruction_shape[1] == 128:
tile_descriptions = [
# 128x128x128
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
]
# elif math_inst.instruction_shape[1] == 64:
# tile_descriptions = [
# # 256x64x128
# TileDescription([math_inst.instruction_shape[0]*4, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1]),
# ]
else:
assert False, f"math instruction shape {math_inst.instruction_shape} is not supported"
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
schedules = [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto],
[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized],
# [KernelScheduleType.CpAsyncWarpSpecializedPingpong, EpilogueScheduleType.NoSmemWarpSpecialized],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.NoSmemWarpSpecialized],
]
stream_k_schedules = [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.NoSmemWarpSpecialized]]
else:
schedules = [
# [KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto],
[KernelScheduleType.CpAsyncWarpSpecialized, EpilogueScheduleType.NoSmemWarpSpecialized]
]
stream_k_schedules = []
for data_type in data_types:
# With No-SMEM epilogues
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, schedules)
if CudaToolkitVersionSatisfies(cuda_version, 12, 1):
# Persistent kernels with TMA epilogues
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]])
# Add stream-K variants (with and without TMA epilogues)
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, stream_k_schedules, tile_schedulers=[TileSchedulerType.StreamK])
# CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type,
# [[KernelScheduleType.CpAsyncWarpSpecializedCooperative, EpilogueScheduleType.TmaWarpSpecializedCooperative]],
# tile_schedulers=[TileSchedulerType.StreamK])
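The schedule selection above hinges on `CudaToolkitVersionSatisfies`: cooperative cp.async schedules (and their stream-K variants) require CUDA 12.1+. A plausible sketch of that gating (the helper implementation and the string schedule names are assumptions):

```python
def cuda_version_satisfies(cuda_version, major, minor):
    # Hypothetical reconstruction: lexicographic (major, minor) comparison.
    parsed = tuple(int(x) for x in cuda_version.split(".")[:2])
    return parsed >= (major, minor)

def select_schedules(cuda_version):
    # Mirrors the branch above: cooperative schedules need CUDA 12.1+.
    if cuda_version_satisfies(cuda_version, 12, 1):
        schedules = ["cpasync_warpspecialized_cooperative", "cpasync_warpspecialized"]
        stream_k = ["cpasync_warpspecialized_cooperative"]
    else:
        schedules = ["cpasync_warpspecialized"]
        stream_k = []
    return schedules, stream_k

assert select_schedules("12.2")[1] == ["cpasync_warpspecialized_cooperative"]
assert select_schedules("12.0") == (["cpasync_warpspecialized"], [])
```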
#
def GenerateSM90_TensorOp_1684(manifest, cuda_version):
@@ -5488,9 +6016,13 @@ def GenerateSM90_TensorOp_1684_symm_complex_gaussian(manifest, cuda_version):
#
def GenerateSM90(manifest, cuda_version):
GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_16b_WGMMA_alignx_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_tf32_WGMMA_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_int8_WGMMA_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_int8_WGMMA_alignx_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_fp8_WGMMA_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_fp8_WGMMA_alignx_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_1684(manifest, cuda_version)
GenerateSM90_TensorOp_1684_complex(manifest, cuda_version)
GenerateSM90_TensorOp_1684_complex_gaussian(manifest, cuda_version)
@@ -5543,6 +6075,7 @@ def define_parser():
parser.add_argument("--disable-full-archs-compilation", action="store_true", required=False, help="Disable compilation for every archs in --architectures")
parser.add_argument("--log-level", default='info', type=numeric_log_level, required=False,
help='Logging level to be used by the generator script')
_add_package_disablement_flag(parser)
return parser
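`_add_package_disablement_flag` itself is not shown in this hunk; a hypothetical reconstruction of how such a flag would be attached (the flag name and help text are assumptions, tied to the `CUTLASS_IGNORE_PACKAGE` fallback used by the emitter modules):

```python
import argparse

def _add_package_disablement_flag(parser):
    # Hypothetical sketch (name/help are assumptions): a switch that makes the
    # scripts fall back to local modules instead of the installed
    # cutlass_library package.
    parser.add_argument("--disable-cutlass-package-imports", action="store_true",
                        required=False,
                        help="Disable use of the installed cutlass_library package")

parser = argparse.ArgumentParser()
_add_package_disablement_flag(parser)
args = parser.parse_args(["--disable-cutlass-package-imports"])
assert args.disable_cutlass_package_imports
```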


@@ -400,6 +400,9 @@ ShortComplexLayoutNames = {
class KernelScheduleType(enum.Enum):
ScheduleAuto = enum_auto()
Multistage = enum_auto()
CpAsyncWarpSpecialized = enum_auto()
CpAsyncWarpSpecializedPingpong = enum_auto()
CpAsyncWarpSpecializedCooperative = enum_auto()
Tma = enum_auto()
TmaWarpSpecialized = enum_auto()
TmaWarpSpecializedPingpong = enum_auto()
@@ -411,6 +414,9 @@ class KernelScheduleType(enum.Enum):
KernelScheduleTag = {
KernelScheduleType.ScheduleAuto: 'cutlass::gemm::collective::KernelScheduleAuto',
KernelScheduleType.Multistage: 'cutlass::gemm::KernelMultistage',
KernelScheduleType.CpAsyncWarpSpecialized: 'cutlass::gemm::KernelCpAsyncWarpSpecialized',
KernelScheduleType.CpAsyncWarpSpecializedPingpong: 'cutlass::gemm::KernelCpAsyncWarpSpecializedPingpong',
KernelScheduleType.CpAsyncWarpSpecializedCooperative: 'cutlass::gemm::KernelCpAsyncWarpSpecializedCooperative',
KernelScheduleType.Tma: 'cutlass::gemm::KernelTma',
KernelScheduleType.TmaWarpSpecialized: 'cutlass::gemm::KernelTmaWarpSpecialized',
KernelScheduleType.TmaWarpSpecializedPingpong: 'cutlass::gemm::KernelTmaWarpSpecializedPingpong',
@@ -424,6 +430,9 @@ KernelScheduleTag = {
KernelScheduleSuffixes = {
KernelScheduleType.ScheduleAuto: '',
KernelScheduleType.Multistage: '_cpasync',
KernelScheduleType.CpAsyncWarpSpecialized: '_cpasync_warpspecialized',
KernelScheduleType.CpAsyncWarpSpecializedPingpong: '_cpasync_warpspecialized_pingpong',
KernelScheduleType.CpAsyncWarpSpecializedCooperative: '_cpasync_warpspecialized_cooperative',
KernelScheduleType.Tma: '_unspecialized',
KernelScheduleType.TmaWarpSpecialized: '_warpspecialized',
KernelScheduleType.TmaWarpSpecializedPingpong: '_warpspecialized_pingpong',
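The suffix table above feeds kernel naming: each schedule's suffix is appended to the procedural kernel name. A small sketch (the base name below is hypothetical, and only a subset of the table is reproduced):

```python
# Assumed subset of KernelScheduleSuffixes.
kernel_schedule_suffixes = {
    "ScheduleAuto": "",
    "Multistage": "_cpasync",
    "CpAsyncWarpSpecialized": "_cpasync_warpspecialized",
    "CpAsyncWarpSpecializedCooperative": "_cpasync_warpspecialized_cooperative",
}

def kernel_name(base, schedule):
    # Append the schedule-specific suffix to a base procedural name.
    return base + kernel_schedule_suffixes[schedule]

assert kernel_name("cutlass3x_sm90_tensorop_gemm", "CpAsyncWarpSpecialized") \
    == "cutlass3x_sm90_tensorop_gemm_cpasync_warpspecialized"
assert kernel_name("cutlass3x_sm90_tensorop_gemm", "ScheduleAuto") \
    == "cutlass3x_sm90_tensorop_gemm"
```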
@@ -541,7 +550,6 @@ class OpcodeClass(enum.Enum):
WmmaTensorOp = enum_auto()
SparseTensorOp = enum_auto()
OpcodeClassNames = {
OpcodeClass.Simt: 'simt',
OpcodeClass.TensorOp: 'tensorop',
@@ -628,19 +636,20 @@ class GemmKind(enum.Enum):
Sparse = enum_auto()
Universal = enum_auto()
Universal3x = enum_auto()
SparseUniversal3x = enum_auto()
PlanarComplex = enum_auto()
PlanarComplexArray = enum_auto()
Grouped = enum_auto()
#
GemmKindNames = {
GemmKind.Gemm: "gemm",
GemmKind.Sparse: "spgemm",
GemmKind.Universal: "gemm",
GemmKind.Universal3x: "gemm",
GemmKind.SparseUniversal3x: "spgemm",
GemmKind.PlanarComplex: "gemm_planar_complex",
GemmKind.PlanarComplexArray: "gemm_planar_complex_array",
GemmKind.Grouped: "gemm_grouped",
}
#
@@ -797,7 +806,7 @@ class GroupMode(enum.Enum):
NoneGroup = enum_auto() # dense conv (G=1)
SingleGroup = enum_auto() # grouped convolution (single group per CTA)
MultipleGroup = enum_auto() # grouped convolution ( multiple groups per CTA)
Depthwise = enum_auto() # Depthwise convolution ( C=K=G )
#
GroupModeTag = {
@@ -818,14 +827,18 @@ GroupModeNames = {
#
class MathInstruction:
def __init__(self,
instruction_shape, \
element_a, element_b, element_accumulator, \
opcode_class, math_operation = MathOperation.multiply_add \
):
self.instruction_shape = instruction_shape
self.element_a = element_a
self.element_b = element_b
self.element_accumulator = element_accumulator
self.opcode_class = opcode_class
self.math_operation = math_operation
#
class TileDescription:


@@ -36,18 +36,31 @@ and building code
"""
import enum
import logging
import os.path
import shutil
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
from cutlass_library.gemm_operation import *
from cutlass_library.rank_k_operation import *
from cutlass_library.rank_2k_operation import *
from cutlass_library.trmm_operation import *
from cutlass_library.symm_operation import *
from cutlass_library.conv2d_operation import *
from cutlass_library.conv3d_operation import *
except ImportError:
from library import *
from gemm_operation import *
from rank_k_operation import *
from rank_2k_operation import *
from trmm_operation import *
from symm_operation import *
from conv2d_operation import *
from conv3d_operation import *
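The try/except pattern above lets the same file run both as part of the installed `cutlass_library` package and as a standalone script, with `CUTLASS_IGNORE_PACKAGE` (set as an attribute on `builtins`, which makes the bare name resolvable everywhere) forcing the local path. A self-contained sketch of the mechanism, with `json` and `types` standing in for the real modules:

```python
import builtins

def load_impl():
    # Sketch of the package-or-local import fallback above. `json` stands in
    # for cutlass_library.library, `types` for the local fallback module.
    try:
        if getattr(builtins, "CUTLASS_IGNORE_PACKAGE", False):
            raise ImportError("Disabling attempt to import cutlass_library")
        import json as impl   # stands in for `from cutlass_library.library import *`
    except ImportError:
        import types as impl  # stands in for the local `from library import *`
    return impl

import json, types
assert load_impl() is json                # package path by default
builtins.CUTLASS_IGNORE_PACKAGE = True
assert load_impl() is types               # forced local fallback
del builtins.CUTLASS_IGNORE_PACKAGE
```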
###################################################################################################
_LOGGER = logging.getLogger(__name__)
@@ -380,7 +393,6 @@ class Manifest:
architectures = args.architectures.split(';') if len(args.architectures) else ['50',]
architectures = [x if x != '90a' else '90' for x in architectures]
self.compute_capabilities = [int(x) for x in architectures]
if args.filter_by_cc in ['false', 'False', '0']:


@@ -35,12 +35,18 @@ Utilities for emitting Rank2K kernels
"""
import enum
import functools
import operator
import os.path
import shutil
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@@ -82,7 +88,7 @@ class Rank2KOperation:
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def is_planar_complex(self):
return False
@@ -234,7 +240,7 @@ using Operation_${operation_name} =
"""
self.rank_k_complex_template = """
// Rank K operator ${operation_name}
using Operation_${operation_name} =
typename cutlass::gemm::device::Rank2K<
${element_a}, ${layout_a},
${element_b}, ${layout_b},


@@ -35,12 +35,18 @@ Utilities for emitting RankK kernels
"""
import enum
import functools
import operator
import os.path
import shutil
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@@ -80,7 +86,7 @@ class RankKOperation:
#
def is_mixed_input(self):
return False
#
def is_planar_complex(self):
return False
@@ -259,7 +265,7 @@ using Operation_${operation_name} =
def emit(self, operation):
threadblock_shape = operation.tile_description.threadblock_shape
warp_count = operation.tile_description.warp_count
warp_shape = [threadblock_shape[idx] // warp_count[idx] for idx in range(3)]


@@ -35,12 +35,18 @@ Utilities for emitting Symm kernels
"""
import enum
import functools
import operator
import os.path
import shutil
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@@ -82,7 +88,7 @@ class SymmOperation:
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def is_planar_complex(self):
return False
@@ -241,7 +247,7 @@ using Operation_${operation_name} =
// Symm operator ${operation_name}
using Operation_${operation_name} =
typename cutlass::gemm::device::Symm<
${element_a}, ${layout_a}, ${side_mode}, ${fill_mode},
${element_b}, ${layout_b},
${element_c}, ${layout_c},
${element_accumulator},


@@ -35,12 +35,18 @@ Utilities for emitting Trmm kernels
"""
import enum
import functools
import operator
import os.path
import shutil
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
###################################################################################################
@@ -84,7 +90,7 @@ class TrmmOperation:
#
def is_mixed_input(self):
return self.A.element != self.B.element
#
def accumulator_type(self):
accum = self.tile_description.math_instruction.element_accumulator


@@ -1,40 +0,0 @@
#################################################################################################
#
# Copyright (c) 2023 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
FROM nvcr.io/nvidia/pytorch:22.11-py3
RUN chmod ugo+rwx /home
RUN pip uninstall -y rmm
RUN pip install rmm-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
ENV CUDA_INSTALL_PATH=/usr/local/cuda


@@ -1,38 +0,0 @@
FROM nvcr.io/nvidia/pytorch:23.01-py3
RUN chmod ugo+rwx /home
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
ENV CUDA_INSTALL_PATH=/usr/local/cuda


@@ -1,38 +0,0 @@
FROM nvcr.io/nvidia/pytorch:23.03-py3
RUN chmod ugo+rwx /home
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
ENV CUDA_INSTALL_PATH=/usr/local/cuda


@@ -9,28 +9,25 @@ Prior to installing the CUTLASS Python interface, one may optionally set the fol
* `CUDA_INSTALL_PATH`: the path to the installation of CUDA
If these environment variables are not set, the installation process will infer them to be the following:
* `CUTLASS_PATH`: either one directory level above the current directory (i.e., `$(pwd)/..`) if installed locally or in the `source` directory of the location in which `cutlass_library` was installed
* `CUDA_INSTALL_PATH`: the directory holding `/bin/nvcc` for the first version of `nvcc` on `$PATH` (i.e., `which nvcc | awk -F'/bin/nvcc' '{print $1}'`)
**NOTE:** The version of `cuda-python` installed must match the CUDA version in `CUDA_INSTALL_PATH`.
### Installing a developer-mode package
The CUTLASS Python interface can currently be installed by navigating to the root of the CUTLASS directory and performing
```bash
pip install .
```
If you would like to make changes to the CUTLASS Python interface and have them reflected when using the interface, perform:
```bash
pip install -e .
```
## Docker
We recommend using the CUTLASS Python interface via an [NGC PyTorch Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch):
For example, to launch an NGC PyTorch container, run:
```bash
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.08-py3
```
The CUTLASS Python interface has been tested with CUDA 11.8, 12.0, and 12.1 on Python 3.8.10 and 3.9.7.


@@ -51,7 +51,7 @@ setup_pycute.perform_setup()
setup(
name='cutlass',
version='3.3.0',
description='CUTLASS Pythonic Interface',
package_dir={'': '.'},
packages=[


@@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
setup(
name='cutlass_library',
version='3.3.0',
description='CUTLASS library generation scripts',
packages=['cutlass_library']
)


@@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
setup(
name='pycute',
version='3.3.0',
description='Python implementation of CuTe',
packages=['pycute'],
)