CUTLASS 3.5.0 (#1411)
@ -1,12 +1,14 @@

# Python packages associated with CUTLASS
This directory contains Python packages that are associated with CUTLASS:

* `cutlass`: the CUTLASS Python interface, which enables one to compile and run CUTLASS kernels from within Python
* `cutlass_library`: utilities used for enumerating and emitting C++ code for CUTLASS kernels

## CUTLASS Python Interface

The CUTLASS Python interface enables one to compile and run CUTLASS operations from within Python.

```python
@ -19,34 +21,46 @@ plan.run(A, B, C, D)
```
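
The hunk above truncates the README's usage example at the diff boundary. For orientation only, a minimal sketch of the documented flow is shown below; the `cutlass.op.Gemm` constructor arguments are recalled from the interface's examples and are assumptions, not part of this diff.

```python
import numpy as np
import cutlass

# Declare a GEMM plan; unspecified details fall back to sensible defaults.
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)

# Allocate operands and run D = A @ B + C with default alpha/beta.
A, B, C, D = [np.ones((128, 128), dtype=np.float16) for _ in range(4)]
plan.run(A, B, C, D)
```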

### Overview

The CUTLASS Python interface aims to provide an ease-of-use interface for using CUTLASS via Python. Toward this goal,
the CUTLASS Python interface attempts to:

* Present high-level interfaces for operators that require only few parameters
* Select sensible default configurations for an operator given the parameters that have been specified
* Enumerate configurations for users that are known to work in a given setting
* Reduce the occurrence of C++ compile-time errors in favor of descriptive Python exceptions
* Make it easy to export CUTLASS kernels to framework extensions (e.g., PyTorch CUDA extensions)

The CUTLASS Python interface prioritizes ease of use.
It has the following features that support this goal.

* It presents high-level interfaces for operators, that require only few parameters.
* It selects sensible default configurations for an operator given the parameters that have been specified.
* It enumerates configurations for users that are known to work in a given setting.
* It favors emitting descriptive Python run-time exceptions instead of C++ compile-time errors, where possible.
* It simplifies exporting CUTLASS kernels to framework extensions (e.g., PyTorch CUDA extensions).

#### Non-goals

The CUTLASS Python interface does not intended to:
The CUTLASS Python interface does not intend to:

**Select optimal kernel configurations.**
As an ease-of-use interface, the default selections for operator parameters made by the CUTLASS Python interface may
not achieve the highest possible performance in all scenarios. Users wishing to achieve the highest performance possible
should consider profile different combinations of configuration parameters, or use a library such as [cuBLAS](https://developer.nvidia.com/cublas)
that contains heuristics for selecting kernels.
1. select optimal kernel configurations,
2. act as a fast container for CUTLASS kernels, or
3. act as a Python-to-CUDA-kernel just-in-time (JIT) compilation engine.

**Act as a fast container for CUTLASS kernels.**
The CUTLASS Python interface does not strive to minimize overhead in its Python functions surrounding the running of a kernel.
Those wishing to deploy a CUTLASS kernel should consider either using the C++ emitted by the Python interface directly, or using
one of the CUTLASS emitters for automatically creating a framework extension for the kernel (e.g., a PyTorch CUDA extension).
Regarding selection of optimal kernel configurations,
the interface favors ease-of-use over maximum configurability.
Thus, its default selections for operator parameters may
not achieve the highest possible performance in all scenarios. Users wishing to achieve the highest performance possible should either

**Act as a Python-to-CUDA-kernel JIT compilation engine.**
The CUTLASS Python interface intends to enable one to use CUTLASS via Python. It can be used by frameworks for JIT compiling
* select parameters by profiling different combinations of them, or
* use a library such as [cuBLAS](https://developer.nvidia.com/cublas)
that contains heuristics for selecting kernels.

Regarding acting as a fast container for CUTLASS kernels:
the interface does not strive to minimize overhead in its Python functions surrounding the running of a kernel.
Those wishing to deploy a CUTLASS kernel should either

* use the C++ emitted by the Python interface directly, or
* use one of the CUTLASS emitters for automatically creating a framework extension for the kernel (e.g., a PyTorch CUDA extension).

Regarding acting as a Python-to-CUDA-kernel JIT compilation engine:
the interface enables use of CUTLASS in Python code.
It can be used by frameworks for JIT compiling
Python to CUDA kernels, but does not set out to be such a framework.

#### Comparison to PyCUTLASS

The CUTLASS Python interface builds atop CUTLASS's [PyCUTLASS](https://github.com/NVIDIA/cutlass/tree/v3.0.0/tools/library/scripts/pycutlass) library. PyCUTLASS enables
one to declare, compile, and run GEMMs, convolutions, and grouped GEMM operators with nearly the same configuration
space as CUTLASS's C++ interface. While this flexibility enables one to achieve the similar levels of functionality
@ -73,17 +87,21 @@ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.08-py3 -p 8888:8888
The CUTLASS Python interface has been tested with CUDA 11.8, 12.0, and 12.1 on Python 3.8 and 3.9.

#### Optional environment variables

Prior to installing the CUTLASS Python interface, one may optionally set the following environment variables:

* `CUTLASS_PATH`: the path to the cloned CUTLASS repository
* `CUDA_INSTALL_PATH`: the path to the installation of CUDA

If these environment variables are not set, the installation process will infer them to be the following:

* `CUTLASS_PATH`: either one directory level above the current directory (i.e., `$(pwd)/..`) if installed locally or in the `source` directory of the location in which `cutlass_library` was installed
* `CUDA_INSTALL_PATH`: the directory holding `/bin/nvcc` for the first version of `nvcc` on `$PATH` (i.e., `which nvcc | awk -F'/bin/nvcc' '{print $1}'`)

**NOTE:** The version of `cuda-python` installed must match the CUDA version in `CUDA_INSTALL_PATH`.
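
For illustration only (this snippet is not part of the installer), the fallback described above for `CUDA_INSTALL_PATH` can be expressed in Python roughly as:

```python
import os
import shutil

# Mirror of: which nvcc | awk -F'/bin/nvcc' '{print $1}'
nvcc = shutil.which("nvcc")  # first nvcc on $PATH, or None if not found
cuda_install_path = os.environ.get(
    "CUDA_INSTALL_PATH",
    nvcc.split("/bin/nvcc")[0] if nvcc else None,
)
print(cuda_install_path)  # e.g., /usr/local/cuda
```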

#### Installation

Stable releases of the CUTLASS Python interface are available via the `nvidia-cutlass` PyPI package. Any other packages with the name `cutlass` are not affiliated with NVIDIA CUTLASS.
```bash
pip install nvidia-cutlass
@ -94,7 +112,7 @@ The CUTLASS Python interface can also be installed from source by navigating to
pip install .
```

If you would like to be able to make changes to CUTLASS Python interface and have them reflected when using the interface, perform:
If you would like to be able to make changes to the CUTLASS Python interface and have them reflected when using the interface, perform:
```bash
pip install -e .
```
@ -118,6 +136,7 @@ Currently, the following operations can be exported to a PyTorch CUDA extension:
* Conv2d

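As a rough sketch of that export path, the snippet below uses the `cutlass.emit.pytorch` emitter imported elsewhere in this commit; the exact keyword arguments (`name`, `cc`, `sourcedir`, `jit`) are assumptions based on the interface's examples and may differ.

```python
import numpy as np
import cutlass

# Build a GEMM plan, then emit it as a PyTorch CUDA extension.
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)
op = plan.construct()

# Hypothetical invocation: parameter names are assumed, not confirmed by this diff.
mod = cutlass.emit.pytorch(op, name="cutlass_gemm", cc=plan.cc, sourcedir="out", jit=True)
```
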
### Examples

Jupyter notebook examples of using the CUTLASS Python interface are located in [examples/python](/examples/python).

To launch these notebooks from this directory, run:
@ -126,9 +145,10 @@ jupyter-lab ../examples/python
```

### Building documentation

The CUTLASS Python interface uses [Sphinx](https://www.sphinx-doc.org/en/master/) for documentation.

Building the documentation requires additional packages. These can be installed via:
Building the documentation requires additional packages. The following commands will install them.
```bash
sudo apt-get install pandoc
pip install --upgrade Sphinx furo pandoc myst-parser sphinx-copybutton nbsphinx nbsphinx-link sphinx-inline-tabs
@ -137,7 +157,7 @@ pip install --upgrade Sphinx furo pandoc myst-parser sphinx-copybutton nbsphinx
To build documentation, you must first have installed the CUTLASS Python interface via the
[installation instructions](#installation).

Documentation can then be built via the following commands:
Documentation can then be built via the following commands.
```bash
sphinx-apidoc -o docs_src/source/ cutlass/ cutlass/backend*
cd docs_src
@ -146,6 +166,7 @@ mv _build/* ../docs
```

## CUTLASS library package

[cutlass_library](/python/cutlass_library) contains utilities for enumerating and emitting CUTLASS C++ kernels.
It is used by the CUTLASS CMake system to construct a library of kernels that can be profiled using the CUTLASS profiler.

@ -121,7 +121,7 @@ def get_option_registry():
this._option_registry = OptionRegistry(device_cc())
return this._option_registry

this.__version__ = '3.4.1'
this.__version__ = '3.5.0'

from cutlass.backend import create_memory_pool
from cutlass.emit.pytorch import pytorch

@ -244,7 +244,7 @@ def get_gemm_arguments_3x(mainloop_arguments, epilogue_functor, scheduler_args,
class _HardwareInfo(ctypes.Structure):
_fields_ = [
("device_id", ctypes.c_int),
("sm_count", ctypes.c_int)
("sm_count", ctypes.c_int),
]

class _GemmArguments(ctypes.Structure):

@ -122,7 +122,7 @@ class LinearCombination(EpilogueFunctorBase):
|
||||
:param element_output: data type used to load and store tensors
|
||||
|
||||
:param epilogue_vector_length: number of elements computed per operation.
|
||||
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
when there are not enough data to store
|
||||
|
||||
:param element_accumulator: Accumulator data type
|
||||
@ -207,7 +207,7 @@ class LinearCombinationClamp(LinearCombination):
|
||||
:param element_output: data type used to load and store tensors
|
||||
|
||||
:param epilogue_vector_length: number of elements computed per operation.
|
||||
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
when there are not enough data to store
|
||||
|
||||
:param element_accumulator: Accumulator data type
|
||||
@ -260,7 +260,7 @@ class FastLinearCombinationClamp(EpilogueFunctorBase):
|
||||
:param element_output: data type used to load and store tensors
|
||||
|
||||
:param epilogue_vector_length: number of elements computed per operation.
|
||||
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
when there are not enough data to store
|
||||
"""
|
||||
|
||||
@ -310,7 +310,7 @@ class LinearCombinationGeneric(LinearCombination):
|
||||
:param element_output: data type used to load and store tensors
|
||||
|
||||
:param epilogue_vector_length: number of elements computed per operation.
|
||||
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
|
||||
when there are not enough data to store
|
||||
|
||||
:param element_accumulator: Accumulator data type
|
||||
|
||||
@ -299,7 +299,7 @@ class Sm90ColumnReductionImpl(ColumnReductionImpl):
|
||||
|
||||
self._type_decl = f"""
|
||||
using {self.name_camel} = cutlass::epilogue::fusion::Sm90ColReduction<
|
||||
{op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0,
|
||||
{op_tag(self.reg_reduce_fn)}, {op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0,
|
||||
typename EpilogueDescriptor::TileShape, {DataTypeTag[self.element]},
|
||||
{DataTypeTag[self.element_compute]}, {FloatRoundStyleTag[self.round_style]},
|
||||
{self.stride_mnl}
|
||||
@ -321,7 +321,7 @@ class Sm90RowReductionImpl(RowReductionImpl):
|
||||
|
||||
self._type_decl = f"""
|
||||
using {self.name_camel} = cutlass::epilogue::fusion::Sm90RowReduction<
|
||||
{op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0 /* Stages */,
|
||||
{op_tag(self.reg_reduce_fn)}, {op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0 /* Stages */,
|
||||
typename EpilogueDescriptor::TileShape, {DataTypeTag[self.element]},
|
||||
{DataTypeTag[self.element_compute]}, {FloatRoundStyleTag[self.round_style]},
|
||||
{self.stride_mnl}
|
||||
|
||||
@ -565,7 +565,9 @@ class GemmArguments3x(GemmArguments2x):
|
||||
)
|
||||
|
||||
# Set hardware info
|
||||
hw_info_ = hw_info(0, device_sm_count())
|
||||
hw_info_ = hw_info(
|
||||
0, device_sm_count(),
|
||||
)
|
||||
|
||||
self.arguments = argument_type(
|
||||
int(self.gemm_mode),
|
||||
@ -1300,7 +1302,7 @@ using DeviceKernel = cutlass::gemm::device::GemmUniversalAdapter<${operation_nam
|
||||
# Support built-in epilogue functors or user-defined functions
|
||||
|
||||
if operation.tile_description.stages is None or operation.tile_description.stages == 0:
|
||||
stage_count_type = "cutlass::gemm::collective::StageCountAutoCarveout<sizeof(typename CollectiveEpilogue::SharedStorage)>"
|
||||
stage_count_type = "cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>"
|
||||
else:
|
||||
stage_count_type = "_" + str(operation.tile_description.stages)
|
||||
|
||||
|
||||
@ -35,16 +35,22 @@ Utilities for emitting Conv2d kernels
|
||||
"""
|
||||
|
||||
import enum
|
||||
import logging
|
||||
import os.path
|
||||
import shutil
|
||||
from string import Template
|
||||
|
||||
try:
|
||||
import builtins
|
||||
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
|
||||
raise ImportError("Disabling attempt to import cutlass_library")
|
||||
from cutlass_library.library import *
|
||||
from cutlass_library.conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
|
||||
except ImportError:
|
||||
from library import *
|
||||
from conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
|
||||
|
||||
_LOGGER = logging.getLogger(__name__)
|
||||
|
||||
###################################################################################################
|
||||
|
||||
@ -174,6 +180,8 @@ class Conv2dOperation:
|
||||
|
||||
class EmitConv2dInstance:
|
||||
def __init__(self):
|
||||
# Emitter for CUTLASS 3 convolution operations
|
||||
self.conv3x_emitter = EmitConv3xInstance()
|
||||
self.template = """
|
||||
// Conv2d${conv_kind_name} ${iterator_algorithm_name} kernel instance "${operation_name}"
|
||||
using ${operation_name}_base =
|
||||
@ -277,7 +285,18 @@ class EmitConv2dInstance:
|
||||
>::Kernel;
|
||||
"""
|
||||
|
||||
def arch_number_to_type(self, arch: int):
|
||||
return f"cutlass::arch::Sm{arch}"
|
||||
|
||||
def emit(self, operation):
|
||||
_LOGGER.debug("*** EmitConv2dInstance::emit")
|
||||
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
|
||||
|
||||
if hasattr(operation, 'is_3x') and operation.is_3x:
|
||||
_LOGGER.debug("*** CUTLASS 3 operation")
|
||||
return self.conv3x_emitter.emit(operation)
|
||||
|
||||
_LOGGER.debug("*** CUTLASS 2 operation")
|
||||
|
||||
warp_shape = [int(operation.tile_description.threadblock_shape[idx] / operation.tile_description.warp_count[idx]) for idx in range(3)]
|
||||
|
||||
@ -320,9 +339,11 @@ class EmitConv2dInstance:
|
||||
}
|
||||
|
||||
if operation.group_mode == GroupMode.NoneGroup:
|
||||
_LOGGER.debug("*** group_mode=NoneGroup")
|
||||
return SubstituteTemplate(self.template, values)
|
||||
|
||||
elif operation.group_mode == GroupMode.Depthwise:
|
||||
_LOGGER.debug("*** group_mode=Depthwise")
|
||||
values['group_mode'] = GroupModeTag[operation.group_mode]
|
||||
# Setup other template params
|
||||
values['threadblock_output_shape_n'] = str(operation.tile_description.threadblock_output_shape[0])
|
||||
@ -343,6 +364,7 @@ class EmitConv2dInstance:
|
||||
return SubstituteTemplate(self.template_depthwise_direct_conv, values)
|
||||
|
||||
else:
|
||||
_LOGGER.debug("*** group_mode=" + GroupModeTag[operation.group_mode])
|
||||
values['group_mode'] = GroupModeTag[operation.group_mode]
|
||||
return SubstituteTemplate(self.template_group_conv, values)
|
||||
|
||||
@ -354,6 +376,7 @@ class EmitConv2dInstance:
|
||||
|
||||
#
|
||||
def GenerateConv2dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
|
||||
_LOGGER.debug("*** GenerateConv2dTensorOp")
|
||||
|
||||
for tile in tile_descriptions:
|
||||
for conv_kind in [ConvKind.Fprop, ConvKind.Dgrad, ConvKind.Wgrad]:
|
||||
@ -372,6 +395,24 @@ def GenerateConv2dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
|
||||
|
||||
manifest.append(Conv2dOperation(conv_kind, min_cc, tile, A, B, C, tile.math_instruction.element_accumulator))
|
||||
|
||||
class EmitConv2dIncludes:
|
||||
'''Emit includes that are specific to the operation.'''
|
||||
|
||||
def __init__(self):
|
||||
self.includes = ['conv2d_operation.h']
|
||||
self.emitter_3x = EmitConv3xIncludes()
|
||||
|
||||
def operation_is_3x(self, operation) -> bool:
|
||||
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
|
||||
return hasattr(operation, 'is_3x') and operation.is_3x
|
||||
|
||||
def emit(self, operation) -> str:
|
||||
if self.operation_is_3x(operation):
|
||||
return self.emitter_3x.emit(operation)
|
||||
|
||||
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
|
||||
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"
|
||||
|
||||
###################################################################################################
|
||||
#
|
||||
# Emitters functions for all targets
|
||||
@ -384,17 +425,8 @@ class EmitConv2dConfigurationLibrary:
|
||||
self.configuration_path = os.path.join(operation_path, "%s.cu" % configuration_name)
|
||||
|
||||
self.instance_emitter = EmitConv2dInstance()
|
||||
self.includes_emitter = EmitConv2dIncludes()
|
||||
|
||||
self.instance_template = """
|
||||
${operation_instance}
|
||||
|
||||
// Derived class
|
||||
struct ${operation_name} :
|
||||
public ${operation_name}_base { };
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
"""
|
||||
self.header_template = """
|
||||
/*
|
||||
Generated by conv2d_operation.py - Do not edit.
|
||||
@ -407,9 +439,17 @@ struct ${operation_name} :
|
||||
#include "cutlass/library/manifest.h"
|
||||
|
||||
#include "library_internal.h"
|
||||
#include "conv2d_operation.h"
|
||||
"""
|
||||
|
||||
self.instance_template = """
|
||||
${stub_begin}
|
||||
${operation_instance}
|
||||
// Derived class
|
||||
struct ${operation_name} :
|
||||
public ${operation_name}_base { };
|
||||
${stub_end}
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
"""
|
||||
|
||||
self.configuration_header = """
|
||||
@ -419,32 +459,22 @@ namespace library {
|
||||
|
||||
// Initialize all instances
|
||||
void initialize_${configuration_name}(Manifest &manifest) {
|
||||
|
||||
"""
|
||||
|
||||
self.configuration_instance = """
|
||||
using Operation_${operation_name} = cutlass::conv::device::ImplicitGemmConvolution<
|
||||
self.configuration_instance = """${stub_begin}
|
||||
using Operation_${operation_name} = cutlass::conv::device::${kernel_name}<
|
||||
${operation_name}>;
|
||||
|
||||
manifest.append(new cutlass::library::Conv2dOperation<
|
||||
Operation_${operation_name}>(
|
||||
"${operation_name}"));
|
||||
|
||||
manifest.append(new cutlass::library::${operation_wrapper}<
|
||||
Operation_${operation_name}
|
||||
>(
|
||||
"${operation_name}"
|
||||
));
|
||||
${stub_end}
|
||||
"""
|
||||
|
||||
self.configuration_direct_conv_instance = """
|
||||
using Operation_${operation_name} = cutlass::conv::device::DirectConvolution<
|
||||
${operation_name}>;
|
||||
self.configuration_epilogue = "}\n"
|
||||
|
||||
manifest.append(new cutlass::library::DirectConv2dOperation<
|
||||
Operation_${operation_name}>(
|
||||
"${operation_name}"));
|
||||
|
||||
"""
|
||||
|
||||
self.configuration_epilogue = """
|
||||
}
|
||||
"""
|
||||
self.epilogue_template = """
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
@ -456,42 +486,131 @@ void initialize_${configuration_name}(Manifest &manifest) {
|
||||
|
||||
"""
|
||||
|
||||
#
|
||||
def operation_is_3x(self, operation):
|
||||
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
|
||||
return hasattr(operation, 'is_3x') and operation.is_3x
|
||||
|
||||
def __enter__(self):
|
||||
"""
|
||||
Open the configuration_file, and write the "header" C++ code to it.
|
||||
|
||||
The "header" consists of a comment (that this is generated code,
|
||||
so it should not be edited), and includes that are common
|
||||
to all kinds of kernels.
|
||||
"""
|
||||
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::__enter__')
|
||||
_LOGGER.debug('*** configuration_path (file to write): ' +
|
||||
str(self.configuration_path))
|
||||
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
|
||||
self.configuration_file = open(self.configuration_path, "w")
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.header_template, {
|
||||
'configuration_name': self.configuration_name
|
||||
}))
|
||||
self.operations = []
|
||||
return self
|
||||
|
||||
#
|
||||
def emit(self, operation):
|
||||
"""
|
||||
Write three pieces of C++ code to the configuration_file
|
||||
(that was opened by the __enter__ method above):
|
||||
|
||||
1. the header includes that are specific to the operation
|
||||
(CUTLASS 2 vs. CUTLASS 3);
|
||||
|
||||
2. the "operation instance" (a "using" declaration ending in "_base"); and
|
||||
|
||||
3. the "operation name" (declaration and definition of a derived class
|
||||
of the above operation instance).
|
||||
|
||||
The "using" declaration turns a C++ class name, possibly namespace-qualified,
|
||||
possibly also with angle brackets, into a C-style, easily demangled identifier.
|
||||
"""
|
||||
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::emit')
|
||||
_LOGGER.debug('*** operation.procedural_name(): ' + operation.procedural_name())
|
||||
self.operations.append(operation)
|
||||
self.configuration_file.write(SubstituteTemplate(self.instance_template, {
|
||||
|
||||
self.configuration_file.write(self.includes_emitter.emit(operation))
|
||||
|
||||
stub_begin = ''
|
||||
stub_end = ''
|
||||
# It can be useful to stub (comment) out instantiations for testing.
|
||||
# In this case, one need only set is_stub to True.
|
||||
is_stub = False
|
||||
if is_stub:
|
||||
stub_begin = "// STUB for now\n#if 0"
|
||||
stub_end = '#endif // 0'
|
||||
|
||||
self.configuration_file.write(Template(self.instance_template).substitute({
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name(),
|
||||
'operation_instance': self.instance_emitter.emit(operation)
|
||||
'operation_instance': self.instance_emitter.emit(operation),
|
||||
'stub_begin': stub_begin,
|
||||
'stub_end': stub_end
|
||||
}))
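
To make the stubbing mechanism above concrete, here is a standalone illustration of how the `${stub_begin}`/`${stub_end}` markers behave under `Template.substitute`; the operation name, placeholder instance, and trimmed-down template are hypothetical and not taken from this diff.

```python
from string import Template

# Trimmed-down stand-in for the instance_template used above.
instance_template = Template("""${stub_begin}
${operation_instance}
// Derived class
struct ${operation_name} :
  public ${operation_name}_base { };
${stub_end}
""")

# With is_stub=True, the emitted C++ is wrapped in "#if 0 ... #endif",
# so it remains in the generated file but compiles to nothing.
print(instance_template.substitute(
    operation_name="cutlass_example_kernel",                          # hypothetical name
    operation_instance="using cutlass_example_kernel_base = int;",    # placeholder instantiation
    stub_begin="// STUB for now\n#if 0",
    stub_end="#endif // 0",
))
```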
|
||||
|
||||
#
|
||||
def __exit__(self, exception_type, exception_value, traceback):
|
||||
"""
|
||||
Write the rest of the C++ code to the configuration_file, and close the file.
|
||||
|
||||
The "rest of the C++ code" has the following components.
|
||||
|
||||
1. Configuration header: Open the namespace(s), and open the definition
|
||||
of the "initialize_${configuration_name}" registration function
|
||||
that registers the operation with the Manifest.
|
||||
("Registration" helps turn C++ compile-time polymorphism
|
||||
(via template parameters) into a run-time choice of parameters.)
|
||||
|
||||
2. Configuration instance: In the body of the registration function,
|
||||
make a "using" declaration Operation_${operation_name} for the
|
||||
operation type (which uses operation_name as its template argument).
|
||||
Then, tell the manifest about the operation via a "manifest.append" call.
|
||||
The argument of the call is a new instance of
|
||||
"SomethingOperation<Operation_${operation_name}>"
|
||||
(replace Something with a specific name).
|
||||
|
||||
3. Configuration epilogue: Close the definition of the registration function.
|
||||
|
||||
4. Epilogue template: Close the namespace(s).
|
||||
"""
|
||||
|
||||
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::__exit__')
|
||||
_LOGGER.debug('*** configuration_path (file to write): ' +
|
||||
str(self.configuration_path))
|
||||
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_header, {
|
||||
'configuration_name': self.configuration_name
|
||||
}))
|
||||
|
||||
for operation in self.operations:
|
||||
stub_begin = ''
|
||||
stub_end = ''
|
||||
# It can be useful to stub (comment) out instantiations for testing.
|
||||
# In this case, one need only set is_stub to True.
|
||||
is_stub = False
|
||||
if is_stub:
|
||||
stub_begin = "// STUB for now\n#if 0"
|
||||
stub_end = "#endif // 0"
|
||||
|
||||
if operation.group_mode == GroupMode.Depthwise:
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_direct_conv_instance, {
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name()
|
||||
}))
|
||||
kernel_name = 'DirectConvolution'
|
||||
operation_wrapper = 'DirectConv2dOperation'
|
||||
else:
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name()
|
||||
}))
|
||||
kernel_name = 'ImplicitGemmConvolution'
|
||||
operation_wrapper = 'Conv2dOperation'
|
||||
if self.operation_is_3x(operation):
|
||||
kernel_name = 'ConvUniversalAdapter'
|
||||
operation_wrapper = 'ConvOperation3x'
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name(),
|
||||
'kernel_name': kernel_name,
|
||||
'operation_wrapper': operation_wrapper,
|
||||
'stub_begin': stub_begin,
|
||||
'stub_end': stub_end
|
||||
}))
|
||||
|
||||
self.configuration_file.write(self.configuration_epilogue)
|
||||
self.configuration_file.write(self.epilogue_template)
|
||||
|
||||
@ -35,16 +35,22 @@ Utilities for emitting Conv3d kernels
|
||||
"""
|
||||
|
||||
import enum
|
||||
import logging
|
||||
import os.path
|
||||
import shutil
|
||||
from string import Template
|
||||
|
||||
try:
|
||||
import builtins
|
||||
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
|
||||
raise ImportError("Disabling attempt to import cutlass_library")
|
||||
from cutlass_library.library import *
|
||||
from cutlass_library.conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
|
||||
except ImportError:
|
||||
from library import *
|
||||
from conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
|
||||
|
||||
_LOGGER = logging.getLogger(__name__)
|
||||
|
||||
###################################################################################################
|
||||
|
||||
@ -148,6 +154,8 @@ class Conv3dOperation:
|
||||
|
||||
class EmitConv3dInstance:
|
||||
def __init__(self):
|
||||
# Emitter for CUTLASS 3 convolution operations
|
||||
self.conv3x_emitter = EmitConv3xInstance()
|
||||
self.template = """
|
||||
// Conv3d${conv_kind_name} ${iterator_algorithm_name} kernel instance "${operation_name}"
|
||||
using ${operation_name}_base =
|
||||
@ -178,8 +186,15 @@ class EmitConv3dInstance:
|
||||
>::Kernel;
|
||||
"""
|
||||
|
||||
|
||||
def emit(self, operation):
|
||||
_LOGGER.debug("*** EmitConv3dInstance::emit")
|
||||
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
|
||||
|
||||
if hasattr(operation, 'is_3x') and operation.is_3x:
|
||||
_LOGGER.debug("*** CUTLASS 3 operation")
|
||||
return self.conv3x_emitter.emit(operation)
|
||||
|
||||
_LOGGER.debug("*** CUTLASS 2 operation")
|
||||
|
||||
warp_shape = [int(operation.tile_description.threadblock_shape[idx] / operation.tile_description.warp_count[idx]) for idx in range(3)]
|
||||
|
||||
@ -245,6 +260,24 @@ def GenerateConv3dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
|
||||
|
||||
manifest.append(Conv3dOperation(conv_kind, min_cc, tile, A, B, C, tile.math_instruction.element_accumulator))
|
||||
|
||||
class EmitConv3dIncludes:
|
||||
'''Emit includes that are specific to the operation.'''
|
||||
|
||||
def __init__(self):
|
||||
self.includes = ['conv3d_operation.h']
|
||||
self.emitter_3x = EmitConv3xIncludes()
|
||||
|
||||
def operation_is_3x(self, operation) -> bool:
|
||||
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
|
||||
return hasattr(operation, 'is_3x') and operation.is_3x
|
||||
|
||||
def emit(self, operation) -> str:
|
||||
if self.operation_is_3x(operation):
|
||||
return self.emitter_3x.emit(operation)
|
||||
|
||||
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
|
||||
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"
|
||||
|
||||
###################################################################################################
|
||||
#
|
||||
# Emitters functions for all targets
|
||||
@ -257,17 +290,8 @@ class EmitConv3dConfigurationLibrary:
|
||||
self.configuration_path = os.path.join(operation_path, "%s.cu" % configuration_name)
|
||||
|
||||
self.instance_emitter = EmitConv3dInstance()
|
||||
self.includes_emitter = EmitConv3dIncludes()
|
||||
|
||||
self.instance_template = """
|
||||
${operation_instance}
|
||||
|
||||
// Derived class
|
||||
struct ${operation_name} :
|
||||
public ${operation_name}_base { };
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
"""
|
||||
self.header_template = """
|
||||
/*
|
||||
Generated by conv3d_operation.py - Do not edit.
|
||||
@ -280,9 +304,17 @@ struct ${operation_name} :
|
||||
#include "cutlass/library/manifest.h"
|
||||
|
||||
#include "library_internal.h"
|
||||
#include "conv3d_operation.h"
|
||||
"""
|
||||
|
||||
self.instance_template = """
|
||||
${stub_begin}
|
||||
${operation_instance}
|
||||
// Derived class
|
||||
struct ${operation_name} :
|
||||
public ${operation_name}_base { };
|
||||
${stub_end}
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
"""
|
||||
|
||||
self.configuration_header = """
|
||||
@ -292,22 +324,22 @@ namespace library {
|
||||
|
||||
// Initialize all instances
|
||||
void initialize_${configuration_name}(Manifest &manifest) {
|
||||
|
||||
"""
|
||||
|
||||
self.configuration_instance = """
|
||||
using Operation_${operation_name} = cutlass::conv::device::ImplicitGemmConvolution<
|
||||
self.configuration_instance = """${stub_begin}
|
||||
using Operation_${operation_name} = cutlass::conv::device::${kernel_name}<
|
||||
${operation_name}>;
|
||||
|
||||
manifest.append(new cutlass::library::Conv3dOperation<
|
||||
Operation_${operation_name}>(
|
||||
"${operation_name}"));
|
||||
|
||||
manifest.append(new cutlass::library::${operation_wrapper}<
|
||||
Operation_${operation_name}
|
||||
>(
|
||||
"${operation_name}"
|
||||
));
|
||||
${stub_end}
|
||||
"""
|
||||
|
||||
self.configuration_epilogue = """
|
||||
}
|
||||
"""
|
||||
self.configuration_epilogue = "}\n"
|
||||
|
||||
self.epilogue_template = """
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////////////////////////////
|
||||
@ -319,35 +351,126 @@ void initialize_${configuration_name}(Manifest &manifest) {
|
||||
|
||||
"""
|
||||
|
||||
#
|
||||
def operation_is_3x(self, operation):
|
||||
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
|
||||
return hasattr(operation, 'is_3x') and operation.is_3x
|
||||
|
||||
def __enter__(self):
|
||||
"""
|
||||
Open the configuration_file, and write the "header" C++ code to it.
|
||||
|
||||
The "header" consists of a comment (that this is generated code,
|
||||
so it should not be edited), and includes that are common
|
||||
to both the CUTLASS 2 and the CUTLASS 3 cases.
|
||||
"""
|
||||
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::__enter__')
|
||||
_LOGGER.debug('*** configuration_path (file to write): ' +
|
||||
str(self.configuration_path))
|
||||
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
|
||||
self.configuration_file = open(self.configuration_path, "w")
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.header_template, {
|
||||
'configuration_name': self.configuration_name
|
||||
}))
|
||||
self.operations = []
|
||||
return self
|
||||
|
||||
#
|
||||
def emit(self, operation):
|
||||
"""
|
||||
Write three pieces of C++ code to the configuration_file
|
||||
(that was opened by the __enter__ method above):
|
||||
|
||||
1. the header includes that are specific to the operation
|
||||
(CUTLASS 2 vs. CUTLASS 3);
|
||||
|
||||
2. the "operation instance" (a "using" declaration ending in "_base"); and
|
||||
|
||||
3. the "operation name" (declaration and definition of a derived class
|
||||
of the above operation instance).
|
||||
|
||||
The "using" declaration turns a C++ class name, possibly namespace-qualified,
|
||||
possibly also with angle brackets, into a C-style, easily demangled identifier.
|
||||
"""
|
||||
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::emit')
|
||||
_LOGGER.debug('*** operation.procedural_name(): ' + operation.procedural_name())
|
||||
self.operations.append(operation)
|
||||
self.configuration_file.write(SubstituteTemplate(self.instance_template, {
|
||||
|
||||
self.configuration_file.write(self.includes_emitter.emit(operation))
|
||||
|
||||
stub_begin = ''
|
||||
stub_end = ''
|
||||
# It can be useful to stub (comment) out instantiations for testing.
|
||||
# In this case, one need only set is_stub to True.
|
||||
is_stub = False
|
||||
if is_stub:
|
||||
stub_begin = "// STUB for now\n#if 0"
|
||||
stub_end = '#endif // 0'
|
||||
|
||||
self.configuration_file.write(Template(self.instance_template).substitute({
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name(),
|
||||
'operation_instance': self.instance_emitter.emit(operation)
|
||||
'operation_instance': self.instance_emitter.emit(operation),
|
||||
'stub_begin': stub_begin,
|
||||
'stub_end': stub_end
|
||||
}))
|
||||
|
||||
#
|
||||
def __exit__(self, exception_type, exception_value, traceback):
|
||||
"""
|
||||
Write the rest of the C++ code to the configuration_file, and close the file.
|
||||
|
||||
The "rest of the C++ code" has the following components.
|
||||
|
||||
1. Configuration header: Open the namespace(s), and open the definition
|
||||
of the "initialize_${configuration_name}" registration function
|
||||
that registers the operation with the Manifest.
|
||||
("Registration" helps turn C++ compile-time polymorphism
|
||||
(via template parameters) into a run-time choice of parameters.)
|
||||
|
||||
2. Configuration instance: In the body of the registration function,
|
||||
make a "using" declaration Operation_${operation_name} for the
|
||||
operation type (which uses operation_name as its template argument).
|
||||
Then, tell the manifest about the operation via a "manifest.append" call.
|
||||
The argument of the call is a new instance of
|
||||
"SomethingOperation<Operation_${operation_name}>"
|
||||
(replace Something with a specific name).
|
||||
|
||||
3. Configuration epilogue: Close the definition of the registration function.
|
||||
|
||||
4. Epilogue template: Close the namespace(s).
|
||||
"""
|
||||
|
||||
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::__exit__')
|
||||
_LOGGER.debug('*** configuration_path (file to write): ' +
|
||||
str(self.configuration_path))
|
||||
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_header, {
|
||||
'configuration_name': self.configuration_name
|
||||
}))
|
||||
|
||||
for operation in self.operations:
|
||||
stub_begin = ''
|
||||
stub_end = ''
|
||||
# It can be useful to stub (comment) out instantiations for testing.
|
||||
# In this case, one need only set is_stub to True.
|
||||
is_stub = False
|
||||
if is_stub:
|
||||
stub_begin = "// STUB for now\n#if 0"
|
||||
stub_end = "#endif // 0"
|
||||
|
||||
kernel_name = 'ImplicitGemmConvolution'
|
||||
operation_wrapper = 'Conv3dOperation'
|
||||
if self.operation_is_3x(operation):
|
||||
kernel_name = 'ConvUniversalAdapter'
|
||||
operation_wrapper = 'ConvOperation3x'
|
||||
|
||||
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
|
||||
'configuration_name': self.configuration_name,
|
||||
'operation_name': operation.procedural_name()
|
||||
'operation_name': operation.procedural_name(),
|
||||
'kernel_name': kernel_name,
|
||||
'operation_wrapper': operation_wrapper,
|
||||
'stub_begin': stub_begin,
|
||||
'stub_end': stub_end
|
||||
}))
|
||||
|
||||
self.configuration_file.write(self.configuration_epilogue)
|
||||
@ -357,4 +480,3 @@ void initialize_${configuration_name}(Manifest &manifest) {
|
||||
|
||||
###################################################################################################
|
||||
###################################################################################################
|
||||
|
||||
|
||||
python/cutlass_library/conv3x_emitter.py (new file, 220 lines)
@ -0,0 +1,220 @@
|
||||
#################################################################################################
|
||||
#
|
||||
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: BSD-3-Clause
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# 1. Redistributions of source code must retain the above copyright notice, this
|
||||
# list of conditions and the following disclaimer.
|
||||
#
|
||||
# 2. Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
#
|
||||
# 3. Neither the name of the copyright holder nor the names of its
|
||||
# contributors may be used to endorse or promote products derived from
|
||||
# this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
||||
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
||||
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
#################################################################################################
|
||||
|
||||
"""
|
||||
Utilities for emitting CUTLASS >= 3 convolution kernels
|
||||
"""
|
||||
|
||||
import enum
|
||||
import os.path
|
||||
import shutil
|
||||
import logging
|
||||
from string import Template
|
||||
|
||||
try:
|
||||
import builtins
|
||||
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
|
||||
raise ImportError("Disabling attempt to import cutlass_library")
|
||||
from cutlass_library.library import *
|
||||
except ImportError:
|
||||
from library import *
|
||||
|
||||
_LOGGER = logging.getLogger(__name__)
|
||||
|
||||
###################################################################################################
|
||||
#
|
||||
# Emits single instances of a CUTLASS device-wide operator
|
||||
#
|
||||
###################################################################################################
|
||||
|
||||
class EmitConv3xInstance:
|
||||
def __init__(self):
|
||||
_LOGGER.debug("*** EmitConv3xInstance::__init__")
|
||||
|
||||
# Define epilogue type first, so that the mainloop type
|
||||
# can use it with StageCountAutoCarveout.
|
||||
self.template = """
|
||||
|
||||
// CUTLASS >= 3 convolution ${conv_kind_name} kernel instance "${operation_name}"
|
||||
using ${operation_name}_epilogue =
|
||||
typename cutlass::epilogue::collective::CollectiveBuilder<
|
||||
${arch},
|
||||
${opcode_class_epi},
|
||||
${tile_shape}, // tile shape
|
||||
${cluster_shape}, // cluster shape
|
||||
${epi_tile_mn},
|
||||
${element_accumulator},
|
||||
${element_compute},
|
||||
${element_c}, ${layout_c}, 128 / cute::sizeof_bits_v<${element_c}>,
|
||||
${element_d}, ${layout_d}, 128 / cute::sizeof_bits_v<${element_d}>,
|
||||
${epilogue_schedule}
|
||||
// , class FusionOpOrCallbacks = cutlass::epilogue::fusion::LinearCombination<ElementD,ElementCompute>
|
||||
>::CollectiveOp;
|
||||
|
||||
using ${operation_name}_mainloop =
|
||||
typename cutlass::conv::collective::CollectiveBuilder<
|
||||
${arch},
|
||||
${opcode_class_main},
|
||||
${conv_kind}, // kFprop, kDgrad, or kWgrad
|
||||
${element_a}, ${layout_a}, 128 / cute::sizeof_bits_v<${element_a}>,
|
||||
${element_b}, ${layout_b}, 128 / cute::sizeof_bits_v<${element_b}>,
|
||||
${element_accumulator},
|
||||
${tile_shape}, // tile shape
|
||||
${cluster_shape}, // cluster shape
|
||||
${stages},
|
||||
${kernel_schedule}
|
||||
>::CollectiveOp;
|
||||
|
||||
// Unit tests call this "ConvKernel".
|
||||
// Conv operator ${operation_name}
|
||||
using ${operation_name}_base = cutlass::conv::kernel::ConvUniversal<
|
||||
${operation_name}_mainloop,
|
||||
${operation_name}_epilogue,
|
||||
${tile_scheduler}
|
||||
>;
|
||||
"""
|
||||
|
||||
def arch_number_to_type(self, arch: int) -> str:
|
||||
return f"cutlass::arch::Sm{arch}"
|
||||
|
||||
def tile_shape(self, operation) -> str:
|
||||
# For all three kinds of convolutions, the tile shape's K mode
|
||||
# differs from GEMM in that needs to be wrapped in a Shape.
|
||||
# For Wgrad convolutions specifically,
|
||||
# the N tile shape also needs to be wrapped in a Shape.
|
||||
m_template = 'cute::_${tile_shape_m}'
|
||||
if operation.conv_kind == ConvKind.Wgrad:
|
||||
n_template = 'cute::Shape<cute::_${tile_shape_n}>'
|
||||
else:
|
||||
n_template = 'cute::_${tile_shape_n}'
|
||||
k_template = 'cute::Shape<cute::_${tile_shape_k}>'
|
||||
|
||||
tile_shape_template = f'cute::Shape<{m_template}, {n_template}, {k_template}>'
|
||||
values = {
|
||||
'tile_shape_m': operation.tile_description.tile_shape[0],
|
||||
'tile_shape_n': operation.tile_description.tile_shape[1],
|
||||
'tile_shape_k': operation.tile_description.tile_shape[2]
|
||||
}
|
||||
return Template(tile_shape_template).substitute(values)
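
For concreteness, the method above renders strings like the following (the tile sizes are made up for illustration):

```python
# Fprop or Dgrad with tile_shape == (128, 128, 64):
#   cute::Shape<cute::_128, cute::_128, cute::Shape<cute::_64>>
# Wgrad with tile_shape == (128, 128, 64):
#   cute::Shape<cute::_128, cute::Shape<cute::_128>, cute::Shape<cute::_64>>
```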
|
||||
|
||||
def cluster_shape(self, operation) -> str:
|
||||
m_template = 'cute::_${cluster_shape_m}'
|
||||
n_template = 'cute::_${cluster_shape_n}'
|
||||
k_template = 'cute::_${cluster_shape_k}'
|
||||
cluster_shape_template = f'cute::Shape<{m_template}, {n_template}, {k_template}>'
|
||||
values = {
|
||||
'cluster_shape_m': operation.tile_description.cluster_shape[0],
|
||||
'cluster_shape_n': operation.tile_description.cluster_shape[1],
|
||||
'cluster_shape_k': operation.tile_description.cluster_shape[2],
|
||||
}
|
||||
return Template(cluster_shape_template).substitute(values)
|
||||
|
||||
def stage_count(self, operation) -> str:
|
||||
# stages == 0 tells builder to pick the number of stages automatically
|
||||
namespace_prefix = 'cutlass::conv::collective::'
|
||||
if operation.tile_description.stages > 0:
|
||||
return f"{namespace_prefix}StageCount<{str(operation.tile_description.stages)}>"
|
||||
else:
|
||||
return f"{namespace_prefix}StageCountAutoCarveout<sizeof(typename {operation.procedural_name()}_epilogue::SharedStorage)>"
|
||||
|
||||
def emit(self, operation) -> str:
|
||||
_LOGGER.debug("*** EmitConv3xInstance::emit")
|
||||
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
|
||||
|
||||
# Identify the operation as CUTLASS 3 by its is_3x field
|
||||
if (not hasattr(operation, 'is_3x')) or (not operation.is_3x):
|
||||
raise RuntimeError("operation must be a CUTLASS 3 operation")
|
||||
|
||||
epi_tile_mn = "cutlass::epilogue::collective::EpilogueTileAuto"
|
||||
opcode_class_main = OpcodeClassTag[operation.tile_description.math_instruction.opcode_class]
|
||||
opcode_class_epi = opcode_class_main
|
||||
|
||||
tile_shape = operation.tile_description.tile_shape
|
||||
warp_count = operation.tile_description.warp_count
|
||||
epilogue_schedule = EpilogueScheduleTag[operation.epilogue_schedule]
|
||||
|
||||
# KernelScheduleTag and TileSchedulerTag both hard-code the
|
||||
# namespace qualification of KernelScheduleAuto as
|
||||
# "cutlass::gemm::collective::" (unless the tag is 'void').
|
||||
#
|
||||
# For TileSchedulerTag, this namespace is fine, since CUTLASS 3
|
||||
# convolutions use the same tile schedulers (from the same
|
||||
# cutlass::gemm::collective namespace) as GEMMs.
|
||||
kernel_schedule = KernelScheduleTag[operation.kernel_schedule].replace('gemm::', 'conv::')
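
Illustration of the namespace rewrite performed by the `.replace` above; the schedule name is the one mentioned in the preceding comment and is shown only as an example.

```python
# "cutlass::gemm::collective::KernelScheduleAuto"
#     .replace('gemm::', 'conv::')
# -> "cutlass::conv::collective::KernelScheduleAuto"
```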
|
||||
tile_scheduler = TileSchedulerTag[operation.tile_scheduler]
|
||||
opcode_class = OpcodeClassTag[operation.tile_description.math_instruction.opcode_class]
|
||||
|
||||
values = {
|
||||
'operation_name': operation.procedural_name(),
|
||||
'conv_kind': ConvKindTag[operation.conv_kind],
|
||||
'conv_kind_name': ConvKindNames[operation.conv_kind].capitalize(),
|
||||
'element_a': DataTypeTag[operation.A.element],
|
||||
'layout_a': LayoutTag[operation.A.layout],
|
||||
'align_a': int(operation.A.alignment),
|
||||
'element_b': DataTypeTag[operation.B.element],
|
||||
'layout_b': LayoutTag[operation.B.layout],
|
||||
'align_b': int(operation.B.alignment),
|
||||
'element_c': DataTypeTag[operation.C.element],
|
||||
'layout_c': LayoutTag[operation.C.layout],
|
||||
'align_c': int(operation.C.alignment),
|
||||
'element_d': DataTypeTag[operation.D.element],
|
||||
'layout_d': LayoutTag[operation.D.layout],
|
||||
'align_d': int(operation.D.alignment),
|
||||
'element_accumulator': DataTypeTag[operation.accumulator_type()],
|
||||
'opcode_class': opcode_class,
|
||||
'arch': self.arch_number_to_type(operation.arch),
|
||||
'tile_shape': self.tile_shape(operation),
|
||||
'cluster_shape': self.cluster_shape(operation),
|
||||
'opcode_class_epi': opcode_class_epi,
|
||||
'opcode_class_main': opcode_class_main,
|
||||
'epi_tile_mn': epi_tile_mn,
|
||||
'stages': self.stage_count(operation),
|
||||
'kernel_schedule': kernel_schedule,
|
||||
'epilogue_schedule': epilogue_schedule,
|
||||
'tile_scheduler': tile_scheduler,
|
||||
'element_compute': DataTypeTag[operation.element_compute]
|
||||
}
|
||||
return Template(self.template).substitute(values)
|
||||
|
||||
class EmitConv3xIncludes:
|
||||
def __init__(self):
|
||||
_LOGGER.debug("*** EmitConv3xIncludes::__init__")
|
||||
self.includes = ['conv_operation_3x.hpp',
|
||||
'cutlass/conv/device/conv_universal_adapter.hpp',
|
||||
'cutlass/conv/kernel/conv_universal.hpp',
|
||||
'cutlass/conv/collective/collective_builder.hpp',
|
||||
'cutlass/epilogue/collective/collective_builder.hpp']
|
||||
|
||||
def emit(self, operation) -> str:
|
||||
_LOGGER.debug("*** EmitConv3xIncludes::emit")
|
||||
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
|
||||
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"
|
||||
@ -37,6 +37,7 @@ Utilities for emitting GEMM kernels
|
||||
import collections
|
||||
import enum
|
||||
import functools
|
||||
import logging
|
||||
import operator
|
||||
import os.path
|
||||
import shutil
|
||||
@ -49,6 +50,8 @@ try:
|
||||
except ImportError:
|
||||
from library import *
|
||||
|
||||
_LOGGER = logging.getLogger(__name__)
|
||||
|
||||
###################################################################################################
|
||||
#
|
||||
# Data structure modeling a GEMM operation
|
||||
@ -139,7 +142,8 @@ class GemmOperation:
|
||||
|
||||
math_operations_map = {
|
||||
MathOperation.xor_popc: 'xor',
|
||||
MathOperation.and_popc: 'and'
|
||||
MathOperation.and_popc: 'and',
|
||||
MathOperation.multiply_add_fast_accum: 'fastaccum',
|
||||
}
|
||||
|
||||
tensor_ops = [
|
||||
@ -256,18 +260,14 @@ class GemmOperation:
|
||||
''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
|
||||
opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
|
||||
if self.arch >= 90:
|
||||
kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{l}_{s}_align{al}{t}{k}{e}"
|
||||
kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}{ct}{cs}_{l}_{s}_align{al}{t}{k}{e}"
|
||||
return kernel_name_template.format(
|
||||
p = self.prefix,
|
||||
ar = self.arch,
|
||||
op = opcode_class_name,
|
||||
ex = self.extended_name_3x(),
|
||||
tbm = self.tile_description.tile_shape[0],
|
||||
tbn = self.tile_description.tile_shape[1],
|
||||
tbk = self.tile_description.tile_shape[2],
|
||||
cm = self.tile_description.cluster_shape[0],
|
||||
cn = self.tile_description.cluster_shape[1],
|
||||
ck = self.tile_description.cluster_shape[2],
|
||||
ct = '_' + 'x'.join([str(i) for i in self.tile_description.tile_shape]) if self.tile_description.tile_shape[0] > 0 else "",
|
||||
cs = '_' + 'x'.join([str(i) for i in self.tile_description.cluster_shape]),
|
||||
l = self.tile_description.stages,
|
||||
s = self.layout_name_3x(),
|
||||
al = str(max(self.A.alignment, self.B.alignment)),
|
||||
@ -725,8 +725,8 @@ class EmitGemmUniversal3xInstance:
|
||||
using ${operation_name}_epilogue =
|
||||
typename cutlass::epilogue::collective::CollectiveBuilder<
|
||||
${arch}, ${opcode_class_epi},
|
||||
cute::Shape<cute::_${tile_shape_m}, cute::_${tile_shape_n}, cute::_${tile_shape_k}>,
|
||||
cute::Shape<cute::_${cluster_m},cute::_${cluster_n},cute::_${cluster_k}>,
|
||||
cute::Shape<cute::_${tile_shape_epi_m}, cute::_${tile_shape_epi_n}, cute::_${tile_shape_epi_k}>,
|
||||
cute::Shape<${cluster_shape_m}, ${cluster_shape_n}, ${cluster_shape_k}>,
|
||||
${epi_tile_mn},
|
||||
${element_accumulator}, ${element_epilogue},
|
||||
${element_c}, ${layout_c}, ${align_c},
|
||||
@ -741,8 +741,8 @@ using ${operation_name}_mainloop =
|
||||
${element_a}, ${layout_a}, ${align_a},
|
||||
${element_b}, ${layout_b}, ${align_b},
|
||||
${element_accumulator},
|
||||
cute::Shape<cute::_${tile_shape_m}, cute::_${tile_shape_n}, cute::_${tile_shape_k}>,
|
||||
cute::Shape<cute::_${cluster_m},cute::_${cluster_n},cute::_${cluster_k}>,
|
||||
cute::Shape<cute::_${tile_shape_main_m}, cute::_${tile_shape_main_n}, cute::_${tile_shape_main_k}>,
|
||||
cute::Shape<${cluster_shape_m}, ${cluster_shape_n}, ${cluster_shape_k}>,
|
||||
${stages},
|
||||
${kernel_schedule}
|
||||
>::CollectiveOp;
|
||||
@ -773,19 +773,33 @@ ${compile_guard_end}
|
||||
|
||||
#
|
||||
def emit(self, operation):
|
||||
_LOGGER.debug("*** EmitGemmConfigurationLibrary::emit(operation)")
|
||||
_LOGGER.debug("*** operation.procedural_name(): " + operation.procedural_name())
|
||||
_LOGGER.debug("*** tile_shape: " + str(operation.tile_description.tile_shape))
|
||||
_LOGGER.debug("*** warp_count: " + str(operation.tile_description.warp_count))
|
||||
|
||||
opcode_class_main = operation.tile_description.math_instruction.opcode_class
|
||||
opcode_class_epi = opcode_class_main
|
||||
|
||||
tile_shape = operation.tile_description.tile_shape
|
||||
warp_count = operation.tile_description.warp_count
|
||||
instruction_shape = operation.tile_description.math_instruction.instruction_shape
|
||||
cluster_m = operation.tile_description.cluster_shape[0]
|
||||
cluster_n = operation.tile_description.cluster_shape[1]
|
||||
|
||||
tile_shape_main_m, tile_shape_main_n, tile_shape_main_k = tile_shape
|
||||
tile_shape_epi_m, tile_shape_epi_n, tile_shape_epi_k = tile_shape
|
||||
|
||||
# account for static/dynamic cluster shapes
|
||||
cta_m = tile_shape[0] // cluster_m if cluster_m > 0 else tile_shape[0]
|
||||
cta_n = tile_shape[1] // cluster_n if cluster_n > 0 else tile_shape[1]
|
||||
|
||||
# stage count set to zero indicates builder automatic stage selection
|
||||
if operation.tile_description.stages > 0:
|
||||
stage_count_string = f"cutlass::gemm::collective::StageCount<{str(operation.tile_description.stages)}>"
|
||||
else:
|
||||
stage_count_string = f"cutlass::gemm::collective::StageCountAutoCarveout<sizeof(typename {str(operation.procedural_name())}_epilogue::SharedStorage)>"
|
||||
warp_shape = [tile_shape[idx] // warp_count[idx] for idx in range(3)]
|
||||
stage_count_string = f"cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename {str(operation.procedural_name())}_epilogue::SharedStorage))>"
|
||||
|
||||
epi_tile_mn = "cutlass::epilogue::collective::EpilogueTileAuto"
|
||||
opcode_class_main = operation.tile_description.math_instruction.opcode_class
|
||||
opcode_class_epi = opcode_class_main
|
||||
|
||||
instance_layout_A, instance_layout_B, instance_layout_C , instance_layout_D = \
|
||||
(operation.A.layout, operation.B.layout, operation.C.layout, operation.D.layout)
|
||||
@ -806,9 +820,6 @@ ${compile_guard_end}
|
||||
element_a = DataTypeTag[operation.A.element]
|
||||
element_b = DataTypeTag[operation.B.element]
|
||||
epilogue_schedule_type = EpilogueScheduleTag[operation.epilogue_schedule]
|
||||
element_a = DataTypeTag[operation.A.element]
|
||||
element_b = DataTypeTag[operation.B.element]
|
||||
epilogue_schedule_type = EpilogueScheduleTag[operation.epilogue_schedule]
|
||||
values = {
  'operation_name': operation.procedural_name(),
  'operation_suffix': self.operation_suffix,

@@ -824,18 +835,18 @@ ${compile_guard_end}
  'opcode_class_main': OpcodeClassTag[opcode_class_main],
  'opcode_class_epi': OpcodeClassTag[opcode_class_epi],
  'arch': "cutlass::arch::Sm%d" % operation.arch,
  'tile_shape_m': str(operation.tile_description.tile_shape[0]),
  'tile_shape_n': str(operation.tile_description.tile_shape[1]),
  'tile_shape_k': str(operation.tile_description.tile_shape[2]),
  'cluster_m': str(operation.tile_description.cluster_shape[0]),
  'cluster_n': str(operation.tile_description.cluster_shape[1]),
  'cluster_k': str(operation.tile_description.cluster_shape[2]),
  'warp_shape_m': str(warp_shape[0]),
  'warp_shape_n': str(warp_shape[1]),
  'warp_shape_k': str(warp_shape[2]),
  'instruction_shape_m': str(operation.tile_description.math_instruction.instruction_shape[0]),
  'instruction_shape_n': str(operation.tile_description.math_instruction.instruction_shape[1]),
  'instruction_shape_k': str(operation.tile_description.math_instruction.instruction_shape[2]),
  'tile_shape_epi_m': str(tile_shape_epi_m),
  'tile_shape_epi_n': str(tile_shape_epi_n),
  'tile_shape_epi_k': str(tile_shape_epi_k),
  'tile_shape_main_m': str(tile_shape_main_m),
  'tile_shape_main_n': str(tile_shape_main_n),
  'tile_shape_main_k': str(tile_shape_main_k),
  'cluster_shape_m': 'cute::_' + str(operation.tile_description.cluster_shape[0]) if operation.tile_description.cluster_shape[0] > 0 else "int",
  'cluster_shape_n': 'cute::_' + str(operation.tile_description.cluster_shape[1]) if operation.tile_description.cluster_shape[1] > 0 else "int",
  'cluster_shape_k': 'cute::_' + str(operation.tile_description.cluster_shape[2]) if operation.tile_description.cluster_shape[2] > 0 else "int",
  'instruction_shape_m': str(instruction_shape[0]),
  'instruction_shape_n': str(instruction_shape[1]),
  'instruction_shape_k': str(instruction_shape[2]),
  'kernel_schedule' : str(KernelScheduleTag[operation.kernel_schedule]),
  'epilogue_schedule' : str(epilogue_schedule_type),
  'epi_tile_mn' : epi_tile_mn,

@@ -1227,6 +1238,10 @@ void initialize_${configuration_name}(Manifest &manifest) {
"""

  def __enter__(self):
    _LOGGER.debug("*** EmitGemmConfigurationLibrary::__enter__")
    _LOGGER.debug("*** configuration_path (file to write): " +
                  str(self.configuration_path))

    self.configuration_file = open(self.configuration_path, "w")
    self.configuration_file.write(self.header_template)
    self.configuration_file.write(self.separator)

@@ -1248,6 +1263,9 @@ void initialize_${configuration_name}(Manifest &manifest) {
    return self

  def emit(self, operation):
    _LOGGER.debug("*** EmitGemmConfigurationLibrary::emit(operation)")
    _LOGGER.debug("*** operation.gemm_kind: " + str(operation.gemm_kind))

    emitter = self.instance_emitter[operation.gemm_kind]()

    for incl in emitter.includes:

@@ -1293,4 +1311,3 @@ void initialize_${configuration_name}(Manifest &manifest) {

###################################################################################################
###################################################################################################

@@ -40,9 +40,22 @@ from itertools import product
import logging
import os.path
import shutil

import sys
import copy
from typing import Any, Optional, Sequence, Tuple

_LOGGER = logging.getLogger(__name__)

def logging_prefix(indent_level: int = 0) -> str:
  """String prefix for start of each debug log entry"""
  prefix = '*** '
  indent = '  '
  return f"{prefix}{indent_level * indent}"

def log_debug_line(line: str, indent_level: int = 0) -> None:
  """Log one line of debug output"""
  prefix = logging_prefix(indent_level)
  _LOGGER.debug(prefix + line)

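# Sketch (assumed usage, not taken from this diff): how the two helpers above
# compose.  With prefix '*** ' and a two-space indent unit,
#   log_debug_line('CreateConvOperator3x', 1)   # logs "***   CreateConvOperator3x"
#   log_debug_line('conv_kind: fprop', 2)       # logs "***     conv_kind: fprop"
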
# Certain usecases of cutlass_library nearly always prefer to run as scripts with
|
||||
# relative imports, rather than via an installed Python package. An example of this
|
||||
@ -792,6 +805,359 @@ def CreateDepthwiseConv2dOperator(manifest, layout, tile_descriptions, data_type
|
||||
|
||||
return operations
|
||||
|
||||
class ConvOperation3x:
  """All parameters of a CUTLASS 3 convolution operation.

  Unlike CUTLASS 2 convolutions, CUTLASS 3 convolutions do not
  distinguish between 2-D and 3-D convolutions by kernel class name.
  Instead, for CUTLASS 3 convolutions, the tensor layouts encode
  whether the convolution is 2-D or 3-D.  Thus, this class deduces
  the OperationKind (either Conv2d or Conv3d) from the layouts,
  rather than taking it as a constructor parameter.
  """
  def __init__(self,
               conv_kind: ConvKind,
               tile_description: TileDescription,
               A: TensorDescription,
               B: TensorDescription,
               C: TensorDescription,
               element_compute: Optional[DataType] = None,
               D: Optional[TensorDescription] = None,
               kernel_schedule: KernelScheduleType = KernelScheduleType.ScheduleAuto,
               epilogue_schedule: EpilogueScheduleType = EpilogueScheduleType.ScheduleAuto,
               tile_scheduler: TileSchedulerType = TileSchedulerType.Default,
               log_indent_level: int = 1):
    log_debug_line(f'ConvOperation3x::init: conv_kind: {conv_kind}', log_indent_level)
    log_indent_level = log_indent_level + 1

    self.conv_kind = conv_kind
    self.tile_description = tile_description
    self.A = A
    self.B = B
    self.C = C
    self.element_compute = C.element if element_compute is None else element_compute
    self.kernel_schedule = kernel_schedule
    self.epilogue_schedule = epilogue_schedule

    self.arch = tile_description.minimum_compute_capability
    self.tile_scheduler = tile_scheduler
    if D is None:
      self.D = C
    else:
      self.D = D

    self.is_3x = True
    self.group_mode = GroupMode.NoneGroup  # CUTLASS 3 convolutions currently aren't grouped

    operation_kind = None
    for layout in (A.layout, B.layout, C.layout):
      assert(isinstance(layout, LayoutType))
      new_operation_kind = convolution_tensor_layout_type_to_operation_kind(layout)
      if operation_kind is None:
        operation_kind = new_operation_kind
      else:  # CUTLASS 3 convolutions don't permit mixing 2-D and 3-D layouts.
        assert(operation_kind == new_operation_kind)
    assert(operation_kind is not None)
    self.operation_kind = operation_kind

  def __str__(self):
    return f"ConvOperation3x: operation_kind={self.operation_kind}, conv_kind={self.conv_kind}, tile_description={self.tile_description}"

def is_complex(self):
|
||||
complex_operators = [
|
||||
MathOperation.multiply_add_complex,
|
||||
MathOperation.multiply_add_complex_gaussian,
|
||||
MathOperation.multiply_add_complex_fast_f32
|
||||
]
|
||||
return self.tile_description.math_instruction.math_operation in complex_operators
|
||||
|
||||
def is_mixed_input(self):
|
||||
return self.A.element != self.B.element
|
||||
|
||||
def accumulator_type(self):
|
||||
accum = self.tile_description.math_instruction.element_accumulator
|
||||
if self.is_complex():
|
||||
return get_complex_from_real(accum)
|
||||
return accum
|
||||
|
||||
def short_math_name(self):
|
||||
prefix = ''
|
||||
if self.tile_description.math_instruction.math_operation == MathOperation.multiply_add_complex_gaussian:
|
||||
prefix = 'g'
|
||||
return prefix + ShortDataTypeNames[self.accumulator_type()]
|
||||
|
||||
def is_tensor_op(self):
|
||||
tensor_ops = [
|
||||
OpcodeClass.TensorOp,
|
||||
OpcodeClass.WmmaTensorOp
|
||||
]
|
||||
return self.tile_description.math_instruction.opcode_class in tensor_ops
|
||||
|
||||
def instruction_shape_string(self):
|
||||
math_operations_map = {
|
||||
MathOperation.xor_popc: 'xor',
|
||||
MathOperation.and_popc: 'and'
|
||||
}
|
||||
if self.is_tensor_op():
|
||||
is0, is1, is2 = self.tile_description.math_instruction.instruction_shape
|
||||
math_op = self.tile_description.math_instruction.math_operation
|
||||
math_op_string = math_operations_map[math_op] if math_op in math_operations_map.keys() else ''
|
||||
return f"{is0}x{is1}x{is2}{math_op_string}"
|
||||
else:
|
||||
return ''
|
||||
|
||||
def intermediate_type_string(self):
|
||||
'''
|
||||
Name of the distinct intermediate type used by the tensor operation,
|
||||
or the empty string if none.
|
||||
|
||||
Tensor ops (opcode_class *TensorOp) may use an intermediate data type
|
||||
that differs from the element type of A or the accumulator type.
|
||||
'''
|
||||
if not self.is_tensor_op():
|
||||
return ''
|
||||
elif self.tile_description.math_instruction.element_a == self.A.element:
|
||||
return ''
|
||||
elif self.tile_description.math_instruction.element_a == self.tile_description.math_instruction.element_accumulator:
|
||||
return ''
|
||||
else:
|
||||
return DataTypeNames[self.tile_description.math_instruction.element_a]
|
||||
|
||||
def core_name(self):
|
||||
inst_shape = self.instruction_shape_string()
|
||||
intermediate_type = self.intermediate_type_string()
|
||||
conv_kind_name = ConvKindNames[self.conv_kind]
|
||||
return f"{self.short_math_name()}{inst_shape}{intermediate_type}{conv_kind_name}"
|
||||
|
||||
def extended_name(self):
|
||||
core_name = self.core_name()
|
||||
element_a = DataTypeNames[self.A.element]
|
||||
element_b = DataTypeNames[self.B.element]
|
||||
element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator]
|
||||
element_c = DataTypeNames[self.C.element]
|
||||
element_d = DataTypeNames[self.D.element]
|
||||
return f"{core_name}_{element_a}_{element_b}_{element_acc}_{element_c}_{element_d}"
|
||||
|
||||
def is_complex(self):
|
||||
complex_operators = [
|
||||
MathOperation.multiply_add_complex,
|
||||
MathOperation.multiply_add_complex_gaussian,
|
||||
MathOperation.multiply_add_complex_fast_f32
|
||||
]
|
||||
return self.tile_description.math_instruction.math_operation in complex_operators
|
||||
|
||||
def layout_names(self):
|
||||
'''Layout strings for A and B, respectively'''
|
||||
if self.is_complex():
|
||||
return (ShortComplexLayoutNames[(self.A.layout, self.A.complex_transform)],
|
||||
ShortComplexLayoutNames[(self.B.layout, self.B.complex_transform)])
|
||||
else:
|
||||
return (ShortLayoutTypeNames[self.A.layout],
|
||||
ShortLayoutTypeNames[self.B.layout])
|
||||
|
||||
def extended_name(self):
|
||||
core_name = self.core_name()
|
||||
element_a = DataTypeNames[self.A.element]
|
||||
element_b = DataTypeNames[self.B.element]
|
||||
element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator]
|
||||
element_c = DataTypeNames[self.C.element]
|
||||
element_d = DataTypeNames[self.D.element]
|
||||
layout_a, layout_b = self.layout_names()
|
||||
return f"{core_name}_{element_a}{layout_a}_{element_b}{layout_b}_{element_acc}_{element_c}_{element_d}"
|
||||
|
||||
def configuration_name(self):
|
||||
prefix = 'cutlass3x'
|
||||
opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
|
||||
tbm = self.tile_description.tile_shape[0]
|
||||
tbn = self.tile_description.tile_shape[1]
|
||||
tbk = self.tile_description.tile_shape[2]
|
||||
cm = self.tile_description.cluster_shape[0]
|
||||
cn = self.tile_description.cluster_shape[1]
|
||||
ck = self.tile_description.cluster_shape[2]
|
||||
alignment = max(self.A.alignment, self.B.alignment)
|
||||
tile_scheduler = TileSchedulerSuffixes[self.tile_scheduler]
|
||||
kernel_schedule = KernelScheduleSuffixes[self.kernel_schedule]
|
||||
epilogue_schedule = EpilogueScheduleSuffixes[self.epilogue_schedule]
|
||||
|
||||
return f"{prefix}_{opcode_class_name}_{self.extended_name()}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{self.tile_description.stages}_align{alignment}{tile_scheduler}{kernel_schedule}{epilogue_schedule}"
|
||||
|
||||
def procedural_name(self):
|
||||
return self.configuration_name()
|
||||
|
||||
def convolution_tensor_layout_type_to_operation_kind(layout: LayoutType) -> OperationKind:
  if layout == LayoutType.TensorNHWC or layout == LayoutType.TensorKCSR:
    return OperationKind.Conv2d
  elif layout == LayoutType.TensorNDHWC or layout == LayoutType.TensorKCSRT:
    return OperationKind.Conv3d
  else:
    raise RuntimeError(f'LayoutType {layout} does not have a corresponding OperationKind')
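
# Sketch (follows directly from the branches above, not additional diff content):
#   convolution_tensor_layout_type_to_operation_kind(LayoutType.TensorNHWC)   # -> OperationKind.Conv2d
#   convolution_tensor_layout_type_to_operation_kind(LayoutType.TensorNDHWC)  # -> OperationKind.Conv3d
#   convolution_tensor_layout_type_to_operation_kind(LayoutType.RowMajor)     # raises RuntimeError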
|
||||
|
||||
def CreateConvOperator3x(manifest: Manifest,
                         dims_and_alignments: Sequence[Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]],
                         tile_descriptions: Sequence[Sequence[TileDescription]],
                         data_types,
                         schedule_pairs: Sequence[Tuple[KernelScheduleType, KernelScheduleType]] = \
                           [(KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto)],
                         complex_transforms: Optional[Sequence[ComplexTransform]] = None,
                         tile_schedulers: Sequence[TileSchedulerType] = [TileSchedulerType.Persistent],
                         conv_kind: ConvKind = ConvKind.Fprop,
                         log_indent_level: int = 1):
  """
  Create zero or more CUTLASS 3 two- or three-dimensional convolution operators.

  Create a CUTLASS 3 convolution operator
  for all feasible combinations of the input parameters.
  Add the operators to the manifest.

  dims_and_alignments: 3-level list.  Each outer list term is a list [A, B, C].
    Each inner list (A, B, or C) has the form [num_spatial_dimensions, alignment].
    Both are integers; the first is the number of spatial dimensions
    (currently, only 2 or 3 are supported), and the second is the byte alignment.
    We deduce the operation_kind (either OperationKind.Conv2d or OperationKind.Conv3d)
    from num_spatial_dimensions.

  This function doesn't take layouts, unlike the GEMM functions.
  CUTLASS 3 convolutions currently support three input layouts:

  * TensorNWC for 1-D convolutions,
  * TensorNHWC for 2-D convolutions, and
  * TensorNDHWC for 3-D convolutions.

  Output (C and D) layouts are the same as input layouts,
  except for Wgrad convolutions, where the layouts are

  * TensorKCS for 1-D convolutions,
  * TensorKCSR for 2-D convolutions, and
  * TensorKCSRT for 3-D convolutions.

  The output layouts are completely constrained by the input layouts
  and the convolution kind.

  tile_descriptions: 2-level list.
    Outer level has one list per math instruction.
    Inner level has one TileDescription for each cluster shape.

  data_types: Either a single data_type dictionary, or a list of them.
    Keys: 'a_type', 'b_type', 'c_type', 'd_type', 'acc_type', 'epi_type'

  complex_transforms: Optional list of pairs.
    First element of each pair is the complex transform for A, and
    second element of each pair is the complex transform for B.

  schedule_pairs: [(kernel_schedule, epilogue_schedule), ...]

  conv_kind: Convolution kind (Fprop, Dgrad, or Wgrad).
  """
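  # Sketch of the argument shapes described in the docstring above (example
  # values are assumptions, not taken from this diff):
  #   dims_and_alignments = [((2, 16), (2, 16), (2, 16))]   # one 2-D entry; A, B, C all 16-byte aligned
  #   data_types = {'a_type': DataType.f16, 'b_type': DataType.f16,
  #                 'c_type': DataType.f32, 'd_type': DataType.f32,
  #                 'acc_type': DataType.f32, 'epi_type': DataType.f32}
  #   schedule_pairs = [(KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto)]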
log_debug_line('CreateConvOperator3x', log_indent_level)
|
||||
log_indent_level = log_indent_level + 1
|
||||
log_debug_line(f'conv_kind: {conv_kind}', log_indent_level)
|
||||
|
||||
for triple in dims_and_alignments:
|
||||
spatial_dimensionality = None # to be determined by loop below
|
||||
assert(len(triple) == 3)
|
||||
for entry in triple: # [A, B, C]
|
||||
assert(len(entry) == 2)
|
||||
[dim, alignment] = entry
|
||||
assert(type(dim) is int)
|
||||
assert(dim == 2 or dim == 3)
|
||||
assert(type(alignment) is int)
|
||||
assert(alignment > 0)
|
||||
if spatial_dimensionality is None:
|
||||
spatial_dimensionality = dim
|
||||
else:
|
||||
# A, B, and C need to have the same spatial dimensionality
|
||||
assert(spatial_dimensionality == dim)
|
||||
|
||||
def input_and_output_layouts(spatial_dim: int, kind: ConvKind) -> Tuple[LayoutType, LayoutType]:
|
||||
if spatial_dim == 1:
|
||||
input_layout = LayoutType.TensorNWC
|
||||
if kind == ConvKind.Wgrad:
|
||||
output_layout = LayoutType.TensorKCS
|
||||
else:
|
||||
output_layout = input_layout
|
||||
elif spatial_dim == 2:
|
||||
input_layout = LayoutType.TensorNHWC
|
||||
if kind == ConvKind.Wgrad:
|
||||
output_layout = LayoutType.TensorKCSR
|
||||
else:
|
||||
output_layout = input_layout
|
||||
elif spatial_dim == 3:
|
||||
input_layout = LayoutType.TensorNDHWC
|
||||
if kind == ConvKind.Wgrad:
|
||||
output_layout = LayoutType.TensorKCSRT
|
||||
else:
|
||||
output_layout = input_layout
|
||||
else:
|
||||
assert(False)
|
||||
return (input_layout, output_layout)
|
||||
|
||||
def dims_to_layouts(A_B_C: Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]) -> \
|
||||
Tuple[Tuple[LayoutType, int], Tuple[LayoutType, int], Tuple[LayoutType, int]]:
|
||||
[A, B, C] = A_B_C
|
||||
[spatial_dim, alignment] = A
|
||||
[input_layout, output_layout] = input_and_output_layouts(spatial_dim, conv_kind)
|
||||
return ((input_layout, A[1]),
|
||||
(input_layout, B[1]),
|
||||
(output_layout, C[1]))
|
||||
|
||||
# layouts: list of triples (A, B, C).
|
||||
# Each of A, B, and C has the form [layout, alignment].
|
||||
layouts = [dims_to_layouts(A_B_C) for A_B_C in dims_and_alignments]
|
||||
|
||||
if type(data_types) is dict:
|
||||
data_types = [data_types]
|
||||
|
||||
for s in schedule_pairs:
|
||||
assert(len(s) == 2)
|
||||
|
||||
if complex_transforms is None:
|
||||
complex_transforms = [(ComplexTransform.none, ComplexTransform.none)]
|
||||
|
||||
# product produces a one-pass generator, so the loop must call it anew each time.
|
||||
def make_combinations():
|
||||
return product(
|
||||
layouts,
|
||||
tile_descriptions,
|
||||
data_types,
|
||||
complex_transforms,
|
||||
schedule_pairs,
|
||||
tile_schedulers
|
||||
)
|
||||
|
||||
operations = []
|
||||
for layout_triple, tile_description, data_type, complex_transform_pair, schedule_pair, tile_scheduler in make_combinations():
|
||||
A_layout, A_alignment = layout_triple[0]
|
||||
A_xform = complex_transform_pair[0]
|
||||
B_layout, B_alignment = layout_triple[1]
|
||||
B_xform = complex_transform_pair[1]
|
||||
C_layout, C_alignment = layout_triple[2]
|
||||
D_layout = C_layout
|
||||
D_alignment = C_alignment
|
||||
|
||||
A = TensorDescription(data_type["a_type"], A_layout, A_alignment, A_xform)
|
||||
B = TensorDescription(data_type["b_type"], B_layout, B_alignment, B_xform)
|
||||
C = TensorDescription(data_type["c_type"], C_layout, C_alignment)
|
||||
D = TensorDescription(data_type["d_type"], D_layout, D_alignment)
|
||||
element_compute = data_type.get("epi_type", data_type["acc_type"])
|
||||
kernel_schedule, epilogue_schedule = schedule_pair
|
||||
|
||||
operation = ConvOperation3x(conv_kind=conv_kind,
|
||||
tile_description=tile_description,
|
||||
A=A,
|
||||
B=B,
|
||||
C=C,
|
||||
element_compute=element_compute,
|
||||
D=D,
|
||||
kernel_schedule=kernel_schedule,
|
||||
epilogue_schedule=epilogue_schedule,
|
||||
tile_scheduler=tile_scheduler,
|
||||
log_indent_level=log_indent_level)
|
||||
log_debug_line(f'Created ConvOperation3x: {str(operation)}', log_indent_level)
|
||||
manifest.append(operation)
|
||||
operations.append(operation)
|
||||
|
||||
return operations
|
||||
|
||||
###################################################################################################
|
||||
###################################################################################################
|
||||
|
||||
@ -2233,8 +2599,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
|
||||
min_cc = 80
|
||||
max_cc = 1024
|
||||
|
||||
# For mixed-input, alignment constraints are a list of lists, where the
# inner list contains the alignment constraints for operands/matrices
|
||||
# [[alignA, alignB, alignC],..]
|
||||
alignment_constraints = [[16, 8, 8],]
|
||||
|
||||
@ -2277,7 +2643,7 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
|
||||
]
|
||||
|
||||
operations += CreateGemmOperator(manifest, layouts, tile_descriptions, \
|
||||
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
|
||||
|
||||
for op in operations:
|
||||
if (DataTypeSize[op.C.element] == 16) and \
|
||||
@ -2320,8 +2686,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
|
||||
min_cc = 80
|
||||
max_cc = 1024
|
||||
|
||||
# For mixed-input, alignment constraints are a list of lists, where the
# inner list contains the alignment constraints for operands/matrices
|
||||
# [[alignA, alignB, alignC],..]
|
||||
alignment_constraints = [[8, 16, 8],]
|
||||
|
||||
@ -2346,8 +2712,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
|
||||
TileDescription([128, 16, 32], 5, [2, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 16, 32], 3, [2, 1, 1], math_inst, min_cc, max_cc),
|
||||
# 256x16
|
||||
TileDescription([256, 16, 32], 5, [2, 1, 1], math_inst, min_cc, max_cc),
TileDescription([256, 16, 32], 3, [2, 1, 1], math_inst, min_cc, max_cc),
|
||||
]
|
||||
|
||||
data_type = [
|
||||
@ -2372,7 +2738,7 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
|
||||
]
|
||||
|
||||
operations = CreateGemmOperator(manifest, layouts, tile_descriptions, \
|
||||
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
|
||||
|
||||
for op in operations:
|
||||
if op.tile_description.threadblock_shape[1] <= 32:
|
||||
@ -4326,6 +4692,241 @@ def GenerateSM80(manifest, cuda_version):
|
||||
|
||||
###################################################################################################
|
||||
|
||||
def GenerateSM89_TensorOp_16832_fp8(manifest, cuda_version):
|
||||
if (
|
||||
not CudaToolkitVersionSatisfies(cuda_version, 12, 4)
|
||||
):
|
||||
return
|
||||
|
||||
layouts = [
|
||||
(LayoutType.RowMajor, LayoutType.ColumnMajor, LayoutType.ColumnMajor)
|
||||
]
|
||||
|
||||
math_instructions = [
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e4m3, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e4m3, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e5m2, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e5m2, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e4m3, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e4m3, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e5m2, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 32],
|
||||
DataType.e5m2, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
]
|
||||
|
||||
min_cc = 89
|
||||
max_cc = 89
|
||||
|
||||
alignment_constraints = [16,]
|
||||
alignment_constraints_small_channels = [16, 8, 4]
|
||||
|
||||
for math_inst in math_instructions:
|
||||
tile_descriptions = [
|
||||
TileDescription([256, 128, 64], 3, [4, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 256, 64], 3, [2, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 128, 64], 6, [4, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 256, 64], 6, [2, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 64, 64], 3, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 256, 64], 3, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 64, 64], 4, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 256, 64], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 32, 64], 4, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 32, 256, 64], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 64], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 64], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 64], 5, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 64, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 128, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 32, 64], 6, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 32, 128, 64], 6, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 64], 10, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 128, 128], 3, [4, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 256, 128], 3, [2, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 64, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 32, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 32, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 128], 5, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 64, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 64, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 128, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 32, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 32, 128, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 128], 5, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 128], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
]
|
||||
|
||||
data_types = [
|
||||
[
|
||||
math_inst.element_a,
|
||||
math_inst.element_b,
|
||||
DataType.f32,
|
||||
math_inst.element_accumulator
|
||||
],
|
||||
]
|
||||
|
||||
operations = []
|
||||
for data_type in data_types:
|
||||
operations += CreateGemmOperator(manifest, layouts, tile_descriptions, data_type,
|
||||
alignment_constraints, None, EpilogueFunctor.LinearCombination)
|
||||
|
||||
conv_layout = (LayoutType.TensorNHWC, LayoutType.TensorNHWC, LayoutType.TensorNHWC)
|
||||
operations += CreateConv2dOperator(manifest, conv_layout, tile_descriptions,
|
||||
data_type, alignment_constraints, [ConvKind.Fprop], EpilogueFunctor.LinearCombination)
|
||||
|
||||
operations += CreateConv2dFixedChannelsOperator(manifest, conv_layout, tile_descriptions,
|
||||
data_type, alignment_constraints_small_channels, [ConvKind.Fprop], EpilogueFunctor.LinearCombination)
|
||||
|
||||
for op in operations:
|
||||
if op.tile_description.threadblock_shape[1] >= 128:
|
||||
if op.tile_description.threadblock_shape[0] == 32:
|
||||
op.C.alignment = 8
|
||||
else:
|
||||
op.C.alignment = 16
|
||||
else:
|
||||
op.C.alignment = 8
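
# Sketch (follows from the loop above, not additional diff content):
#   threadblock_shape [ 32, 256, 64] -> op.C.alignment = 8    (N >= 128, M == 32)
#   threadblock_shape [128, 256, 64] -> op.C.alignment = 16   (N >= 128, M != 32)
#   threadblock_shape [128,  64, 64] -> op.C.alignment = 8    (N < 128)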
|
||||
|
||||
#
|
||||
def GenerateSM89_SparseTensorOp_16864_fp8(manifest, cuda_version):
|
||||
|
||||
if (
|
||||
not CudaToolkitVersionSatisfies(cuda_version, 12, 4)
|
||||
):
|
||||
return
|
||||
|
||||
layouts = [
|
||||
(LayoutType.RowMajor, LayoutType.ColumnMajor, LayoutType.RowMajor)
|
||||
]
|
||||
|
||||
math_instructions = [
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e4m3, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e4m3, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e5m2, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e5m2, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e4m3, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e4m3, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e5m2, DataType.e4m3, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
MathInstruction(
|
||||
[16, 8, 64],
|
||||
DataType.e5m2, DataType.e5m2, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add_fast_accum),
|
||||
]
|
||||
|
||||
min_cc = 89
|
||||
max_cc = 89
|
||||
|
||||
alignment_constraints = [16,]
|
||||
|
||||
for math_inst in math_instructions:
|
||||
tile_descriptions = [
|
||||
TileDescription([128, 64, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 128, 128], 3, [4, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 256, 128], 3, [2, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([256, 64, 128], 3, [4, 1, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 128, 128], 6, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 128, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([128, 64, 256], 4, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 128, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
TileDescription([ 64, 64, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
|
||||
]
|
||||
|
||||
data_types = [
|
||||
[
|
||||
math_inst.element_a,
|
||||
math_inst.element_b,
|
||||
DataType.f32,
|
||||
math_inst.element_accumulator
|
||||
],
|
||||
]
|
||||
|
||||
operations = []
|
||||
for data_type in data_types:
|
||||
operations += CreateSparseGemmOperator(manifest, layouts, tile_descriptions, data_type,
|
||||
alignment_constraints, None, EpilogueFunctor.LinearCombination)
|
||||
|
||||
for op in operations:
|
||||
if op.tile_description.threadblock_shape[1] >= 128:
|
||||
op.C.alignment = 16
|
||||
else:
|
||||
op.C.alignment = 8
|
||||
|
||||
###################################################################################################
|
||||
|
||||
#
|
||||
def GenerateSM89(manifest, cuda_version):
|
||||
GenerateSM89_TensorOp_16832_fp8(manifest, cuda_version)
|
||||
GenerateSM89_SparseTensorOp_16864_fp8(manifest, cuda_version)
|
||||
|
||||
###################################################################################################
|
||||
|
||||
#
|
||||
def GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version):
|
||||
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
|
||||
@ -4790,7 +5391,7 @@ def GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version):
|
||||
DataType.tf32, DataType.tf32, DataType.f32,
|
||||
OpcodeClass.TensorOp,
|
||||
MathOperation.multiply_add)
|
||||
|
||||
|
||||
min_cc = 90
|
||||
max_cc = 90
|
||||
|
||||
@ -4798,7 +5399,7 @@ def GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version):
|
||||
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
|
||||
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
|
||||
]
|
||||
|
||||
|
||||
tile_descriptions_small = [
|
||||
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
|
||||
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
|
||||
@ -5395,7 +5996,7 @@ def GenerateSM90_TensorOp_fp8_WGMMA_alignx_gemm(manifest, cuda_version):
|
||||
]
|
||||
stream_k_schedules = []
|
||||
|
||||
|
||||
|
||||
for data_type in data_types:
|
||||
# With No-SMEM epilogues
|
||||
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, schedules)
|
||||
@ -6013,7 +6614,102 @@ def GenerateSM90_TensorOp_1684_symm_complex_gaussian(manifest, cuda_version):
|
||||
|
||||
###################################################################################################
|
||||
|
||||
#
|
||||
def GenerateSM90_Conv3x(manifest, cuda_version,
|
||||
log_indent_level: int = 0):
|
||||
"""
|
||||
Generate CUTLASS 3 convolution kernel(s) for SM90.
|
||||
|
||||
This is meant to be called from GenerateSM90.
|
||||
"""
|
||||
log_debug_line('GenerateSM90_Conv3x', log_indent_level)
|
||||
log_indent_level = log_indent_level + 1
|
||||
|
||||
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
|
||||
return
|
||||
|
||||
minimum_compute_capability = 90
|
||||
maximum_compute_capability = 90
|
||||
|
||||
spatial_dims = [2, 3]
|
||||
|
||||
def make_dims_and_alignments_triple(dim: int):
|
||||
byte_alignment_required_by_tma = 16
|
||||
return ((dim, byte_alignment_required_by_tma), # A
|
||||
(dim, byte_alignment_required_by_tma), # B
|
||||
(dim, byte_alignment_required_by_tma)) # C
|
||||
dims_and_alignments = [make_dims_and_alignments_triple(dim) for dim in spatial_dims]
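
# Sketch (follows from the helper above, not additional diff content): with
# spatial_dims = [2, 3], dims_and_alignments becomes
#   [((2, 16), (2, 16), (2, 16)),
#    ((3, 16), (3, 16), (3, 16))]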
|
||||
|
||||
def make_math_instruction(data_types: Tuple[DataType, DataType, DataType],
|
||||
instruction_shape: Tuple[int, int, int]) -> MathInstruction:
|
||||
default_opcode = OpcodeClass.TensorOp
|
||||
default_math_op = MathOperation.multiply_add
|
||||
[A_data_type, B_data_type, C_data_type] = data_types
|
||||
return MathInstruction(
|
||||
instruction_shape,
|
||||
A_data_type, B_data_type, C_data_type,
|
||||
default_opcode,
|
||||
default_math_op
|
||||
)
|
||||
data_types_and_instruction_shapes = [
|
||||
((DataType.f16, DataType.f16, DataType.f16), (64, 64, 16)),
|
||||
((DataType.f16, DataType.f16, DataType.f32), (64, 64, 16)),
|
||||
((DataType.bf16, DataType.bf16, DataType.f32), (64, 64, 16)),
|
||||
]
|
||||
math_instructions = map(lambda x: make_math_instruction(*x),
|
||||
data_types_and_instruction_shapes)
|
||||
|
||||
cluster_shapes = [
|
||||
[2, 1, 1],
|
||||
[1, 1, 1],
|
||||
]
|
||||
conv_kinds = [
|
||||
ConvKind.Fprop,
|
||||
ConvKind.Dgrad
|
||||
]
|
||||
mainloop_schedule = KernelScheduleType.ImplicitTmaWarpSpecializedSm90
|
||||
stages = 0 # zero means "deduce the number of stages automatically"
|
||||
|
||||
# tile_descriptions is a 2-level list.
|
||||
# Each inner list is for each cluster shape.
|
||||
for math_inst in math_instructions:
|
||||
tile_descriptions = []
|
||||
for cluster_shape in cluster_shapes:
|
||||
tile_shape = [
|
||||
math_inst.instruction_shape[0],
|
||||
math_inst.instruction_shape[1],
|
||||
math_inst.instruction_shape[2] * 4
|
||||
]
|
||||
warp_count = [4, 1, 1]
|
||||
tile_description = TileDescription(
|
||||
tile_shape, stages, warp_count, math_inst,
|
||||
minimum_compute_capability, maximum_compute_capability,
|
||||
cluster_shape)
|
||||
tile_descriptions.append(tile_description)
|
||||
|
||||
# It's typical to get the data types from the math instruction.
|
||||
data_type = {
|
||||
"a_type" : math_inst.element_a,
|
||||
"b_type" : math_inst.element_b,
|
||||
"c_type" : math_inst.element_accumulator,
|
||||
"d_type" : math_inst.element_accumulator,
|
||||
"acc_type" : math_inst.element_accumulator,
|
||||
"epi_type" : math_inst.element_accumulator
|
||||
}
|
||||
|
||||
for conv_kind in conv_kinds:
|
||||
epilogue_schedule = EpilogueScheduleType.TmaWarpSpecialized
|
||||
schedule_pairs = [
|
||||
(mainloop_schedule, epilogue_schedule)
|
||||
]
|
||||
CreateConvOperator3x(manifest,
|
||||
dims_and_alignments = dims_and_alignments,
|
||||
tile_descriptions = tile_descriptions,
|
||||
data_types = data_type,
|
||||
schedule_pairs = schedule_pairs,
|
||||
tile_schedulers = [TileSchedulerType.Default], # -> void
|
||||
conv_kind = conv_kind,
|
||||
log_indent_level = log_indent_level)
|
||||
|
||||
def GenerateSM90(manifest, cuda_version):
|
||||
GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version)
|
||||
GenerateSM90_TensorOp_16b_WGMMA_alignx_gemm(manifest, cuda_version)
|
||||
@ -6035,6 +6731,7 @@ def GenerateSM90(manifest, cuda_version):
|
||||
GenerateSM90_TensorOp_1684_symm(manifest, cuda_version)
|
||||
GenerateSM90_TensorOp_1684_symm_complex(manifest, cuda_version)
|
||||
GenerateSM90_TensorOp_1684_symm_complex_gaussian(manifest, cuda_version)
|
||||
GenerateSM90_Conv3x(manifest, cuda_version)
|
||||
|
||||
###################################################################################################
|
||||
|
||||
@ -6094,6 +6791,7 @@ if __name__ == "__main__":
|
||||
GenerateSM70(manifest, args.cuda_version)
|
||||
GenerateSM75(manifest, args.cuda_version)
|
||||
GenerateSM80(manifest, args.cuda_version)
|
||||
GenerateSM89(manifest, args.cuda_version)
|
||||
GenerateSM90(manifest, args.cuda_version)
|
||||
if 'library' in args.generator_target.split(','):
|
||||
manifest.emit(GeneratorTarget.Library)
|
||||
|
||||
@@ -39,12 +39,12 @@ import re

# The following block implements enum.auto() for Python 3.5 variants that don't include it such
# as the default 3.5.2 on Ubuntu 16.04.
#
# https://codereview.stackexchange.com/questions/177309/reimplementing-pythons-enum-auto-for-compatibility

try:
  from enum import auto as enum_auto
except ImportError:
  __cutlass_library_auto_enum = 0
  def enum_auto() -> int:
    global __cutlass_library_auto_enum

@ -298,10 +298,11 @@ class MathOperation(enum.Enum):
|
||||
multiply_add_complex_fast_f32 = enum_auto()
|
||||
multiply_add_complex = enum_auto()
|
||||
multiply_add_complex_gaussian = enum_auto()
|
||||
multiply_add_fast_accum = enum_auto()
|
||||
|
||||
#
|
||||
MathOperationTag = {
|
||||
MathOperation.multiply_add: 'cutlass::arch::OpMultiplyAdd',
|
||||
MathOperation.multiply_add_saturate: 'cutlass::arch::OpMultiplyAddSaturate',
|
||||
MathOperation.multiply_add_mixed_input_upcast: 'cutlass::arch::OpMultiplyAddMixedInputUpcast',
|
||||
MathOperation.xor_popc: 'cutlass::arch::OpXorPopc',
|
||||
@ -312,6 +313,7 @@ MathOperationTag = {
|
||||
MathOperation.multiply_add_complex_fast_f32: 'cutlass::arch::OpMultiplyAddComplexFastF32',
|
||||
MathOperation.multiply_add_complex: 'cutlass::arch::OpMultiplyAddComplex',
|
||||
MathOperation.multiply_add_complex_gaussian: 'cutlass::arch::OpMultiplyAddGaussianComplex',
|
||||
MathOperation.multiply_add_fast_accum: 'cutlass::arch::OpMultiplyAddFastAccum',
|
||||
}
|
||||
|
||||
###################################################################################################
|
||||
@ -326,6 +328,7 @@ class LayoutType(enum.Enum):
|
||||
RowMajorInterleaved32 = enum_auto()
|
||||
ColumnMajorInterleaved64 = enum_auto()
|
||||
RowMajorInterleaved64 = enum_auto()
|
||||
TensorNWC = enum_auto()
|
||||
TensorNHWC = enum_auto()
|
||||
TensorNDHWC = enum_auto()
|
||||
TensorNCHW = enum_auto()
|
||||
@ -334,6 +337,9 @@ class LayoutType(enum.Enum):
|
||||
TensorNC64HW64 = enum_auto()
|
||||
TensorC32RSK32 = enum_auto()
|
||||
TensorC64RSK64 = enum_auto()
|
||||
TensorKCS = enum_auto()
|
||||
TensorKCSR = enum_auto()
|
||||
TensorKCSRT = enum_auto()
|
||||
|
||||
#
|
||||
LayoutTag = {
|
||||
@ -345,6 +351,7 @@ LayoutTag = {
|
||||
LayoutType.RowMajorInterleaved32: 'cutlass::layout::RowMajorInterleaved<32>',
|
||||
LayoutType.ColumnMajorInterleaved64: 'cutlass::layout::ColumnMajorInterleaved<64>',
|
||||
LayoutType.RowMajorInterleaved64: 'cutlass::layout::RowMajorInterleaved<64>',
|
||||
LayoutType.TensorNWC: 'cutlass::layout::TensorNWC',
|
||||
LayoutType.TensorNHWC: 'cutlass::layout::TensorNHWC',
|
||||
LayoutType.TensorNDHWC: 'cutlass::layout::TensorNDHWC',
|
||||
LayoutType.TensorNCHW: 'cutlass::layout::TensorNCHW',
|
||||
@ -353,6 +360,9 @@ LayoutTag = {
|
||||
LayoutType.TensorC32RSK32: 'cutlass::layout::TensorCxRSKx<32>',
|
||||
LayoutType.TensorNC64HW64: 'cutlass::layout::TensorNCxHWx<64>',
|
||||
LayoutType.TensorC64RSK64: 'cutlass::layout::TensorCxRSKx<64>',
|
||||
LayoutType.TensorKCS: 'cutlass::layout::TensorKCS',
|
||||
LayoutType.TensorKCSR: 'cutlass::layout::TensorKCSR',
|
||||
LayoutType.TensorKCSRT: 'cutlass::layout::TensorKCSRT'
|
||||
}
|
||||
|
||||
#
|
||||
@ -378,6 +388,7 @@ ShortLayoutTypeNames = {
|
||||
LayoutType.RowMajorInterleaved2: 't2',
|
||||
LayoutType.RowMajorInterleaved32: 't32',
|
||||
LayoutType.RowMajorInterleaved64: 't64',
|
||||
LayoutType.TensorNWC: 'nwc',
|
||||
LayoutType.TensorNHWC: 'nhwc',
|
||||
LayoutType.TensorNDHWC: 'ndhwc',
|
||||
LayoutType.TensorNCHW: 'nchw',
|
||||
@ -385,7 +396,10 @@ ShortLayoutTypeNames = {
|
||||
LayoutType.TensorNC32HW32: 'nc32hw32',
|
||||
LayoutType.TensorNC64HW64: 'nc64hw64',
|
||||
LayoutType.TensorC32RSK32: 'c32rsk32',
|
||||
LayoutType.TensorC64RSK64: 'c64rsk64'
|
||||
LayoutType.TensorC64RSK64: 'c64rsk64',
|
||||
LayoutType.TensorKCS: 'kcs',
|
||||
LayoutType.TensorKCSR: 'kcsr',
|
||||
LayoutType.TensorKCSRT: 'kcsrt'
|
||||
}
|
||||
|
||||
#
|
||||
@ -410,6 +424,7 @@ class KernelScheduleType(enum.Enum):
|
||||
TmaWarpSpecializedFP8FastAccum = enum_auto()
|
||||
TmaWarpSpecializedCooperativeFP8FastAccum = enum_auto()
|
||||
TmaWarpSpecializedPingpongFP8FastAccum = enum_auto()
|
||||
ImplicitTmaWarpSpecializedSm90 = enum_auto()
|
||||
#
|
||||
KernelScheduleTag = {
|
||||
KernelScheduleType.ScheduleAuto: 'cutlass::gemm::collective::KernelScheduleAuto',
|
||||
@ -424,6 +439,7 @@ KernelScheduleTag = {
|
||||
KernelScheduleType.TmaWarpSpecializedFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum',
|
||||
KernelScheduleType.TmaWarpSpecializedCooperativeFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum',
|
||||
KernelScheduleType.TmaWarpSpecializedPingpongFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum',
|
||||
KernelScheduleType.ImplicitTmaWarpSpecializedSm90: 'cutlass::conv::KernelImplicitTmaWarpSpecializedSm90',
|
||||
}
|
||||
|
||||
#
|
||||
@ -440,6 +456,7 @@ KernelScheduleSuffixes = {
|
||||
KernelScheduleType.TmaWarpSpecializedFP8FastAccum: '_warpspecialized_fp8_fastaccum',
|
||||
KernelScheduleType.TmaWarpSpecializedCooperativeFP8FastAccum: '_warpspecialized_cooperative_fp8_fastaccum',
|
||||
KernelScheduleType.TmaWarpSpecializedPingpongFP8FastAccum: '_warpspecialized_pingpong_fp8_fastaccum',
|
||||
KernelScheduleType.ImplicitTmaWarpSpecializedSm90: '_warpspecialized',
|
||||
}
|
||||
|
||||
class EpilogueScheduleType(enum.Enum):
|
||||
@ -578,8 +595,8 @@ class OperationKind(enum.Enum):
|
||||
Rank2K = enum_auto()
|
||||
Trmm = enum_auto()
|
||||
Symm = enum_auto()
|
||||
Conv2d = enum_auto()
Conv3d = enum_auto()
|
||||
|
||||
#
|
||||
OperationKindNames = {
|
||||
@ -588,11 +605,11 @@ OperationKindNames = {
|
||||
, OperationKind.Rank2K: 'rank_2k'
|
||||
, OperationKind.Trmm: 'trmm'
|
||||
, OperationKind.Symm: 'symm'
|
||||
, OperationKind.Conv2d: 'conv2d'
, OperationKind.Conv3d: 'conv3d'
|
||||
}
|
||||
|
||||
#
|
||||
#
|
||||
class Target(enum.Enum):
|
||||
library = enum_auto()
|
||||
#
|
||||
@ -708,7 +725,7 @@ class SwizzlingFunctor(enum.Enum):
|
||||
StridedDgradIdentity4 = enum_auto()
|
||||
StridedDgradHorizontal = enum_auto()
|
||||
StreamK = enum_auto()
|
||||
|
||||
|
||||
#
|
||||
SwizzlingFunctorTag = {
|
||||
SwizzlingFunctor.Identity1: 'cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>',
|
||||
@ -834,11 +851,11 @@ GroupModeNames = {
|
||||
|
||||
#
|
||||
class MathInstruction:
|
||||
def __init__(self,
             instruction_shape, \
             element_a, element_b, element_accumulator, \
             opcode_class, math_operation = MathOperation.multiply_add \
            ):
|
||||
|
||||
self.instruction_shape = instruction_shape
|
||||
self.element_a = element_a
|
||||
@ -887,15 +904,15 @@ class Direct2dConvFixedStrideDilationTileDescription:
|
||||
self.maximum_compute_capability = max_compute
|
||||
|
||||
def procedural_name(self):
|
||||
str_name = "%dx%dx%d_%dx%dx%dx%d_%d_filter%dx%d" % (self.threadblock_shape[0],
  self.threadblock_shape[1],
|
||||
self.threadblock_shape[2],
|
||||
self.threadblock_output_shape[0],
|
||||
self.threadblock_output_shape[1],
|
||||
self.threadblock_output_shape[2],
|
||||
self.threadblock_output_shape[3],
|
||||
self.stages,
self.filter_shape[0],
|
||||
self.filter_shape[1])
|
||||
# Fixed Strided and dilation
|
||||
if self.stride != [-1, -1] and self.dilation != [-1, -1]:
|
||||
@ -920,15 +937,15 @@ class Direct2dConvFixedStrideDilationTileDescription:
|
||||
self.maximum_compute_capability = max_compute
|
||||
|
||||
def procedural_name(self):
|
||||
str_name = "%dx%dx%d_%dx%dx%dx%d_%d_filter%dx%d" % (self.threadblock_shape[0],
  self.threadblock_shape[1],
|
||||
self.threadblock_shape[2],
|
||||
self.threadblock_output_shape[0],
|
||||
self.threadblock_output_shape[1],
|
||||
self.threadblock_output_shape[2],
|
||||
self.threadblock_output_shape[3],
|
||||
self.stages,
self.filter_shape[0],
|
||||
self.filter_shape[1])
|
||||
# Fixed Strided and dilation
|
||||
if self.stride != [-1, -1] and self.dilation != [-1, -1]:
|
||||
|
||||
@ -67,6 +67,26 @@ _LOGGER = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class EmitOperationKindAll:
  """
  Emit the OperationKind-level CUTLASS library initialization code.
  The code is generated in the {generated_path}/{operation_kind} directory
  (e.g., tools/library/generated/gemm in the build directory,
  for OperationKind=Gemm), in the all_{operation_kind}_operations.cu file
  (e.g., all_gemm_operations.cu for OperationKind=Gemm).
  That file declares several functions in namespace cutlass::library.
  The functions all have this form,

  void initialize_${configuration_name}(Manifest& manifest);

  The file also _defines_ the following function in that namespace.

  void initialize_all_${operation_kind}_operations(Manifest& manifest);

  That function calls all of the functions declared in this file.
  Those functions are defined in subdirectories
  (which this class does not create).
  """

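  # Sketch (assumed usage, not taken from this diff): like the other Emit*
  # classes in this file, EmitOperationKindAll is meant to be used as a
  # context manager, e.g.
  #   with EmitOperationKindAll(generated_path, OperationKind.Gemm, args) as emitter:
  #     emitter.emit(operations_by_min_cc)   # dict: min_cc -> {configuration_name: [operations]}
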
def __init__(self, generated_path, kind, args):
|
||||
self.generated_path = generated_path
|
||||
self.kind = kind
|
||||
@ -109,10 +129,15 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
|
||||
|
||||
#
|
||||
def __enter__(self):
|
||||
_LOGGER.debug("*** EmitOperationKindAll::__enter__")
|
||||
|
||||
self.operation_path = os.path.join(self.generated_path, OperationKindNames[self.kind])
|
||||
_LOGGER.debug('*** operation_path (directory to create): ' +
|
||||
str(self.operation_path));
|
||||
os.makedirs(self.operation_path, exist_ok=True)
|
||||
|
||||
self.top_level_path = os.path.join(self.operation_path, f"all_{OperationKindNames[self.kind]}_operations.cu")
|
||||
_LOGGER.debug(f"*** top_level_path (file to write): {str(self.top_level_path)}")
|
||||
|
||||
self.top_level_file = open(self.top_level_path, "w")
|
||||
self.top_level_file.write(self.header_template)
|
||||
@ -125,13 +150,22 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
|
||||
|
||||
#
|
||||
def emit(self, operations):
|
||||
_LOGGER.debug('*** EmitOperationKindAll::emit')
|
||||
_LOGGER.debug(f"*** len(operations): {len(operations)}")
|
||||
_LOGGER.debug(f"*** min_cc list: {sorted(min_cc for min_cc, _ in operations.items())}")
|
||||
|
||||
for min_cc, configurations in sorted(operations.items()):
|
||||
_LOGGER.debug(f"*** min_cc={min_cc}")
|
||||
|
||||
for configuration_name, _ in configurations.items():
|
||||
_LOGGER.debug(f"*** configuration_name={configuration_name}")
|
||||
self.configurations.append(configuration_name)
|
||||
self.top_level_file.write(SubstituteTemplate(self.configuration_prototype_template, {'configuration_name': configuration_name} ))
|
||||
|
||||
#
|
||||
def __exit__(self, exception_type, exception_value, traceback):
|
||||
_LOGGER.debug("*** EmitOperationKindAll::__exit__")
|
||||
|
||||
self.top_level_file.write(SubstituteTemplate(self.entry_template, {'operation_name': OperationKindNames[self.kind]}))
|
||||
|
||||
for configuration_name in self.configurations:
|
||||
@ -142,6 +176,37 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
|
||||
|
||||
|
||||
class EmitOperationKindLibrary:
|
||||
"""
|
||||
Emit the CUTLASS library initialization code for each OperationKind.
|
||||
The code is generated in the directory
|
||||
{generated_path}/{operation_kind}/{min_cc}
|
||||
(e.g., tools/library/generated/gemm/90 in the build directory,
|
||||
for min_cc=90 and OperationKind=Gemm), in the file
|
||||
all_sm{min_cc}_{operation_kind}_operations.cu
|
||||
(e.g., all_sm90_gemm_operations.cu for min_cc=90 and OperationKind=Gemm).
|
||||
The min_cc variable here indicates the minimum GPU architecture version
|
||||
that the things to be initialized require.
|
||||
For example, min_cc=90 indicates sm90.
|
||||
|
||||
That file declares several functions in namespace cutlass::library.
|
||||
The functions all have this form,
|
||||
|
||||
void initialize_all_sm{min_cc}_{subclass_name}_{extended_name}_operations(Manifest& manifest);
|
||||
|
||||
where extended_name is operation.extended_name() for all the operations
|
||||
given to the emit method (which see below). (All operations for a given
|
||||
configuration_name are guaranteed to have the same extended_name().)
|
||||
|
||||
The file also _defines_ the following function in that namespace.
|
||||
|
||||
void initialize_all_sm{min_cc}__{operation_kind}_operations(Manifest& manifest);
|
||||
|
||||
That function calls all of the functions declared in this file.
|
||||
Those functions are defined in subdirectories.
|
||||
The mapping from OperationKind to emitter handles the details
|
||||
of what happens in each of those subdirectories.
|
||||
"""
|
||||
|
||||
def __init__(self, generated_path, min_cc, kind, args):
|
||||
self.generated_path = generated_path
|
||||
self.min_cc = min_cc
|
||||
@ -194,10 +259,17 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
|
||||
|
||||
#
|
||||
def __enter__(self):
|
||||
_LOGGER.debug("*** EmitOperationKindLibrary::__enter__")
|
||||
_LOGGER.debug(f"*** generated_path: {str(self.generated_path)}")
|
||||
_LOGGER.debug(f"*** OperationKindNames[kind]: {OperationKindNames[self.kind]}")
|
||||
_LOGGER.debug(f"*** min_cc: {self.min_cc}")
|
||||
|
||||
self.operation_path = os.path.join(self.generated_path, OperationKindNames[self.kind], str(self.min_cc))
|
||||
_LOGGER.debug(f"*** operation_path (directory to make): {str(self.operation_path)}")
|
||||
os.makedirs(self.operation_path)
|
||||
|
||||
self.top_level_path = os.path.join(self.operation_path, f"all_sm{self.min_cc}_{OperationKindNames[self.kind]}_operations.cu")
|
||||
_LOGGER.debug(f"*** top_level_path (file to write): {str(self.top_level_path)}")
|
||||
|
||||
self.top_level_file = open(self.top_level_path, "w")
|
||||
self.top_level_file.write(self.header_template)
|
||||
@ -216,16 +288,21 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
|
||||
|
||||
#
|
||||
def emit(self, configuration_name, operations):
|
||||
_LOGGER.debug("*** EmitOperationKindLibrary::emit")
|
||||
_LOGGER.debug(f"*** configuration_name: {configuration_name}")
|
||||
|
||||
assert len(operations) > 0
|
||||
|
||||
# The extended name for all operations of a given configuration_name is guaranteed
|
||||
# to be the same because extended_name() is used in defining configuration_name. Thus,
|
||||
# we can safely use the extended_name() of the first operation.
|
||||
extended_name = operations[0].extended_name()
|
||||
_LOGGER.debug('*** extended_name (for all ops): ' + extended_name)
|
||||
|
||||
# Create a directory for operations with this subclass if it does not exist
|
||||
if extended_name not in self.subclass_files:
|
||||
subclass_path = os.path.join(self.operation_path, extended_name)
|
||||
_LOGGER.debug(f"*** subclass_path: {str(subclass_path)}")
|
||||
os.mkdir(subclass_path)
|
||||
|
||||
self.subclass_configurations[extended_name] = []
|
||||
@ -233,16 +310,23 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
|
||||
# Open a new top-level file for this sub class
|
||||
subclass_top_level_path = os.path.join(
|
||||
subclass_path, f"all_sm{self.min_cc}_{extended_name}_{OperationKindNames[self.kind]}_operations.cu")
|
||||
_LOGGER.debug('*** subclass_top_level_path (min_cc, extended_name, ' +
|
||||
'OperationKind): ' + str(subclass_top_level_path))
|
||||
|
||||
self.subclass_files[extended_name] = open(subclass_top_level_path, "w")
|
||||
self.subclass_files[extended_name].write(self.header_template)
|
||||
|
||||
self.source_files[extended_name] = [subclass_top_level_path]
|
||||
|
||||
subclass_dir = os.path.dirname(self.subclass_files[extended_name].name)
|
||||
_LOGGER.debug('*** subclass_dir: ' + str(subclass_dir))
|
||||
|
||||
with self.emitters[self.kind](subclass_dir, configuration_name) as configuration_emitter:
|
||||
for operation in operations:
|
||||
configuration_emitter.emit(operation)
|
||||
|
||||
_LOGGER.debug('*** configuration_emitter.configuration_path: ' +
|
||||
str(configuration_emitter.configuration_path))
|
||||
self.source_files[extended_name].append(configuration_emitter.configuration_path)
|
||||
|
||||
self.subclass_configurations[extended_name].append(configuration_name)
|
||||
@ -250,6 +334,7 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
|
||||
|
||||
#
|
||||
def __exit__(self, exception_type, exception_value, traceback):
|
||||
_LOGGER.debug("*** EmitOperationKindLibrary::__exit__")
|
||||
for subclass_name, subclass_file in sorted(self.subclass_files.items()):
|
||||
subclass_cfg = {
|
||||
'min_cc': str(self.min_cc),
|
||||
@ -290,6 +375,29 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
|
||||
self.top_level_file.close()
|
||||
|
||||
class EmitInterfaceLibrary:
|
||||
"""
|
||||
Emit the topmost-level CUTLASS library initialization code.
|
||||
The code is generated in the generated_path directory
|
||||
(e.g., tools/library/generated in the build directory),
|
||||
in the initialize_all.cpp file.
|
||||
That file declares several functions in namespace cutlass::library.
|
||||
The functions all have this form,
|
||||
|
||||
void initialize_all_{operation_kind}_operations(Manifest& manifest);
|
||||
|
||||
where {operation_kind} abbreviates the "kind" of operation
|
||||
(e.g., gemm for matrix-matrix multiply, conv2d for 2-d convolution,
|
||||
or trmm for triangular solve with multiple right-hand sides).
|
||||
The definitions of these functions live in subdirectories.
|
||||
|
||||
The file also _defines_ the following function in that namespace.
|
||||
|
||||
void initialize_all(Manifest& manifest);
|
||||
|
||||
That function first prepares the manifest, and then
|
||||
calls all of the functions declared in this file.
|
||||
"""
|
||||
|
||||
def __init__(self, generated_path, operation_count, args):
|
||||
self.generated_path = generated_path
|
||||
self.args = args
|
||||
@ -335,7 +443,10 @@ ${fn_calls}
|
||||
|
||||
#
|
||||
def __enter__(self):
|
||||
_LOGGER.debug("*** EmitInterfaceLibrary::__enter__")
|
||||
|
||||
self.top_level_path = os.path.join(self.generated_path, 'initialize_all.cpp')
|
||||
_LOGGER.debug("*** top_level_path: " + str(self.top_level_path))
|
||||
|
||||
self.top_level_file = open(self.top_level_path, "w")
|
||||
self.top_level_file.write(self.top_level_hdr_template)
|
||||
@ -346,6 +457,9 @@ ${fn_calls}
|
||||
|
||||
#
|
||||
def emit(self, operation_name):
|
||||
_LOGGER.debug("*** EmitInterfaceLibrary::emit")
|
||||
_LOGGER.debug("*** operation_name: " + operation_name)
|
||||
|
||||
self.prototypes.append(SubstituteTemplate(
|
||||
"\t\tvoid initialize_all_${operation_kind}_operations(Manifest &manifest);",
|
||||
{'operation_kind': operation_name}))
|
||||
@ -356,6 +470,8 @@ ${fn_calls}
|
||||
|
||||
#
|
||||
def __exit__(self, exception_type, exception_value, traceback):
|
||||
_LOGGER.debug("*** EmitInterfaceLibrary::__exit__")
|
||||
|
||||
self.top_level_file.write(SubstituteTemplate(self.top_level_prologue, {'prototypes':"\n".join(self.prototypes)}))
|
||||
|
||||
# Write out initialize_all method
|
||||
@ -398,8 +514,14 @@ class Manifest:
|
||||
self.kernel_filter = self.args.kernels
|
||||
self.curr_build_dir = args.curr_build_dir
|
||||
|
||||
# A common user error is to use commas instead of semicolons.
if ',' in args.architectures:
  raise RuntimeError("The list of architectures (CMake option CUTLASS_NVCC_ARCHS) must be semicolon-delimited.\nDon't use commas to separate the architectures; use semicolons.\nYou specified the list as: " + args.architectures)
architectures = args.architectures.split(';') if len(args.architectures) else ['50',]
architectures = [x if x != '90a' else '90' for x in architectures]

arch_conditional_cc = ['90a']
architectures = [x if x not in arch_conditional_cc else x.split('a')[0] for x in architectures]

self.compute_capabilities = [int(x) for x in architectures]
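
# Sketch (follows from the lines above, not additional diff content):
#   args.architectures == '80;90a'  ->  architectures == ['80', '90']
#                                   ->  self.compute_capabilities == [80, 90]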
|
||||
|
||||
if args.filter_by_cc in ['false', 'False', '0']:
|
||||
@ -681,8 +803,7 @@ class Manifest:
|
||||
for min_cc, configurations in sorted(ops.items()):
|
||||
with operation_emitters[target](generated_path, min_cc, operation_kind, self.args) as operation_kind_emitter:
|
||||
for configuration_name, operations in configurations.items():
|
||||
_LOGGER.info("Emitting {config} with {num_ops} operations.".format(
|
||||
config = configuration_name, num_ops = len(operations)))
|
||||
_LOGGER.info(f"Emitting {configuration_name} with {len(operations)} operation{'' if len(operations) == 1 else 's'}.")
|
||||
operation_kind_emitter.emit(configuration_name, operations)
|
||||
|
||||
for subclass, files in operation_kind_emitter.source_files.items():
|
||||
|
||||
@@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
  setup(
    name='cutlass_library',
    version='3.4.1',
    version='3.5.0',
    description='CUTLASS library generation scripts',
    packages=['cutlass_library']
  )

@@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
  setup(
    name='pycute',
    version='3.4.1',
    version='3.5.0',
    description='Python implementation of CuTe',
    packages=['pycute'],
  )
|
||||
|
||||