CUTLASS 3.5.0 (#1411)

This commit is contained in:
Vijay Thakkar
2024-03-19 17:51:04 -04:00
committed by GitHub
parent ffa34e7075
commit 629f4653c3
468 changed files with 48730 additions and 7253 deletions


@ -1,12 +1,14 @@
![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# Python packages associated with CUTLASS
This directory contains Python packages that are associated with CUTLASS:
* `cutlass`: the CUTLASS Python interface, which enables one to compile and run CUTLASS kernels from within Python
* `cutlass_library`: utilities used for enumerating and emitting C++ code for CUTLASS kernels
## CUTLASS Python Interface
The CUTLASS Python interface enables one to compile and run CUTLASS operations from within Python.
```python
@ -19,34 +21,46 @@ plan.run(A, B, C, D)
```
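The code block above is truncated by the diff view. As context, here is a minimal end-to-end sketch of the workflow it illustrates; the `cutlass.op.Gemm` plan API and a CUDA-capable GPU are assumed, and the sketch falls back to a NumPy reference when CUTLASS is unavailable.

```python
# Minimal sketch of the GEMM workflow (assumes the nvidia-cutlass package
# and a CUDA-capable GPU; falls back to NumPy when CUTLASS is unavailable).
import numpy as np

M, N, K = 128, 128, 64
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)
D = np.zeros((M, N), dtype=np.float32)

try:
    import cutlass
    # Declare a plan with sensible default configuration, then run it.
    plan = cutlass.op.Gemm(element=np.float32, layout=cutlass.LayoutType.RowMajor)
    plan.run(A, B, C, D)
except Exception:
    # CUTLASS or a GPU is not available here; use the NumPy reference instead.
    D = A @ B
```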
### Overview
The CUTLASS Python interface aims to provide an ease-of-use interface for using CUTLASS via Python. Toward this goal,
the CUTLASS Python interface attempts to:
* Present high-level interfaces for operators that require only a few parameters
* Select sensible default configurations for an operator given the parameters that have been specified
* Enumerate configurations for users that are known to work in a given setting
* Reduce the occurrence of C++ compile-time errors in favor of descriptive Python exceptions
* Make it easy to export CUTLASS kernels to framework extensions (e.g., PyTorch CUDA extensions)
The CUTLASS Python interface prioritizes ease of use.
It has the following features that support this goal.
* It presents high-level interfaces for operators that require only a few parameters.
* It selects sensible default configurations for an operator given the parameters that have been specified.
* It enumerates configurations for users that are known to work in a given setting.
* It favors emitting descriptive Python run-time exceptions instead of C++ compile-time errors, where possible.
* It simplifies exporting CUTLASS kernels to framework extensions (e.g., PyTorch CUDA extensions).
#### Non-goals
The CUTLASS Python interface does not intended to:
The CUTLASS Python interface does not intend to:
**Select optimal kernel configurations.**
As an ease-of-use interface, the default selections for operator parameters made by the CUTLASS Python interface may
not achieve the highest possible performance in all scenarios. Users wishing to achieve the highest performance possible
should consider profiling different combinations of configuration parameters, or use a library such as [cuBLAS](https://developer.nvidia.com/cublas)
that contains heuristics for selecting kernels.
1. select optimal kernel configurations,
2. act as a fast container for CUTLASS kernels, or
3. act as a Python-to-CUDA-kernel just-in-time (JIT) compilation engine.
**Act as a fast container for CUTLASS kernels.**
The CUTLASS Python interface does not strive to minimize overhead in its Python functions surrounding the running of a kernel.
Those wishing to deploy a CUTLASS kernel should consider either using the C++ emitted by the Python interface directly, or using
one of the CUTLASS emitters for automatically creating a framework extension for the kernel (e.g., a PyTorch CUDA extension).
Regarding selection of optimal kernel configurations,
the interface favors ease-of-use over maximum configurability.
Thus, its default selections for operator parameters may
not achieve the highest possible performance in all scenarios. Users wishing to achieve the highest performance possible should either
**Act as a Python-to-CUDA-kernel JIT compilation engine.**
The CUTLASS Python interface intends to enable one to use CUTLASS via Python. It can be used by frameworks for JIT compiling
* select parameters by profiling different combinations of them, or
* use a library such as [cuBLAS](https://developer.nvidia.com/cublas)
that contains heuristics for selecting kernels.
Regarding acting as a fast container for CUTLASS kernels:
the interface does not strive to minimize overhead in its Python functions surrounding the running of a kernel.
Those wishing to deploy a CUTLASS kernel should either
* use the C++ emitted by the Python interface directly, or
* use one of the CUTLASS emitters for automatically creating a framework extension for the kernel (e.g., a PyTorch CUDA extension).
Regarding acting as a Python-to-CUDA-kernel JIT compilation engine:
the interface enables use of CUTLASS in Python code.
It can be used by frameworks for JIT compiling
Python to CUDA kernels, but does not set out to be such a framework.
#### Comparison to PyCUTLASS
The CUTLASS Python interface builds atop CUTLASS's [PyCUTLASS](https://github.com/NVIDIA/cutlass/tree/v3.0.0/tools/library/scripts/pycutlass) library. PyCUTLASS enables
one to declare, compile, and run GEMMs, convolutions, and grouped GEMM operators with nearly the same configuration
space as CUTLASS's C++ interface. While this flexibility enables one to achieve similar levels of functionality
@ -73,17 +87,21 @@ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.08-py3 -p 8888:8888
The CUTLASS Python interface has been tested with CUDA 11.8, 12.0, and 12.1 on Python 3.8 and 3.9.
#### Optional environment variables
Prior to installing the CUTLASS Python interface, one may optionally set the following environment variables:
* `CUTLASS_PATH`: the path to the cloned CUTLASS repository
* `CUDA_INSTALL_PATH`: the path to the installation of CUDA
If these environment variables are not set, the installation process will infer them to be the following:
* `CUTLASS_PATH`: either one directory level above the current directory (i.e., `$(pwd)/..`) if installed locally or in the `source` directory of the location in which `cutlass_library` was installed
* `CUDA_INSTALL_PATH`: the directory holding `/bin/nvcc` for the first version of `nvcc` on `$PATH` (i.e., `which nvcc | awk -F'/bin/nvcc' '{print $1}'`)
**NOTE:** The version of `cuda-python` installed must match the CUDA version in `CUDA_INSTALL_PATH`.
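The inference rule for `CUDA_INSTALL_PATH` amounts to stripping the trailing `/bin/nvcc` from the first `nvcc` on `$PATH`. A small sketch of that rule (the function name and paths are illustrative, not part of the package):

```python
# Sketch of the documented inference rule for CUDA_INSTALL_PATH.
# Mirrors `which nvcc | awk -F'/bin/nvcc' '{print $1}'`.
import shutil

def infer_cuda_install_path(nvcc_path):
    # Strip the trailing /bin/nvcc from the path to nvcc, if one was found.
    if nvcc_path is None:
        return None
    return nvcc_path.split('/bin/nvcc')[0]

print(infer_cuda_install_path("/usr/local/cuda/bin/nvcc"))  # /usr/local/cuda
print(infer_cuda_install_path(shutil.which("nvcc")))
```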
#### Installation
Stable releases of the CUTLASS Python interface are available via the `nvidia-cutlass` PyPI package. Any other packages with the name `cutlass` are not affiliated with NVIDIA CUTLASS.
```bash
pip install nvidia-cutlass
@ -94,7 +112,7 @@ The CUTLASS Python interface can also be installed from source by navigating to
pip install .
```
If you would like to be able to make changes to CUTLASS Python interface and have them reflected when using the interface, perform:
If you would like to be able to make changes to the CUTLASS Python interface and have them reflected when using the interface, perform:
```bash
pip install -e .
```
@ -118,6 +136,7 @@ Currently, the following operations can be exported to a PyTorch CUDA extension:
* Conv2d
### Examples
Jupyter notebook examples of using the CUTLASS Python interface are located in [examples/python](/examples/python).
To launch these notebooks from this directory, run:
@ -126,9 +145,10 @@ jupyter-lab ../examples/python
```
### Building documentation
The CUTLASS Python interface uses [Sphinx](https://www.sphinx-doc.org/en/master/) for documentation.
Building the documentation requires additional packages. These can be installed via:
Building the documentation requires additional packages. The following commands will install them.
```bash
sudo apt-get install pandoc
pip install --upgrade Sphinx furo pandoc myst-parser sphinx-copybutton nbsphinx nbsphinx-link sphinx-inline-tabs
@ -137,7 +157,7 @@ pip install --upgrade Sphinx furo pandoc myst-parser sphinx-copybutton nbsphinx
To build documentation, you must first have installed the CUTLASS Python interface via the
[installation instructions](#installation).
Documentation can then be built via the following commands:
Documentation can then be built via the following commands.
```bash
sphinx-apidoc -o docs_src/source/ cutlass/ cutlass/backend*
cd docs_src
@ -146,6 +166,7 @@ mv _build/* ../docs
```
## CUTLASS library package
[cutlass_library](/python/cutlass_library) contains utilities for enumerating and emitting CUTLASS C++ kernels.
It is used by the CUTLASS CMake system to construct a library of kernels that can be profiled using the CUTLASS profiler.


@ -121,7 +121,7 @@ def get_option_registry():
this._option_registry = OptionRegistry(device_cc())
return this._option_registry
this.__version__ = '3.4.1'
this.__version__ = '3.5.0'
from cutlass.backend import create_memory_pool
from cutlass.emit.pytorch import pytorch


@ -244,7 +244,7 @@ def get_gemm_arguments_3x(mainloop_arguments, epilogue_functor, scheduler_args,
class _HardwareInfo(ctypes.Structure):
_fields_ = [
("device_id", ctypes.c_int),
("sm_count", ctypes.c_int)
("sm_count", ctypes.c_int),
]
class _GemmArguments(ctypes.Structure):


@ -122,7 +122,7 @@ class LinearCombination(EpilogueFunctorBase):
:param element_output: data type used to load and store tensors
:param epilogue_vector_length: number of elements computed per operation.
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
when there are not enough data to store
:param element_accumulator: Accumulator data type
@ -207,7 +207,7 @@ class LinearCombinationClamp(LinearCombination):
:param element_output: data type used to load and store tensors
:param epilogue_vector_length: number of elements computed per operation.
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
when there are not enough data to store
:param element_accumulator: Accumulator data type
@ -260,7 +260,7 @@ class FastLinearCombinationClamp(EpilogueFunctorBase):
:param element_output: data type used to load and store tensors
:param epilogue_vector_length: number of elements computed per operation.
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
when there are not enough data to store
"""
@ -310,7 +310,7 @@ class LinearCombinationGeneric(LinearCombination):
:param element_output: data type used to load and store tensors
:param epilogue_vector_length: number of elements computed per operation.
Usually it is 128/sizeof_bits<ElementOutput_>, but we use 64 and 32 sometimes
Usually it is 128/sizeof_bits_v<ElementOutput_>, but we use 64 and 32 sometimes
when there are not enough data to store
:param element_accumulator: Accumulator data type


@ -299,7 +299,7 @@ class Sm90ColumnReductionImpl(ColumnReductionImpl):
self._type_decl = f"""
using {self.name_camel} = cutlass::epilogue::fusion::Sm90ColReduction<
{op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0,
{op_tag(self.reg_reduce_fn)}, {op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0,
typename EpilogueDescriptor::TileShape, {DataTypeTag[self.element]},
{DataTypeTag[self.element_compute]}, {FloatRoundStyleTag[self.round_style]},
{self.stride_mnl}
@ -321,7 +321,7 @@ class Sm90RowReductionImpl(RowReductionImpl):
self._type_decl = f"""
using {self.name_camel} = cutlass::epilogue::fusion::Sm90RowReduction<
{op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0 /* Stages */,
{op_tag(self.reg_reduce_fn)}, {op_tag(self.reg_reduce_fn)}, {op_tag(self.gmem_reduce_fn)}, 0 /* Stages */,
typename EpilogueDescriptor::TileShape, {DataTypeTag[self.element]},
{DataTypeTag[self.element_compute]}, {FloatRoundStyleTag[self.round_style]},
{self.stride_mnl}


@ -565,7 +565,9 @@ class GemmArguments3x(GemmArguments2x):
)
# Set hardware info
hw_info_ = hw_info(0, device_sm_count())
hw_info_ = hw_info(
0, device_sm_count(),
)
self.arguments = argument_type(
int(self.gemm_mode),
@ -1300,7 +1302,7 @@ using DeviceKernel = cutlass::gemm::device::GemmUniversalAdapter<${operation_nam
# Support built-in epilogue functors or user-defined functions
if operation.tile_description.stages is None or operation.tile_description.stages == 0:
stage_count_type = "cutlass::gemm::collective::StageCountAutoCarveout<sizeof(typename CollectiveEpilogue::SharedStorage)>"
stage_count_type = "cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>"
else:
stage_count_type = "_" + str(operation.tile_description.stages)


@ -35,16 +35,22 @@ Utilities for emitting Conv2d kernels
"""
import enum
import logging
import os.path
import shutil
from string import Template
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
from cutlass_library.conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
except ImportError:
from library import *
from conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
_LOGGER = logging.getLogger(__name__)
###################################################################################################
@ -174,6 +180,8 @@ class Conv2dOperation:
class EmitConv2dInstance:
def __init__(self):
# Emitter for CUTLASS 3 convolution operations
self.conv3x_emitter = EmitConv3xInstance()
self.template = """
// Conv2d${conv_kind_name} ${iterator_algorithm_name} kernel instance "${operation_name}"
using ${operation_name}_base =
@ -277,7 +285,18 @@ class EmitConv2dInstance:
>::Kernel;
"""
def arch_number_to_type(self, arch: int):
return f"cutlass::arch::Sm{arch}"
def emit(self, operation):
_LOGGER.debug("*** EmitConv2dInstance::emit")
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
if hasattr(operation, 'is_3x') and operation.is_3x:
_LOGGER.debug("*** CUTLASS 3 operation")
return self.conv3x_emitter.emit(operation)
_LOGGER.debug("*** CUTLASS 2 operation")
warp_shape = [int(operation.tile_description.threadblock_shape[idx] / operation.tile_description.warp_count[idx]) for idx in range(3)]
@ -320,9 +339,11 @@ class EmitConv2dInstance:
}
if operation.group_mode == GroupMode.NoneGroup:
_LOGGER.debug("*** group_mode=NoneGroup")
return SubstituteTemplate(self.template, values)
elif operation.group_mode == GroupMode.Depthwise:
_LOGGER.debug("*** group_mode=Depthwise")
values['group_mode'] = GroupModeTag[operation.group_mode]
# Setup other template params
values['threadblock_output_shape_n'] = str(operation.tile_description.threadblock_output_shape[0])
@ -343,6 +364,7 @@ class EmitConv2dInstance:
return SubstituteTemplate(self.template_depthwise_direct_conv, values)
else:
_LOGGER.debug("*** group_mode=" + GroupModeTag[operation.group_mode])
values['group_mode'] = GroupModeTag[operation.group_mode]
return SubstituteTemplate(self.template_group_conv, values)
@ -354,6 +376,7 @@ class EmitConv2dInstance:
#
def GenerateConv2dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
_LOGGER.debug("*** GenerateConv2dTensorOp")
for tile in tile_descriptions:
for conv_kind in [ConvKind.Fprop, ConvKind.Dgrad, ConvKind.Wgrad]:
@ -372,6 +395,24 @@ def GenerateConv2dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
manifest.append(Conv2dOperation(conv_kind, min_cc, tile, A, B, C, tile.math_instruction.element_accumulator))
class EmitConv2dIncludes:
'''Emit includes that are specific to the operation.'''
def __init__(self):
self.includes = ['conv2d_operation.h']
self.emitter_3x = EmitConv3xIncludes()
def operation_is_3x(self, operation) -> bool:
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
return hasattr(operation, 'is_3x') and operation.is_3x
def emit(self, operation) -> str:
if self.operation_is_3x(operation):
return self.emitter_3x.emit(operation)
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"
###################################################################################################
#
# Emitter functions for all targets
@ -384,17 +425,8 @@ class EmitConv2dConfigurationLibrary:
self.configuration_path = os.path.join(operation_path, "%s.cu" % configuration_name)
self.instance_emitter = EmitConv2dInstance()
self.includes_emitter = EmitConv2dIncludes()
self.instance_template = """
${operation_instance}
// Derived class
struct ${operation_name} :
public ${operation_name}_base { };
///////////////////////////////////////////////////////////////////////////////////////////////////
"""
self.header_template = """
/*
Generated by conv2d_operation.py - Do not edit.
@ -407,9 +439,17 @@ struct ${operation_name} :
#include "cutlass/library/manifest.h"
#include "library_internal.h"
#include "conv2d_operation.h"
"""
self.instance_template = """
${stub_begin}
${operation_instance}
// Derived class
struct ${operation_name} :
public ${operation_name}_base { };
${stub_end}
///////////////////////////////////////////////////////////////////////////////////////////////////
"""
self.configuration_header = """
@ -419,32 +459,22 @@ namespace library {
// Initialize all instances
void initialize_${configuration_name}(Manifest &manifest) {
"""
self.configuration_instance = """
using Operation_${operation_name} = cutlass::conv::device::ImplicitGemmConvolution<
self.configuration_instance = """${stub_begin}
using Operation_${operation_name} = cutlass::conv::device::${kernel_name}<
${operation_name}>;
manifest.append(new cutlass::library::Conv2dOperation<
Operation_${operation_name}>(
"${operation_name}"));
manifest.append(new cutlass::library::${operation_wrapper}<
Operation_${operation_name}
>(
"${operation_name}"
));
${stub_end}
"""
self.configuration_direct_conv_instance = """
using Operation_${operation_name} = cutlass::conv::device::DirectConvolution<
${operation_name}>;
self.configuration_epilogue = "}\n"
manifest.append(new cutlass::library::DirectConv2dOperation<
Operation_${operation_name}>(
"${operation_name}"));
"""
self.configuration_epilogue = """
}
"""
self.epilogue_template = """
///////////////////////////////////////////////////////////////////////////////////////////////////
@ -456,42 +486,131 @@ void initialize_${configuration_name}(Manifest &manifest) {
"""
#
def operation_is_3x(self, operation):
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
return hasattr(operation, 'is_3x') and operation.is_3x
def __enter__(self):
"""
Open the configuration_file, and write the "header" C++ code to it.
The "header" consists of a comment (that this is generated code,
so it should not be edited), and includes that are common
to all kinds of kernels.
"""
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::__enter__')
_LOGGER.debug('*** configuration_path (file to write): ' +
str(self.configuration_path))
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
self.configuration_file = open(self.configuration_path, "w")
self.configuration_file.write(SubstituteTemplate(self.header_template, {
'configuration_name': self.configuration_name
}))
self.operations = []
return self
#
def emit(self, operation):
"""
Write three pieces of C++ code to the configuration_file
(that was opened by the __enter__ method above):
1. the header includes that are specific to the operation
(CUTLASS 2 vs. CUTLASS 3);
2. the "operation instance" (a "using" declaration ending in "_base"); and
3. the "operation name" (declaration and definition of a derived class
of the above operation instance).
The "using" declaration turns a C++ class name, possibly namespace-qualified,
possibly also with angle brackets, into a C-style, easily demangled identifier.
"""
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::emit')
_LOGGER.debug('*** operation.procedural_name(): ' + operation.procedural_name())
self.operations.append(operation)
self.configuration_file.write(SubstituteTemplate(self.instance_template, {
self.configuration_file.write(self.includes_emitter.emit(operation))
stub_begin = ''
stub_end = ''
# It can be useful to stub (comment) out instantiations for testing.
# In this case, one need only set is_stub to True.
is_stub = False
if is_stub:
stub_begin = "// STUB for now\n#if 0"
stub_end = '#endif // 0'
self.configuration_file.write(Template(self.instance_template).substitute({
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name(),
'operation_instance': self.instance_emitter.emit(operation)
'operation_instance': self.instance_emitter.emit(operation),
'stub_begin': stub_begin,
'stub_end': stub_end
}))
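The stubbing mechanism above wraps an emitted instance in `#if 0`/`#endif // 0` so it can be commented out for testing while still passing through the generator. A standalone sketch of that substitution, using a simplified version of `instance_template` (the real template also emits the derived-class definition):

```python
from string import Template

# Simplified form of the instance template used above; the real one also
# declares the derived ${operation_name} struct.
instance_template = Template("""${stub_begin}
${operation_instance}
${stub_end}""")

def emit_instance(operation_instance, is_stub=False):
    # When stubbing, wrap the instantiation in #if 0 ... #endif so the
    # generated .cu file still compiles without the instance.
    stub_begin = "// STUB for now\n#if 0" if is_stub else ""
    stub_end = "#endif // 0" if is_stub else ""
    return instance_template.substitute(
        operation_instance=operation_instance,
        stub_begin=stub_begin,
        stub_end=stub_end,
    )

print(emit_instance("using Kernel_base = int;", is_stub=True))
```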
#
def __exit__(self, exception_type, exception_value, traceback):
"""
Write the rest of the C++ code to the configuration_file, and close the file.
The "rest of the C++ code" has the following components.
1. Configuration header: Open the namespace(s), and open the definition
of the "initialize_${configuration_name}" registration function
that registers the operation with the Manifest.
("Registration" helps turn C++ compile-time polymorphism
(via template parameters) into a run-time choice of parameters.)
2. Configuration instance: In the body of the registration function,
make a "using" declaration Operation_${operation_name} for the
operation type (which uses operation_name as its template argument).
Then, tell the manifest about the operation via a "manifest.append" call.
The argument of the call is a new instance of
"SomethingOperation<Operation_${operation_name}>"
(replace Something with a specific name).
3. Configuration epilogue: Close the definition of the registration function.
4. Epilogue template: Close the namespace(s).
"""
_LOGGER.debug('*** EmitConv2dConfigurationLibrary::__exit__')
_LOGGER.debug('*** configuration_path (file to write): ' +
str(self.configuration_path))
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
self.configuration_file.write(SubstituteTemplate(self.configuration_header, {
'configuration_name': self.configuration_name
}))
for operation in self.operations:
stub_begin = ''
stub_end = ''
# It can be useful to stub (comment) out instantiations for testing.
# In this case, one need only set is_stub to True.
is_stub = False
if is_stub:
stub_begin = "// STUB for now\n#if 0"
stub_end = "#endif // 0"
if operation.group_mode == GroupMode.Depthwise:
self.configuration_file.write(SubstituteTemplate(self.configuration_direct_conv_instance, {
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name()
}))
kernel_name = 'DirectConvolution'
operation_wrapper = 'DirectConv2dOperation'
else:
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name()
}))
kernel_name = 'ImplicitGemmConvolution'
operation_wrapper = 'Conv2dOperation'
if self.operation_is_3x(operation):
kernel_name = 'ConvUniversalAdapter'
operation_wrapper = 'ConvOperation3x'
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name(),
'kernel_name': kernel_name,
'operation_wrapper': operation_wrapper,
'stub_begin': stub_begin,
'stub_end': stub_end
}))
self.configuration_file.write(self.configuration_epilogue)
self.configuration_file.write(self.epilogue_template)
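The wrapper selection in `__exit__` can be summarized in isolation: depthwise operations use `DirectConvolution`/`DirectConv2dOperation`, CUTLASS 3 operations use `ConvUniversalAdapter`/`ConvOperation3x`, and everything else uses `ImplicitGemmConvolution`/`Conv2dOperation`. The sketch below mirrors that dispatch with a simplified form of `configuration_instance` (helper names are illustrative):

```python
from string import Template

# Simplified form of the configuration_instance template above.
configuration_instance = Template("""
using Operation_${operation_name} = cutlass::conv::device::${kernel_name}<
  ${operation_name}>;
manifest.append(new cutlass::library::${operation_wrapper}<
  Operation_${operation_name}
>(
  "${operation_name}"
));
""")

def registration_code(operation_name, is_depthwise, is_3x):
    # Mirror of the dispatch in __exit__: depthwise picks DirectConvolution,
    # and a CUTLASS 3 operation overrides either choice.
    if is_depthwise:
        kernel_name, operation_wrapper = 'DirectConvolution', 'DirectConv2dOperation'
    else:
        kernel_name, operation_wrapper = 'ImplicitGemmConvolution', 'Conv2dOperation'
    if is_3x:
        kernel_name, operation_wrapper = 'ConvUniversalAdapter', 'ConvOperation3x'
    return configuration_instance.substitute(
        operation_name=operation_name,
        kernel_name=kernel_name,
        operation_wrapper=operation_wrapper,
    )

print(registration_code('example_kernel', is_depthwise=False, is_3x=True))
```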


@ -35,16 +35,22 @@ Utilities for emitting Conv3d kernels
"""
import enum
import logging
import os.path
import shutil
from string import Template
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
from cutlass_library.conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
except ImportError:
from library import *
from conv3x_emitter import EmitConv3xInstance, EmitConv3xIncludes
_LOGGER = logging.getLogger(__name__)
###################################################################################################
@ -148,6 +154,8 @@ class Conv3dOperation:
class EmitConv3dInstance:
def __init__(self):
# Emitter for CUTLASS 3 convolution operations
self.conv3x_emitter = EmitConv3xInstance()
self.template = """
// Conv3d${conv_kind_name} ${iterator_algorithm_name} kernel instance "${operation_name}"
using ${operation_name}_base =
@ -178,8 +186,15 @@ class EmitConv3dInstance:
>::Kernel;
"""
def emit(self, operation):
_LOGGER.debug("*** EmitConv3dInstance::emit")
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
if hasattr(operation, 'is_3x') and operation.is_3x:
_LOGGER.debug("*** CUTLASS 3 operation")
return self.conv3x_emitter.emit(operation)
_LOGGER.debug("*** CUTLASS 2 operation")
warp_shape = [int(operation.tile_description.threadblock_shape[idx] / operation.tile_description.warp_count[idx]) for idx in range(3)]
@ -245,6 +260,24 @@ def GenerateConv3dTensorOp(manifest, tile_descriptions, min_cc, align = 128):
manifest.append(Conv3dOperation(conv_kind, min_cc, tile, A, B, C, tile.math_instruction.element_accumulator))
class EmitConv3dIncludes:
'''Emit includes that are specific to the operation.'''
def __init__(self):
self.includes = ['conv3d_operation.h']
self.emitter_3x = EmitConv3xIncludes()
def operation_is_3x(self, operation) -> bool:
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
return hasattr(operation, 'is_3x') and operation.is_3x
def emit(self, operation) -> str:
if self.operation_is_3x(operation):
return self.emitter_3x.emit(operation)
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"
###################################################################################################
#
# Emitter functions for all targets
@ -257,17 +290,8 @@ class EmitConv3dConfigurationLibrary:
self.configuration_path = os.path.join(operation_path, "%s.cu" % configuration_name)
self.instance_emitter = EmitConv3dInstance()
self.includes_emitter = EmitConv3dIncludes()
self.instance_template = """
${operation_instance}
// Derived class
struct ${operation_name} :
public ${operation_name}_base { };
///////////////////////////////////////////////////////////////////////////////////////////////////
"""
self.header_template = """
/*
Generated by conv3d_operation.py - Do not edit.
@ -280,9 +304,17 @@ struct ${operation_name} :
#include "cutlass/library/manifest.h"
#include "library_internal.h"
#include "conv3d_operation.h"
"""
self.instance_template = """
${stub_begin}
${operation_instance}
// Derived class
struct ${operation_name} :
public ${operation_name}_base { };
${stub_end}
///////////////////////////////////////////////////////////////////////////////////////////////////
"""
self.configuration_header = """
@ -292,22 +324,22 @@ namespace library {
// Initialize all instances
void initialize_${configuration_name}(Manifest &manifest) {
"""
self.configuration_instance = """
using Operation_${operation_name} = cutlass::conv::device::ImplicitGemmConvolution<
self.configuration_instance = """${stub_begin}
using Operation_${operation_name} = cutlass::conv::device::${kernel_name}<
${operation_name}>;
manifest.append(new cutlass::library::Conv3dOperation<
Operation_${operation_name}>(
"${operation_name}"));
manifest.append(new cutlass::library::${operation_wrapper}<
Operation_${operation_name}
>(
"${operation_name}"
));
${stub_end}
"""
self.configuration_epilogue = """
}
"""
self.configuration_epilogue = "}\n"
self.epilogue_template = """
///////////////////////////////////////////////////////////////////////////////////////////////////
@ -319,35 +351,126 @@ void initialize_${configuration_name}(Manifest &manifest) {
"""
#
def operation_is_3x(self, operation):
"""Whether operation is a CUTLASS 3 convolution (as opposed to CUTLASS 2)"""
return hasattr(operation, 'is_3x') and operation.is_3x
def __enter__(self):
"""
Open the configuration_file, and write the "header" C++ code to it.
The "header" consists of a comment (that this is generated code,
so it should not be edited), and includes that are common
to both the CUTLASS 2 and the CUTLASS 3 cases.
"""
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::__enter__')
_LOGGER.debug('*** configuration_path (file to write): ' +
str(self.configuration_path))
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
self.configuration_file = open(self.configuration_path, "w")
self.configuration_file.write(SubstituteTemplate(self.header_template, {
'configuration_name': self.configuration_name
}))
self.operations = []
return self
#
def emit(self, operation):
"""
Write three pieces of C++ code to the configuration_file
(that was opened by the __enter__ method above):
1. the header includes that are specific to the operation
(CUTLASS 2 vs. CUTLASS 3);
2. the "operation instance" (a "using" declaration ending in "_base"); and
3. the "operation name" (declaration and definition of a derived class
of the above operation instance).
The "using" declaration turns a C++ class name, possibly namespace-qualified,
possibly also with angle brackets, into a C-style, easily demangled identifier.
"""
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::emit')
_LOGGER.debug('*** operation.procedural_name(): ' + operation.procedural_name())
self.operations.append(operation)
self.configuration_file.write(SubstituteTemplate(self.instance_template, {
self.configuration_file.write(self.includes_emitter.emit(operation))
stub_begin = ''
stub_end = ''
# It can be useful to stub (comment) out instantiations for testing.
# In this case, one need only set is_stub to True.
is_stub = False
if is_stub:
stub_begin = "// STUB for now\n#if 0"
stub_end = '#endif // 0'
self.configuration_file.write(Template(self.instance_template).substitute({
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name(),
'operation_instance': self.instance_emitter.emit(operation)
'operation_instance': self.instance_emitter.emit(operation),
'stub_begin': stub_begin,
'stub_end': stub_end
}))
#
def __exit__(self, exception_type, exception_value, traceback):
"""
Write the rest of the C++ code to the configuration_file, and close the file.
The "rest of the C++ code" has the following components.
1. Configuration header: Open the namespace(s), and open the definition
of the "initialize_${configuration_name}" registration function
that registers the operation with the Manifest.
("Registration" helps turn C++ compile-time polymorphism
(via template parameters) into a run-time choice of parameters.)
2. Configuration instance: In the body of the registration function,
make a "using" declaration Operation_${operation_name} for the
operation type (which uses operation_name as its template argument).
Then, tell the manifest about the operation via a "manifest.append" call.
The argument of the call is a new instance of
"SomethingOperation<Operation_${operation_name}>"
(replace Something with a specific name).
3. Configuration epilogue: Close the definition of the registration function.
4. Epilogue template: Close the namespace(s).
"""
_LOGGER.debug('*** EmitConv3dConfigurationLibrary::__exit__')
_LOGGER.debug('*** configuration_path (file to write): ' +
str(self.configuration_path))
_LOGGER.debug('*** configuration_name: ' + self.configuration_name)
self.configuration_file.write(SubstituteTemplate(self.configuration_header, {
'configuration_name': self.configuration_name
}))
for operation in self.operations:
stub_begin = ''
stub_end = ''
# It can be useful to stub (comment) out instantiations for testing.
# In this case, one need only set is_stub to True.
is_stub = False
if is_stub:
stub_begin = "// STUB for now\n#if 0"
stub_end = "#endif // 0"
kernel_name = 'ImplicitGemmConvolution'
operation_wrapper = 'Conv3dOperation'
if self.operation_is_3x(operation):
kernel_name = 'ConvUniversalAdapter'
operation_wrapper = 'ConvOperation3x'
self.configuration_file.write(SubstituteTemplate(self.configuration_instance, {
'configuration_name': self.configuration_name,
'operation_name': operation.procedural_name(),
'kernel_name': kernel_name,
'operation_wrapper': operation_wrapper,
'stub_begin': stub_begin,
'stub_end': stub_end
}))
self.configuration_file.write(self.configuration_epilogue)
@ -357,4 +480,3 @@ void initialize_${configuration_name}(Manifest &manifest) {
###################################################################################################
###################################################################################################


@ -0,0 +1,220 @@
#################################################################################################
#
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
"""
Utilities for emitting CUTLASS >= 3 convolution kernels
"""
import enum
import os.path
import shutil
import logging
from string import Template
try:
import builtins
if hasattr(builtins, "CUTLASS_IGNORE_PACKAGE") and CUTLASS_IGNORE_PACKAGE == True:
raise ImportError("Disabling attempt to import cutlass_library")
from cutlass_library.library import *
except ImportError:
from library import *
_LOGGER = logging.getLogger(__name__)
###################################################################################################
#
# Emits single instances of a CUTLASS device-wide operator
#
###################################################################################################
class EmitConv3xInstance:
def __init__(self):
_LOGGER.debug("*** EmitConv3xInstance::__init__")
# Define epilogue type first, so that the mainloop type
# can use it with StageCountAutoCarveout.
self.template = """
// CUTLASS >= 3 convolution ${conv_kind_name} kernel instance "${operation_name}"
using ${operation_name}_epilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
${arch},
${opcode_class_epi},
${tile_shape}, // tile shape
${cluster_shape}, // cluster shape
${epi_tile_mn},
${element_accumulator},
${element_compute},
${element_c}, ${layout_c}, 128 / cute::sizeof_bits_v<${element_c}>,
${element_d}, ${layout_d}, 128 / cute::sizeof_bits_v<${element_d}>,
${epilogue_schedule}
// , class FusionOpOrCallbacks = cutlass::epilogue::fusion::LinearCombination<ElementD,ElementCompute>
>::CollectiveOp;
using ${operation_name}_mainloop =
typename cutlass::conv::collective::CollectiveBuilder<
${arch},
${opcode_class_main},
${conv_kind}, // kFprop, kDgrad, or kWgrad
${element_a}, ${layout_a}, 128 / cute::sizeof_bits_v<${element_a}>,
${element_b}, ${layout_b}, 128 / cute::sizeof_bits_v<${element_b}>,
${element_accumulator},
${tile_shape}, // tile shape
${cluster_shape}, // cluster shape
${stages},
${kernel_schedule}
>::CollectiveOp;
// Unit tests call this "ConvKernel".
// Conv operator ${operation_name}
using ${operation_name}_base = cutlass::conv::kernel::ConvUniversal<
${operation_name}_mainloop,
${operation_name}_epilogue,
${tile_scheduler}
>;
"""
def arch_number_to_type(self, arch: int) -> str:
return f"cutlass::arch::Sm{arch}"
def tile_shape(self, operation) -> str:
# For all three kinds of convolutions, the tile shape's K mode
# differs from GEMM in that it needs to be wrapped in a Shape.
# For Wgrad convolutions specifically,
# the N tile shape also needs to be wrapped in a Shape.
m_template = 'cute::_${tile_shape_m}'
if operation.conv_kind == ConvKind.Wgrad:
n_template = 'cute::Shape<cute::_${tile_shape_n}>'
else:
n_template = 'cute::_${tile_shape_n}'
k_template = 'cute::Shape<cute::_${tile_shape_k}>'
tile_shape_template = f'cute::Shape<{m_template}, {n_template}, {k_template}>'
values = {
'tile_shape_m': operation.tile_description.tile_shape[0],
'tile_shape_n': operation.tile_description.tile_shape[1],
'tile_shape_k': operation.tile_description.tile_shape[2]
}
return Template(tile_shape_template).substitute(values)
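As an illustration (a standalone sketch, not part of the emitter), the Wgrad branch above produces a tile-shape string with the N and K modes each wrapped in a `cute::Shape`:

```python
from string import Template

# Standalone sketch of the Wgrad branch of tile_shape above:
# the N and K tile-shape modes are each wrapped in a cute::Shape.
m_template = 'cute::_${tile_shape_m}'
n_template = 'cute::Shape<cute::_${tile_shape_n}>'  # Wgrad wraps N
k_template = 'cute::Shape<cute::_${tile_shape_k}>'  # every conv kind wraps K
tile_shape_template = f'cute::Shape<{m_template}, {n_template}, {k_template}>'
result = Template(tile_shape_template).substitute({
    'tile_shape_m': 128, 'tile_shape_n': 64, 'tile_shape_k': 32})
print(result)
# cute::Shape<cute::_128, cute::Shape<cute::_64>, cute::Shape<cute::_32>>
```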
def cluster_shape(self, operation) -> str:
m_template = 'cute::_${cluster_shape_m}'
n_template = 'cute::_${cluster_shape_n}'
k_template = 'cute::_${cluster_shape_k}'
cluster_shape_template = f'cute::Shape<{m_template}, {n_template}, {k_template}>'
values = {
'cluster_shape_m': operation.tile_description.cluster_shape[0],
'cluster_shape_n': operation.tile_description.cluster_shape[1],
'cluster_shape_k': operation.tile_description.cluster_shape[2],
}
return Template(cluster_shape_template).substitute(values)
def stage_count(self, operation) -> str:
# stages == 0 tells builder to pick the number of stages automatically
namespace_prefix = 'cutlass::conv::collective::'
if operation.tile_description.stages > 0:
return f"{namespace_prefix}StageCount<{str(operation.tile_description.stages)}>"
else:
return f"{namespace_prefix}StageCountAutoCarveout<sizeof(typename {operation.procedural_name()}_epilogue::SharedStorage)>"
def emit(self, operation) -> str:
_LOGGER.debug("*** EmitConv3xInstance::emit")
_LOGGER.debug("*** operation: procedural_name()=" + operation.procedural_name())
# Identify the operation as CUTLASS 3 by its is_3x field
if (not hasattr(operation, 'is_3x')) or (not operation.is_3x):
raise RuntimeError("operation must be a CUTLASS 3 operation")
epi_tile_mn = "cutlass::epilogue::collective::EpilogueTileAuto"
opcode_class_main = OpcodeClassTag[operation.tile_description.math_instruction.opcode_class]
opcode_class_epi = opcode_class_main
tile_shape = operation.tile_description.tile_shape
warp_count = operation.tile_description.warp_count
epilogue_schedule = EpilogueScheduleTag[operation.epilogue_schedule]
# KernelScheduleTag and TileSchedulerTag both hard-code the
# namespace qualification of KernelScheduleAuto as
# "cutlass::gemm::collective::" (unless the tag is 'void').
#
# For TileSchedulerTag, this namespace is fine, since CUTLASS 3
# convolutions use the same tile schedulers (from the same
# cutlass::gemm::collective namespace) as GEMMs.
kernel_schedule = KernelScheduleTag[operation.kernel_schedule].replace('gemm::', 'conv::')
tile_scheduler = TileSchedulerTag[operation.tile_scheduler]
opcode_class = OpcodeClassTag[operation.tile_description.math_instruction.opcode_class]
values = {
'operation_name': operation.procedural_name(),
'conv_kind': ConvKindTag[operation.conv_kind],
'conv_kind_name': ConvKindNames[operation.conv_kind].capitalize(),
'element_a': DataTypeTag[operation.A.element],
'layout_a': LayoutTag[operation.A.layout],
'align_a': int(operation.A.alignment),
'element_b': DataTypeTag[operation.B.element],
'layout_b': LayoutTag[operation.B.layout],
'align_b': int(operation.B.alignment),
'element_c': DataTypeTag[operation.C.element],
'layout_c': LayoutTag[operation.C.layout],
'align_c': int(operation.C.alignment),
'element_d': DataTypeTag[operation.D.element],
'layout_d': LayoutTag[operation.D.layout],
'align_d': int(operation.D.alignment),
'element_accumulator': DataTypeTag[operation.accumulator_type()],
'opcode_class': opcode_class,
'arch': self.arch_number_to_type(operation.arch),
'tile_shape': self.tile_shape(operation),
'cluster_shape': self.cluster_shape(operation),
'opcode_class_epi': opcode_class_epi,
'opcode_class_main': opcode_class_main,
'epi_tile_mn': epi_tile_mn,
'stages': self.stage_count(operation),
'kernel_schedule': kernel_schedule,
'epilogue_schedule': epilogue_schedule,
'tile_scheduler': tile_scheduler,
'element_compute': DataTypeTag[operation.element_compute]
}
return Template(self.template).substitute(values)
class EmitConv3xIncludes:
def __init__(self):
_LOGGER.debug("*** EmitConv3xIncludes::__init__")
self.includes = ['conv_operation_3x.hpp',
'cutlass/conv/device/conv_universal_adapter.hpp',
'cutlass/conv/kernel/conv_universal.hpp',
'cutlass/conv/collective/collective_builder.hpp',
'cutlass/epilogue/collective/collective_builder.hpp']
def emit(self, operation) -> str:
_LOGGER.debug("*** EmitConv3xIncludes::emit")
return '\n'.join(f"#include \"{incl}\"" for incl in self.includes) + \
"\n\n///////////////////////////////////////////////////////////////////////////////////////////////////"


@ -37,6 +37,7 @@ Utilities for emitting GEMM kernels
import collections
import enum
import functools
import logging
import operator
import os.path
import shutil
@ -49,6 +50,8 @@ try:
except ImportError:
from library import *
_LOGGER = logging.getLogger(__name__)
###################################################################################################
#
# Data structure modeling a GEMM operation
@ -139,7 +142,8 @@ class GemmOperation:
math_operations_map = {
MathOperation.xor_popc: 'xor',
MathOperation.and_popc: 'and',
MathOperation.multiply_add_fast_accum: 'fastaccum',
}
tensor_ops = [
@ -256,18 +260,14 @@ class GemmOperation:
''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
if self.arch >= 90:
kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}{ct}{cs}_{l}_{s}_align{al}{t}{k}{e}"
return kernel_name_template.format(
p = self.prefix,
ar = self.arch,
op = opcode_class_name,
ex = self.extended_name_3x(),
ct = '_' + 'x'.join([str(i) for i in self.tile_description.tile_shape]) if self.tile_description.tile_shape[0] > 0 else "",
cs = '_' + 'x'.join([str(i) for i in self.tile_description.cluster_shape]),
l = self.tile_description.stages,
s = self.layout_name_3x(),
al = str(max(self.A.alignment, self.B.alignment)),
@ -725,8 +725,8 @@ class EmitGemmUniversal3xInstance:
using ${operation_name}_epilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
${arch}, ${opcode_class_epi},
cute::Shape<cute::_${tile_shape_epi_m}, cute::_${tile_shape_epi_n}, cute::_${tile_shape_epi_k}>,
cute::Shape<${cluster_shape_m}, ${cluster_shape_n}, ${cluster_shape_k}>,
${epi_tile_mn},
${element_accumulator}, ${element_epilogue},
${element_c}, ${layout_c}, ${align_c},
@ -741,8 +741,8 @@ using ${operation_name}_mainloop =
${element_a}, ${layout_a}, ${align_a},
${element_b}, ${layout_b}, ${align_b},
${element_accumulator},
cute::Shape<cute::_${tile_shape_main_m}, cute::_${tile_shape_main_n}, cute::_${tile_shape_main_k}>,
cute::Shape<${cluster_shape_m}, ${cluster_shape_n}, ${cluster_shape_k}>,
${stages},
${kernel_schedule}
>::CollectiveOp;
@ -773,19 +773,33 @@ ${compile_guard_end}
#
def emit(self, operation):
_LOGGER.debug("*** EmitGemmConfigurationLibrary::emit(operation)")
_LOGGER.debug("*** operation.procedural_name(): " + operation.procedural_name())
_LOGGER.debug("*** tile_shape: " + str(operation.tile_description.tile_shape))
_LOGGER.debug("*** warp_count: " + str(operation.tile_description.warp_count))
opcode_class_main = operation.tile_description.math_instruction.opcode_class
opcode_class_epi = opcode_class_main
tile_shape = operation.tile_description.tile_shape
warp_count = operation.tile_description.warp_count
instruction_shape = operation.tile_description.math_instruction.instruction_shape
cluster_m = operation.tile_description.cluster_shape[0]
cluster_n = operation.tile_description.cluster_shape[1]
tile_shape_main_m, tile_shape_main_n, tile_shape_main_k = tile_shape
tile_shape_epi_m, tile_shape_epi_n, tile_shape_epi_k = tile_shape
# account for static/dynamic cluster shapes
cta_m = tile_shape[0] // cluster_m if cluster_m > 0 else tile_shape[0]
cta_n = tile_shape[1] // cluster_n if cluster_n > 0 else tile_shape[1]
# stage count set to zero indicates builder automatic stage selection
if operation.tile_description.stages > 0:
stage_count_string = f"cutlass::gemm::collective::StageCount<{str(operation.tile_description.stages)}>"
else:
stage_count_string = f"cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename {str(operation.procedural_name())}_epilogue::SharedStorage))>"
epi_tile_mn = "cutlass::epilogue::collective::EpilogueTileAuto"
instance_layout_A, instance_layout_B, instance_layout_C , instance_layout_D = \
(operation.A.layout, operation.B.layout, operation.C.layout, operation.D.layout)
@ -806,9 +820,6 @@ ${compile_guard_end}
element_a = DataTypeTag[operation.A.element]
element_b = DataTypeTag[operation.B.element]
epilogue_schedule_type = EpilogueScheduleTag[operation.epilogue_schedule]
values = {
'operation_name': operation.procedural_name(),
'operation_suffix': self.operation_suffix,
@ -824,18 +835,18 @@ ${compile_guard_end}
'opcode_class_main': OpcodeClassTag[opcode_class_main],
'opcode_class_epi': OpcodeClassTag[opcode_class_epi],
'arch': "cutlass::arch::Sm%d" % operation.arch,
'tile_shape_epi_m': str(tile_shape_epi_m),
'tile_shape_epi_n': str(tile_shape_epi_n),
'tile_shape_epi_k': str(tile_shape_epi_k),
'tile_shape_main_m': str(tile_shape_main_m),
'tile_shape_main_n': str(tile_shape_main_n),
'tile_shape_main_k': str(tile_shape_main_k),
'cluster_shape_m': 'cute::_' + str(operation.tile_description.cluster_shape[0]) if operation.tile_description.cluster_shape[0] > 0 else "int",
'cluster_shape_n': 'cute::_' + str(operation.tile_description.cluster_shape[1]) if operation.tile_description.cluster_shape[1] > 0 else "int",
'cluster_shape_k': 'cute::_' + str(operation.tile_description.cluster_shape[2]) if operation.tile_description.cluster_shape[2] > 0 else "int",
'instruction_shape_m': str(instruction_shape[0]),
'instruction_shape_n': str(instruction_shape[1]),
'instruction_shape_k': str(instruction_shape[2]),
'kernel_schedule' : str(KernelScheduleTag[operation.kernel_schedule]),
'epilogue_schedule' : str(epilogue_schedule_type),
'epi_tile_mn' : epi_tile_mn,
@ -1227,6 +1238,10 @@ void initialize_${configuration_name}(Manifest &manifest) {
"""
def __enter__(self):
_LOGGER.debug("*** EmitGemmConfigurationLibrary::__enter__")
_LOGGER.debug("*** configuration_path (file to write): " +
str(self.configuration_path))
self.configuration_file = open(self.configuration_path, "w")
self.configuration_file.write(self.header_template)
self.configuration_file.write(self.separator)
@ -1248,6 +1263,9 @@ void initialize_${configuration_name}(Manifest &manifest) {
return self
def emit(self, operation):
_LOGGER.debug("*** EmitGemmConfigurationLibrary::emit(operation)")
_LOGGER.debug("*** operation.gemm_kind: " + str(operation.gemm_kind))
emitter = self.instance_emitter[operation.gemm_kind]()
for incl in emitter.includes:
@ -1293,4 +1311,3 @@ void initialize_${configuration_name}(Manifest &manifest) {
###################################################################################################
###################################################################################################


@ -40,9 +40,22 @@ from itertools import product
import logging
import os.path
import shutil
import sys
import copy
from typing import Any, Optional, Sequence, Tuple
_LOGGER = logging.getLogger(__name__)
def logging_prefix(indent_level: int = 0) -> str:
"""String prefix for start of each debug log entry"""
prefix = '*** '
indent = ' '
return f"{prefix}{indent_level * indent}"
def log_debug_line(line: str, indent_level: int = 0) -> None:
"""Log one line of debug output"""
prefix = logging_prefix(indent_level)
_LOGGER.debug(prefix + line)
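The indentation scheme of these helpers can be seen in isolation (a self-contained sketch of the functions above):

```python
# Sketch of logging_prefix above: each indent level adds two spaces
# after the '*** ' marker, so nested generation phases line up in the log.
def logging_prefix(indent_level: int = 0) -> str:
    prefix = '*** '
    indent = '  '
    return f"{prefix}{indent_level * indent}"

assert logging_prefix(0) == '*** '
assert logging_prefix(3) == '*** ' + '  ' * 3
```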
# Certain use cases of cutlass_library nearly always prefer to run as scripts with
# relative imports, rather than via an installed Python package. An example of this
@ -792,6 +805,359 @@ def CreateDepthwiseConv2dOperator(manifest, layout, tile_descriptions, data_type
return operations
class ConvOperation3x:
"""All parameters of a CUTLASS 3 convolution operation.
Unlike CUTLASS 2 convolutions, CUTLASS 3 convolutions do not
distinguish between 2-D and 3-D convolutions by kernel class name.
Instead, for CUTLASS 3 convolutions, the tensor layouts encode
whether the convolution is 2-D or 3-D. Thus, this class deduces
the OperationKind (either Conv2d or Conv3d) from the layouts,
rather than taking it as a constructor parameter.
"""
def __init__(self,
conv_kind: ConvKind,
tile_description: TileDescription,
A: TensorDescription,
B: TensorDescription,
C: TensorDescription,
element_compute: Optional[DataType] = None,
D: Optional[TensorDescription] = None,
kernel_schedule: KernelScheduleType = KernelScheduleType.ScheduleAuto,
epilogue_schedule: EpilogueScheduleType = EpilogueScheduleType.ScheduleAuto,
tile_scheduler: TileSchedulerType = TileSchedulerType.Default,
log_indent_level: int = 1):
log_debug_line(f'ConvOperation3x::init: conv_kind: {conv_kind}', log_indent_level)
log_indent_level = log_indent_level + 1
self.conv_kind = conv_kind
self.tile_description = tile_description
self.A = A
self.B = B
self.C = C
self.element_compute = C.element if element_compute is None else element_compute
self.kernel_schedule = kernel_schedule
self.epilogue_schedule = epilogue_schedule
self.arch = tile_description.minimum_compute_capability
self.tile_scheduler = tile_scheduler
if D is None:
self.D = C
else:
self.D = D
self.is_3x = True
self.group_mode = GroupMode.NoneGroup # CUTLASS 3 convolutions currently aren't grouped
operation_kind = None
for layout in (A.layout, B.layout, C.layout):
assert(isinstance(layout, LayoutType))
new_operation_kind = convolution_tensor_layout_type_to_operation_kind(layout)
if operation_kind is None:
operation_kind = new_operation_kind
else: # CUTLASS 3 convolutions don't permit mixing 2-D and 3-D layouts.
assert(operation_kind == new_operation_kind)
assert(operation_kind is not None)
self.operation_kind = operation_kind
def __str__(self):
return f"ConvOperation3x: operation_kind={self.operation_kind}, conv_kind={self.conv_kind}, tile_description={self.tile_description}"
def is_complex(self):
complex_operators = [
MathOperation.multiply_add_complex,
MathOperation.multiply_add_complex_gaussian,
MathOperation.multiply_add_complex_fast_f32
]
return self.tile_description.math_instruction.math_operation in complex_operators
def is_mixed_input(self):
return self.A.element != self.B.element
def accumulator_type(self):
accum = self.tile_description.math_instruction.element_accumulator
if self.is_complex():
return get_complex_from_real(accum)
return accum
def short_math_name(self):
prefix = ''
if self.tile_description.math_instruction.math_operation == MathOperation.multiply_add_complex_gaussian:
prefix = 'g'
return prefix + ShortDataTypeNames[self.accumulator_type()]
def is_tensor_op(self):
tensor_ops = [
OpcodeClass.TensorOp,
OpcodeClass.WmmaTensorOp
]
return self.tile_description.math_instruction.opcode_class in tensor_ops
def instruction_shape_string(self):
math_operations_map = {
MathOperation.xor_popc: 'xor',
MathOperation.and_popc: 'and'
}
if self.is_tensor_op():
is0, is1, is2 = self.tile_description.math_instruction.instruction_shape
math_op = self.tile_description.math_instruction.math_operation
math_op_string = math_operations_map[math_op] if math_op in math_operations_map else ''
return f"{is0}x{is1}x{is2}{math_op_string}"
else:
return ''
def intermediate_type_string(self):
'''
Name of the distinct intermediate type used by the tensor operation,
or the empty string if none.
Tensor ops (opcode_class *TensorOp) may use an intermediate data type
that differs from the element type of A or the accumulator type.
'''
if not self.is_tensor_op():
return ''
elif self.tile_description.math_instruction.element_a == self.A.element:
return ''
elif self.tile_description.math_instruction.element_a == self.tile_description.math_instruction.element_accumulator:
return ''
else:
return DataTypeNames[self.tile_description.math_instruction.element_a]
def core_name(self):
inst_shape = self.instruction_shape_string()
intermediate_type = self.intermediate_type_string()
conv_kind_name = ConvKindNames[self.conv_kind]
return f"{self.short_math_name()}{inst_shape}{intermediate_type}{conv_kind_name}"
def layout_names(self):
'''Layout strings for A and B, respectively'''
if self.is_complex():
return (ShortComplexLayoutNames[(self.A.layout, self.A.complex_transform)],
ShortComplexLayoutNames[(self.B.layout, self.B.complex_transform)])
else:
return (ShortLayoutTypeNames[self.A.layout],
ShortLayoutTypeNames[self.B.layout])
def extended_name(self):
core_name = self.core_name()
element_a = DataTypeNames[self.A.element]
element_b = DataTypeNames[self.B.element]
element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator]
element_c = DataTypeNames[self.C.element]
element_d = DataTypeNames[self.D.element]
layout_a, layout_b = self.layout_names()
return f"{core_name}_{element_a}{layout_a}_{element_b}{layout_b}_{element_acc}_{element_c}_{element_d}"
def configuration_name(self):
prefix = 'cutlass3x'
opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
tbm = self.tile_description.tile_shape[0]
tbn = self.tile_description.tile_shape[1]
tbk = self.tile_description.tile_shape[2]
cm = self.tile_description.cluster_shape[0]
cn = self.tile_description.cluster_shape[1]
ck = self.tile_description.cluster_shape[2]
alignment = max(self.A.alignment, self.B.alignment)
tile_scheduler = TileSchedulerSuffixes[self.tile_scheduler]
kernel_schedule = KernelScheduleSuffixes[self.kernel_schedule]
epilogue_schedule = EpilogueScheduleSuffixes[self.epilogue_schedule]
return f"{prefix}_{opcode_class_name}_{self.extended_name()}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{self.tile_description.stages}_align{alignment}{tile_scheduler}{kernel_schedule}{epilogue_schedule}"
def procedural_name(self):
return self.configuration_name()
def convolution_tensor_layout_type_to_operation_kind(layout: LayoutType) -> OperationKind:
if layout == LayoutType.TensorNHWC or layout == LayoutType.TensorKCSR:
return OperationKind.Conv2d
elif layout == LayoutType.TensorNDHWC or layout == LayoutType.TensorKCSRT:
return OperationKind.Conv3d
else:
raise RuntimeError(f'LayoutType {layout} does not have a corresponding OperationKind')
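The layout-to-dimensionality deduction can be sketched as a self-contained example (the enums below are stand-ins for the real cutlass_library types, defined here only so the sketch runs on its own):

```python
import enum

class LayoutType(enum.Enum):    # stand-in for the cutlass_library enum
    TensorNHWC = enum.auto()    # 2-D activation/output layout
    TensorKCSR = enum.auto()    # 2-D Wgrad output layout
    TensorNDHWC = enum.auto()   # 3-D activation/output layout
    TensorKCSRT = enum.auto()   # 3-D Wgrad output layout

class OperationKind(enum.Enum): # stand-in for the cutlass_library enum
    Conv2d = enum.auto()
    Conv3d = enum.auto()

def to_operation_kind(layout: LayoutType) -> OperationKind:
    # Mirrors convolution_tensor_layout_type_to_operation_kind above:
    # the layout alone determines whether the convolution is 2-D or 3-D.
    if layout in (LayoutType.TensorNHWC, LayoutType.TensorKCSR):
        return OperationKind.Conv2d
    if layout in (LayoutType.TensorNDHWC, LayoutType.TensorKCSRT):
        return OperationKind.Conv3d
    raise RuntimeError(f'LayoutType {layout} has no corresponding OperationKind')

print(to_operation_kind(LayoutType.TensorNDHWC))  # OperationKind.Conv3d
```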
def CreateConvOperator3x(manifest: Manifest,
dims_and_alignments: Sequence[Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]],
tile_descriptions: Sequence[Sequence[TileDescription]],
data_types,
schedule_pairs: Sequence[Tuple[KernelScheduleType, KernelScheduleType]] = \
[(KernelScheduleType.ScheduleAuto, EpilogueScheduleType.ScheduleAuto)],
complex_transforms: Optional[Sequence[ComplexTransform]] = None,
tile_schedulers: Sequence[TileSchedulerType] = [TileSchedulerType.Persistent],
conv_kind: ConvKind = ConvKind.Fprop,
log_indent_level: int = 1):
"""
Create zero or more CUTLASS 3 convolution operators:
one operator for each feasible combination of the input parameters.
Add the operators to the manifest.
dims_and_alignments: 3-level list. Each outer list term is a list [A, B, C].
Each inner list (A, B, or C) has the form [num_spatial_dimensions, alignment].
Both are integers; the first is the number of spatial dimensions
(currently, only 2 or 3 are supported), and the second is the byte alignment.
We deduce the operation_kind (either OperationKind.Conv2d or OperationKind.Conv3d)
from num_spatial_dimensions.
This function doesn't take layouts, unlike the GEMM functions.
CUTLASS 3 convolutions currently support three input layouts:
* TensorNWC for 1-D convolutions,
* TensorNHWC for 2-D convolutions, and
* TensorNDHWC for 3-D convolutions.
Output (C and D) layouts are the same as input layouts,
except for Wgrad convolutions, where the layouts are
* TensorKCS for 1-D convolutions,
* TensorKCSR for 2-D convolutions, and
* TensorKCSRT for 3-D convolutions.
The output layouts are completely constrained by the input layouts
and the convolution kind.
tile_descriptions: 2-level list.
Outer level has one list per math instruction.
Inner level has one TileDescription for each cluster shape.
data_types: Either a single data_type dictionary, or a list of them.
Keys: 'a_type', 'b_type', 'c_type', 'd_type', 'acc_type', 'epi_type'
complex_transforms: Optional list of pairs.
First element of each pair is the complex transform for A, and
second element of each pair is the complex transform for B.
schedule_pairs: [(kernel_schedule, epilogue_schedule), ...]
conv_kind: Convolution kind (Fprop, Dgrad, or Wgrad).
"""
log_debug_line('CreateConvOperator3x', log_indent_level)
log_indent_level = log_indent_level + 1
log_debug_line(f'conv_kind: {conv_kind}', log_indent_level)
for triple in dims_and_alignments:
spatial_dimensionality = None # to be determined by loop below
assert(len(triple) == 3)
for entry in triple: # [A, B, C]
assert(len(entry) == 2)
[dim, alignment] = entry
assert(type(dim) is int)
assert(dim == 2 or dim == 3)
assert(type(alignment) is int)
assert(alignment > 0)
if spatial_dimensionality is None:
spatial_dimensionality = dim
else:
# A, B, and C need to have the same spatial dimensionality
assert(spatial_dimensionality == dim)
def input_and_output_layouts(spatial_dim: int, kind: ConvKind) -> Tuple[LayoutType, LayoutType]:
if spatial_dim == 1:
input_layout = LayoutType.TensorNWC
if kind == ConvKind.Wgrad:
output_layout = LayoutType.TensorKCS
else:
output_layout = input_layout
elif spatial_dim == 2:
input_layout = LayoutType.TensorNHWC
if kind == ConvKind.Wgrad:
output_layout = LayoutType.TensorKCSR
else:
output_layout = input_layout
elif spatial_dim == 3:
input_layout = LayoutType.TensorNDHWC
if kind == ConvKind.Wgrad:
output_layout = LayoutType.TensorKCSRT
else:
output_layout = input_layout
else:
raise RuntimeError(f'Unsupported spatial dimensionality: {spatial_dim}')
return (input_layout, output_layout)
def dims_to_layouts(A_B_C: Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]) -> \
Tuple[Tuple[LayoutType, int], Tuple[LayoutType, int], Tuple[LayoutType, int]]:
[A, B, C] = A_B_C
[spatial_dim, alignment] = A
[input_layout, output_layout] = input_and_output_layouts(spatial_dim, conv_kind)
return ((input_layout, A[1]),
(input_layout, B[1]),
(output_layout, C[1]))
# layouts: list of triples (A, B, C).
# Each of A, B, and C has the form [layout, alignment].
layouts = [dims_to_layouts(A_B_C) for A_B_C in dims_and_alignments]
if type(data_types) is dict:
data_types = [data_types]
for s in schedule_pairs:
assert(len(s) == 2)
if complex_transforms is None:
complex_transforms = [(ComplexTransform.none, ComplexTransform.none)]
# product produces a one-pass generator, so the loop must call it anew each time.
def make_combinations():
return product(
layouts,
tile_descriptions,
data_types,
complex_transforms,
schedule_pairs,
tile_schedulers
)
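The comment above about `product` being one-pass can be demonstrated in isolation:

```python
from itertools import product

# itertools.product returns a one-shot iterator: once exhausted, a second
# pass yields nothing. Wrapping the call in a function, as
# make_combinations does above, gives every loop a fresh generator.
combos = product([1, 2], ['a', 'b'])
first_pass = list(combos)
second_pass = list(combos)  # already exhausted
print(first_pass)   # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
print(second_pass)  # []
```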
operations = []
for layout_triple, tile_description, data_type, complex_transform_pair, schedule_pair, tile_scheduler in make_combinations():
A_layout, A_alignment = layout_triple[0]
A_xform = complex_transform_pair[0]
B_layout, B_alignment = layout_triple[1]
B_xform = complex_transform_pair[1]
C_layout, C_alignment = layout_triple[2]
D_layout = C_layout
D_alignment = C_alignment
A = TensorDescription(data_type["a_type"], A_layout, A_alignment, A_xform)
B = TensorDescription(data_type["b_type"], B_layout, B_alignment, B_xform)
C = TensorDescription(data_type["c_type"], C_layout, C_alignment)
D = TensorDescription(data_type["d_type"], D_layout, D_alignment)
element_compute = data_type.get("epi_type", data_type["acc_type"])
kernel_schedule, epilogue_schedule = schedule_pair
operation = ConvOperation3x(conv_kind=conv_kind,
tile_description=tile_description,
A=A,
B=B,
C=C,
element_compute=element_compute,
D=D,
kernel_schedule=kernel_schedule,
epilogue_schedule=epilogue_schedule,
tile_scheduler=tile_scheduler,
log_indent_level=log_indent_level)
log_debug_line(f'Created ConvOperation3x: {str(operation)}', log_indent_level)
manifest.append(operation)
operations.append(operation)
return operations
###################################################################################################
###################################################################################################
@ -2233,8 +2599,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
min_cc = 80
max_cc = 1024
# For mixed input, the alignment constraints are a list of lists, where the
# inner list contains the alignment constraints for the operands/matrices
# [[alignA, alignB, alignC],..]
alignment_constraints = [[16, 8, 8],]
@ -2277,7 +2643,7 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_a(manifest, cuda_version):
]
operations += CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
for op in operations:
if (DataTypeSize[op.C.element] == 16) and \
@ -2320,8 +2686,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
min_cc = 80
max_cc = 1024
# For mixed-input, alignment constraints are a list of lists, where the
# inner list contains the alignment constraints for operands/matrices
# [[alignA, alignB, alignC],..]
alignment_constraints = [[8, 16, 8],]
@ -2346,8 +2712,8 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
TileDescription([128, 16, 32], 5, [2, 1, 1], math_inst, min_cc, max_cc),
TileDescription([128, 16, 32], 3, [2, 1, 1], math_inst, min_cc, max_cc),
# 256x16
TileDescription([256, 16, 32], 5, [2, 1, 1], math_inst, min_cc, max_cc),
TileDescription([256, 16, 32], 3, [2, 1, 1], math_inst, min_cc, max_cc),
]
data_type = [
@ -2372,7 +2738,7 @@ def GenerateSM80_TensorOp_16816_mixed_input_upcast_b(manifest, cuda_version):
]
operations = CreateGemmOperator(manifest, layouts, tile_descriptions, \
data_type_mixed, alignment_constraints, None, EpilogueFunctor.LinearCombination, SwizzlingFunctor.Identity8)
for op in operations:
if op.tile_description.threadblock_shape[1] <= 32:
@ -4326,6 +4692,241 @@ def GenerateSM80(manifest, cuda_version):
###################################################################################################
def GenerateSM89_TensorOp_16832_fp8(manifest, cuda_version):
if (
not CudaToolkitVersionSatisfies(cuda_version, 12, 4)
):
return
layouts = [
(LayoutType.RowMajor, LayoutType.ColumnMajor, LayoutType.ColumnMajor)
]
math_instructions = [
MathInstruction(
[16, 8, 32],
DataType.e4m3, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 32],
DataType.e4m3, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 32],
DataType.e5m2, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 32],
DataType.e5m2, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 32],
DataType.e4m3, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 32],
DataType.e4m3, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 32],
DataType.e5m2, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 32],
DataType.e5m2, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
]
min_cc = 89
max_cc = 89
alignment_constraints = [16,]
alignment_constraints_small_channels = [16, 8, 4]
for math_inst in math_instructions:
tile_descriptions = [
TileDescription([256, 128, 64], 3, [4, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 256, 64], 3, [2, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 128, 64], 6, [4, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 256, 64], 6, [2, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 64, 64], 3, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 256, 64], 3, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 64, 64], 4, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 256, 64], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 32, 64], 4, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 32, 256, 64], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 64], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 64], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 64], 5, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 64, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 128, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 32, 64], 6, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 32, 128, 64], 6, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 64], 6, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 64], 10, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([256, 128, 128], 3, [4, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 256, 128], 3, [2, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 64, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([256, 32, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 32, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 128], 5, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 64, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 64, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 128, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 32, 128], 4, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 32, 128, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 128], 5, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 128], 6, [2, 2, 1], math_inst, min_cc, max_cc),
]
data_types = [
[
math_inst.element_a,
math_inst.element_b,
DataType.f32,
math_inst.element_accumulator
],
]
operations = []
for data_type in data_types:
operations += CreateGemmOperator(manifest, layouts, tile_descriptions, data_type,
alignment_constraints, None, EpilogueFunctor.LinearCombination)
conv_layout = (LayoutType.TensorNHWC, LayoutType.TensorNHWC, LayoutType.TensorNHWC)
operations += CreateConv2dOperator(manifest, conv_layout, tile_descriptions,
data_type, alignment_constraints, [ConvKind.Fprop], EpilogueFunctor.LinearCombination)
operations += CreateConv2dFixedChannelsOperator(manifest, conv_layout, tile_descriptions,
data_type, alignment_constraints_small_channels, [ConvKind.Fprop], EpilogueFunctor.LinearCombination)
for op in operations:
if op.tile_description.threadblock_shape[1] >= 128:
if op.tile_description.threadblock_shape[0] == 32:
op.C.alignment = 8
else:
op.C.alignment = 16
else:
op.C.alignment = 8
#
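The C-alignment fixup at the end of `GenerateSM89_TensorOp_16832_fp8` can be read as a small pure function of the threadblock tile shape. A hypothetical restatement for illustration, not part of the generator:

```python
def c_alignment(threadblock_shape):
    # Mirrors the epilogue-alignment rule applied above: wide-N tiles
    # get 16-element C alignment unless M is the narrow 32 extent.
    m, n = threadblock_shape[0], threadblock_shape[1]
    if n >= 128:
        return 8 if m == 32 else 16
    return 8

print(c_alignment([128, 256, 64]))  # wide N, M != 32 -> 16
print(c_alignment([32, 256, 64]))   # wide N, M == 32 -> 8
print(c_alignment([128, 64, 64]))   # narrow N -> 8
```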
def GenerateSM89_SparseTensorOp_16864_fp8(manifest, cuda_version):
if (
not CudaToolkitVersionSatisfies(cuda_version, 12, 4)
):
return
layouts = [
(LayoutType.RowMajor, LayoutType.ColumnMajor, LayoutType.RowMajor)
]
math_instructions = [
MathInstruction(
[16, 8, 64],
DataType.e4m3, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 64],
DataType.e4m3, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 64],
DataType.e5m2, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 64],
DataType.e5m2, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add),
MathInstruction(
[16, 8, 64],
DataType.e4m3, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 64],
DataType.e4m3, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 64],
DataType.e5m2, DataType.e4m3, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
MathInstruction(
[16, 8, 64],
DataType.e5m2, DataType.e5m2, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add_fast_accum),
]
min_cc = 89
max_cc = 89
alignment_constraints = [16,]
for math_inst in math_instructions:
tile_descriptions = [
TileDescription([128, 64, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([256, 128, 128], 3, [4, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 256, 128], 3, [2, 4, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 128], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([256, 64, 128], 3, [4, 1, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 256, 128], 4, [1, 4, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 128, 128], 6, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 128], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 128, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([128, 64, 256], 4, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 128, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
TileDescription([ 64, 64, 256], 3, [2, 2, 1], math_inst, min_cc, max_cc),
]
data_types = [
[
math_inst.element_a,
math_inst.element_b,
DataType.f32,
math_inst.element_accumulator
],
]
operations = []
for data_type in data_types:
operations += CreateSparseGemmOperator(manifest, layouts, tile_descriptions, data_type,
alignment_constraints, None, EpilogueFunctor.LinearCombination)
for op in operations:
if op.tile_description.threadblock_shape[1] >= 128:
op.C.alignment = 16
else:
op.C.alignment = 8
###################################################################################################
#
def GenerateSM89(manifest, cuda_version):
GenerateSM89_TensorOp_16832_fp8(manifest, cuda_version)
GenerateSM89_SparseTensorOp_16864_fp8(manifest, cuda_version)
###################################################################################################
#
def GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version):
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
@ -4790,7 +5391,7 @@ def GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version):
DataType.tf32, DataType.tf32, DataType.f32,
OpcodeClass.TensorOp,
MathOperation.multiply_add)
min_cc = 90
max_cc = 90
@ -4798,7 +5399,7 @@ def GenerateSM90_TensorOp_tf32_WGMMA_alignx_gemm(manifest, cuda_version):
TileDescription([math_inst.instruction_shape[0]*2, math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
]
tile_descriptions_small = [
# TileDescription([math_inst.instruction_shape[0], math_inst.instruction_shape[1], math_inst.instruction_shape[2]*4],
# 0, [4, 1, 1], math_inst, min_cc, max_cc, [1,1,1])
@ -5395,7 +5996,7 @@ def GenerateSM90_TensorOp_fp8_WGMMA_alignx_gemm(manifest, cuda_version):
]
stream_k_schedules = []
for data_type in data_types:
# With No-SMEM epilogues
CreateGemmUniversal3xOperator(manifest, layouts, tile_descriptions, data_type, schedules)
@ -6013,7 +6614,102 @@ def GenerateSM90_TensorOp_1684_symm_complex_gaussian(manifest, cuda_version):
###################################################################################################
#
def GenerateSM90_Conv3x(manifest, cuda_version,
log_indent_level: int = 0):
"""
Generate CUTLASS 3 convolution kernel(s) for SM90.
This is meant to be called from GenerateSM90.
"""
log_debug_line('GenerateSM90_Conv3x', log_indent_level)
log_indent_level = log_indent_level + 1
if not CudaToolkitVersionSatisfies(cuda_version, 12, 0):
return
minimum_compute_capability = 90
maximum_compute_capability = 90
spatial_dims = [2, 3]
def make_dims_and_alignments_triple(dim: int):
byte_alignment_required_by_tma = 16
return ((dim, byte_alignment_required_by_tma), # A
(dim, byte_alignment_required_by_tma), # B
(dim, byte_alignment_required_by_tma)) # C
dims_and_alignments = [make_dims_and_alignments_triple(dim) for dim in spatial_dims]
def make_math_instruction(data_types: Tuple[DataType, DataType, DataType],
instruction_shape: Tuple[int, int, int]) -> MathInstruction:
default_opcode = OpcodeClass.TensorOp
default_math_op = MathOperation.multiply_add
[A_data_type, B_data_type, C_data_type] = data_types
return MathInstruction(
instruction_shape,
A_data_type, B_data_type, C_data_type,
default_opcode,
default_math_op
)
data_types_and_instruction_shapes = [
((DataType.f16, DataType.f16, DataType.f16), (64, 64, 16)),
((DataType.f16, DataType.f16, DataType.f32), (64, 64, 16)),
((DataType.bf16, DataType.bf16, DataType.f32), (64, 64, 16)),
]
math_instructions = map(lambda x: make_math_instruction(*x),
data_types_and_instruction_shapes)
cluster_shapes = [
[2, 1, 1],
[1, 1, 1],
]
conv_kinds = [
ConvKind.Fprop,
ConvKind.Dgrad
]
mainloop_schedule = KernelScheduleType.ImplicitTmaWarpSpecializedSm90
stages = 0 # zero means "deduce the number of stages automatically"
# tile_descriptions is a 2-level list.
# Each inner list is for each cluster shape.
for math_inst in math_instructions:
tile_descriptions = []
for cluster_shape in cluster_shapes:
tile_shape = [
math_inst.instruction_shape[0],
math_inst.instruction_shape[1],
math_inst.instruction_shape[2] * 4
]
warp_count = [4, 1, 1]
tile_description = TileDescription(
tile_shape, stages, warp_count, math_inst,
minimum_compute_capability, maximum_compute_capability,
cluster_shape)
tile_descriptions.append(tile_description)
# It's typical to get the data types from the math instruction.
data_type = {
"a_type" : math_inst.element_a,
"b_type" : math_inst.element_b,
"c_type" : math_inst.element_accumulator,
"d_type" : math_inst.element_accumulator,
"acc_type" : math_inst.element_accumulator,
"epi_type" : math_inst.element_accumulator
}
for conv_kind in conv_kinds:
epilogue_schedule = EpilogueScheduleType.TmaWarpSpecialized
schedule_pairs = [
(mainloop_schedule, epilogue_schedule)
]
CreateConvOperator3x(manifest,
dims_and_alignments = dims_and_alignments,
tile_descriptions = tile_descriptions,
data_types = data_type,
schedule_pairs = schedule_pairs,
tile_schedulers = [TileSchedulerType.Default], # -> void
conv_kind = conv_kind,
log_indent_level = log_indent_level)
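The tile shapes built in the loop above follow a simple rule: the CTA tile reuses the instruction's M and N extents and multiplies K by 4. A standalone sketch of that derivation (the function name is hypothetical):

```python
def make_tile_shape(instruction_shape):
    # The CTA tile keeps the WGMMA instruction's M and N, and scales
    # K by 4 so each mainloop stage covers four instruction K-steps.
    m, n, k = instruction_shape
    return [m, n, k * 4]

print(make_tile_shape((64, 64, 16)))  # -> [64, 64, 64]
```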
def GenerateSM90(manifest, cuda_version):
GenerateSM90_TensorOp_16b_WGMMA_gemm(manifest, cuda_version)
GenerateSM90_TensorOp_16b_WGMMA_alignx_gemm(manifest, cuda_version)
@ -6035,6 +6731,7 @@ def GenerateSM90(manifest, cuda_version):
GenerateSM90_TensorOp_1684_symm(manifest, cuda_version)
GenerateSM90_TensorOp_1684_symm_complex(manifest, cuda_version)
GenerateSM90_TensorOp_1684_symm_complex_gaussian(manifest, cuda_version)
GenerateSM90_Conv3x(manifest, cuda_version)
###################################################################################################
@ -6094,6 +6791,7 @@ if __name__ == "__main__":
GenerateSM70(manifest, args.cuda_version)
GenerateSM75(manifest, args.cuda_version)
GenerateSM80(manifest, args.cuda_version)
GenerateSM89(manifest, args.cuda_version)
GenerateSM90(manifest, args.cuda_version)
if 'library' in args.generator_target.split(','):
manifest.emit(GeneratorTarget.Library)


@ -39,12 +39,12 @@ import re
# The following block implements enum.auto() for Python 3.5 variants that don't include it such
# as the default 3.5.2 on Ubuntu 16.04.
#
# https://codereview.stackexchange.com/questions/177309/reimplementing-pythons-enum-auto-for-compatibility
try:
from enum import auto as enum_auto
except ImportError:
__cutlass_library_auto_enum = 0
def enum_auto() -> int:
global __cutlass_library_auto_enum
@ -298,10 +298,11 @@ class MathOperation(enum.Enum):
multiply_add_complex_fast_f32 = enum_auto()
multiply_add_complex = enum_auto()
multiply_add_complex_gaussian = enum_auto()
multiply_add_fast_accum = enum_auto()
#
MathOperationTag = {
MathOperation.multiply_add: 'cutlass::arch::OpMultiplyAdd',
MathOperation.multiply_add_saturate: 'cutlass::arch::OpMultiplyAddSaturate',
MathOperation.multiply_add_mixed_input_upcast: 'cutlass::arch::OpMultiplyAddMixedInputUpcast',
MathOperation.xor_popc: 'cutlass::arch::OpXorPopc',
@ -312,6 +313,7 @@ MathOperationTag = {
MathOperation.multiply_add_complex_fast_f32: 'cutlass::arch::OpMultiplyAddComplexFastF32',
MathOperation.multiply_add_complex: 'cutlass::arch::OpMultiplyAddComplex',
MathOperation.multiply_add_complex_gaussian: 'cutlass::arch::OpMultiplyAddGaussianComplex',
MathOperation.multiply_add_fast_accum: 'cutlass::arch::OpMultiplyAddFastAccum',
}
###################################################################################################
@ -326,6 +328,7 @@ class LayoutType(enum.Enum):
RowMajorInterleaved32 = enum_auto()
ColumnMajorInterleaved64 = enum_auto()
RowMajorInterleaved64 = enum_auto()
TensorNWC = enum_auto()
TensorNHWC = enum_auto()
TensorNDHWC = enum_auto()
TensorNCHW = enum_auto()
@ -334,6 +337,9 @@ class LayoutType(enum.Enum):
TensorNC64HW64 = enum_auto()
TensorC32RSK32 = enum_auto()
TensorC64RSK64 = enum_auto()
TensorKCS = enum_auto()
TensorKCSR = enum_auto()
TensorKCSRT = enum_auto()
#
LayoutTag = {
@ -345,6 +351,7 @@ LayoutTag = {
LayoutType.RowMajorInterleaved32: 'cutlass::layout::RowMajorInterleaved<32>',
LayoutType.ColumnMajorInterleaved64: 'cutlass::layout::ColumnMajorInterleaved<64>',
LayoutType.RowMajorInterleaved64: 'cutlass::layout::RowMajorInterleaved<64>',
LayoutType.TensorNWC: 'cutlass::layout::TensorNWC',
LayoutType.TensorNHWC: 'cutlass::layout::TensorNHWC',
LayoutType.TensorNDHWC: 'cutlass::layout::TensorNDHWC',
LayoutType.TensorNCHW: 'cutlass::layout::TensorNCHW',
@ -353,6 +360,9 @@ LayoutTag = {
LayoutType.TensorC32RSK32: 'cutlass::layout::TensorCxRSKx<32>',
LayoutType.TensorNC64HW64: 'cutlass::layout::TensorNCxHWx<64>',
LayoutType.TensorC64RSK64: 'cutlass::layout::TensorCxRSKx<64>',
LayoutType.TensorKCS: 'cutlass::layout::TensorKCS',
LayoutType.TensorKCSR: 'cutlass::layout::TensorKCSR',
LayoutType.TensorKCSRT: 'cutlass::layout::TensorKCSRT'
}
#
@ -378,6 +388,7 @@ ShortLayoutTypeNames = {
LayoutType.RowMajorInterleaved2: 't2',
LayoutType.RowMajorInterleaved32: 't32',
LayoutType.RowMajorInterleaved64: 't64',
LayoutType.TensorNWC: 'nwc',
LayoutType.TensorNHWC: 'nhwc',
LayoutType.TensorNDHWC: 'ndhwc',
LayoutType.TensorNCHW: 'nchw',
@ -385,7 +396,10 @@ ShortLayoutTypeNames = {
LayoutType.TensorNC32HW32: 'nc32hw32',
LayoutType.TensorNC64HW64: 'nc64hw64',
LayoutType.TensorC32RSK32: 'c32rsk32',
LayoutType.TensorC64RSK64: 'c64rsk64',
LayoutType.TensorKCS: 'kcs',
LayoutType.TensorKCSR: 'kcsr',
LayoutType.TensorKCSRT: 'kcsrt'
}
#
@ -410,6 +424,7 @@ class KernelScheduleType(enum.Enum):
TmaWarpSpecializedFP8FastAccum = enum_auto()
TmaWarpSpecializedCooperativeFP8FastAccum = enum_auto()
TmaWarpSpecializedPingpongFP8FastAccum = enum_auto()
ImplicitTmaWarpSpecializedSm90 = enum_auto()
#
KernelScheduleTag = {
KernelScheduleType.ScheduleAuto: 'cutlass::gemm::collective::KernelScheduleAuto',
@ -424,6 +439,7 @@ KernelScheduleTag = {
KernelScheduleType.TmaWarpSpecializedFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum',
KernelScheduleType.TmaWarpSpecializedCooperativeFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum',
KernelScheduleType.TmaWarpSpecializedPingpongFP8FastAccum: 'cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum',
KernelScheduleType.ImplicitTmaWarpSpecializedSm90: 'cutlass::conv::KernelImplicitTmaWarpSpecializedSm90',
}
#
@ -440,6 +456,7 @@ KernelScheduleSuffixes = {
KernelScheduleType.TmaWarpSpecializedFP8FastAccum: '_warpspecialized_fp8_fastaccum',
KernelScheduleType.TmaWarpSpecializedCooperativeFP8FastAccum: '_warpspecialized_cooperative_fp8_fastaccum',
KernelScheduleType.TmaWarpSpecializedPingpongFP8FastAccum: '_warpspecialized_pingpong_fp8_fastaccum',
KernelScheduleType.ImplicitTmaWarpSpecializedSm90: '_warpspecialized',
}
class EpilogueScheduleType(enum.Enum):
@ -578,8 +595,8 @@ class OperationKind(enum.Enum):
Rank2K = enum_auto()
Trmm = enum_auto()
Symm = enum_auto()
Conv2d = enum_auto()
Conv3d = enum_auto()
#
OperationKindNames = {
@ -588,11 +605,11 @@ OperationKindNames = {
, OperationKind.Rank2K: 'rank_2k'
, OperationKind.Trmm: 'trmm'
, OperationKind.Symm: 'symm'
, OperationKind.Conv2d: 'conv2d'
, OperationKind.Conv3d: 'conv3d'
}
#
class Target(enum.Enum):
library = enum_auto()
#
@ -708,7 +725,7 @@ class SwizzlingFunctor(enum.Enum):
StridedDgradIdentity4 = enum_auto()
StridedDgradHorizontal = enum_auto()
StreamK = enum_auto()
#
SwizzlingFunctorTag = {
SwizzlingFunctor.Identity1: 'cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>',
@ -834,11 +851,11 @@ GroupModeNames = {
#
class MathInstruction:
def __init__(self,
instruction_shape, \
element_a, element_b, element_accumulator, \
opcode_class, math_operation = MathOperation.multiply_add \
):
self.instruction_shape = instruction_shape
self.element_a = element_a
@ -887,15 +904,15 @@ class Direct2dConvFixedStrideDilationTileDescription:
self.maximum_compute_capability = max_compute
def procedural_name(self):
str_name = "%dx%dx%d_%dx%dx%dx%d_%d_filter%dx%d" % (self.threadblock_shape[0],
self.threadblock_shape[1],
self.threadblock_shape[2],
self.threadblock_output_shape[0],
self.threadblock_output_shape[1],
self.threadblock_output_shape[2],
self.threadblock_output_shape[3],
self.stages,
self.filter_shape[0],
self.filter_shape[1])
# Fixed Strided and dilation
if self.stride != [-1, -1] and self.dilation != [-1, -1]:
@ -920,15 +937,15 @@ class Direct2dConvFixedStrideDilationTileDescription:
self.maximum_compute_capability = max_compute
def procedural_name(self):
str_name = "%dx%dx%d_%dx%dx%dx%d_%d_filter%dx%d" % (self.threadblock_shape[0],
self.threadblock_shape[1],
self.threadblock_shape[2],
self.threadblock_output_shape[0],
self.threadblock_output_shape[1],
self.threadblock_output_shape[2],
self.threadblock_output_shape[3],
self.stages,
self.filter_shape[0],
self.filter_shape[1])
# Fixed Strided and dilation
if self.stride != [-1, -1] and self.dilation != [-1, -1]:


@ -67,6 +67,26 @@ _LOGGER = logging.getLogger(__name__)
class EmitOperationKindAll:
"""
Emit the OperationKind-level CUTLASS library initialization code.
The code is generated in the {generated_path}/{operation_kind} directory
(e.g., tools/library/generated/gemm in the build directory,
for OperationKind=Gemm), in the all_{operation_kind}_operations.cu file
(e.g., all_gemm_operations.cu for OperationKind=Gemm).
That file declares several functions in namespace cutlass::library.
The functions all have this form,
void initialize_{configuration_name}(Manifest& manifest);
The file also _defines_ the following function in that namespace.
void initialize_all_{operation_kind}_operations(Manifest& manifest);
That function calls all of the functions declared in this file.
Those functions are defined in subdirectories
(which this class does not create).
"""
def __init__(self, generated_path, kind, args):
self.generated_path = generated_path
self.kind = kind
@ -109,10 +129,15 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
#
def __enter__(self):
_LOGGER.debug("*** EmitOperationKindAll::__enter__")
self.operation_path = os.path.join(self.generated_path, OperationKindNames[self.kind])
_LOGGER.debug('*** operation_path (directory to create): ' +
str(self.operation_path));
os.makedirs(self.operation_path, exist_ok=True)
self.top_level_path = os.path.join(self.operation_path, f"all_{OperationKindNames[self.kind]}_operations.cu")
_LOGGER.debug(f"*** top_level_path (file to write): {str(self.top_level_path)}")
self.top_level_file = open(self.top_level_path, "w")
self.top_level_file.write(self.header_template)
@ -125,13 +150,22 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
#
def emit(self, operations):
_LOGGER.debug('*** EmitOperationKindAll::emit')
_LOGGER.debug(f"*** len(operations): {len(operations)}")
_LOGGER.debug(f"*** min_cc list: {sorted(min_cc for min_cc, _ in operations.items())}")
for min_cc, configurations in sorted(operations.items()):
_LOGGER.debug(f"*** min_cc={min_cc}")
for configuration_name, _ in configurations.items():
_LOGGER.debug(f"*** configuration_name={configuration_name}")
self.configurations.append(configuration_name)
self.top_level_file.write(SubstituteTemplate(self.configuration_prototype_template, {'configuration_name': configuration_name} ))
#
def __exit__(self, exception_type, exception_value, traceback):
_LOGGER.debug("*** EmitOperationKindAll::__exit__")
self.top_level_file.write(SubstituteTemplate(self.entry_template, {'operation_name': OperationKindNames[self.kind]}))
for configuration_name in self.configurations:
@ -142,6 +176,37 @@ void initialize_all_${operation_name}_operations(Manifest &manifest) {
class EmitOperationKindLibrary:
"""
Emit the CUTLASS library initialization code for each OperationKind.
The code is generated in the directory
{generated_path}/{operation_kind}/{min_cc}
(e.g., tools/library/generated/gemm/90 in the build directory,
for min_cc=90 and OperationKind=Gemm), in the file
all_sm{min_cc}_{operation_kind}_operations.cu
(e.g., all_sm90_gemm_operations.cu for min_cc=90 and OperationKind=Gemm).
The min_cc variable here indicates the minimum GPU architecture version
that the things to be initialized require.
For example, min_cc=90 indicates sm90.
That file declares several functions in namespace cutlass::library.
The functions all have this form,
void initialize_all_sm{min_cc}_{subclass_name}_{extended_name}_operations(Manifest& manifest);
where extended_name is operation.extended_name() for all the operations
given to the emit method (see emit below). (All operations for a given
configuration_name are guaranteed to have the same extended_name().)
The file also _defines_ the following function in that namespace.
void initialize_all_sm{min_cc}_{operation_kind}_operations(Manifest& manifest);
That function calls all of the functions declared in this file.
Those functions are defined in subdirectories.
The mapping from OperationKind to emitter handles the details
of what happens in each of those subdirectories.
"""
def __init__(self, generated_path, min_cc, kind, args):
self.generated_path = generated_path
self.min_cc = min_cc
@ -194,10 +259,17 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
#
def __enter__(self):
_LOGGER.debug("*** EmitOperationKindLibrary::__enter__")
_LOGGER.debug(f"*** generated_path: {str(self.generated_path)}")
_LOGGER.debug(f"*** OperationKindNames[kind]: {OperationKindNames[self.kind]}")
_LOGGER.debug(f"*** min_cc: {self.min_cc}")
self.operation_path = os.path.join(self.generated_path, OperationKindNames[self.kind], str(self.min_cc))
_LOGGER.debug(f"*** operation_path (directory to make): {str(self.operation_path)}")
os.makedirs(self.operation_path)
self.top_level_path = os.path.join(self.operation_path, f"all_sm{self.min_cc}_{OperationKindNames[self.kind]}_operations.cu")
_LOGGER.debug(f"*** top_level_path (file to write): {str(self.top_level_path)}")
self.top_level_file = open(self.top_level_path, "w")
self.top_level_file.write(self.header_template)
@ -216,16 +288,21 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
#
def emit(self, configuration_name, operations):
_LOGGER.debug("*** EmitOperationKindLibrary::emit")
_LOGGER.debug(f"*** configuration_name: {configuration_name}")
assert len(operations) > 0
# The extended name for all operations of a given configuration_name is guaranteed
# to be the same because extended_name() is used in defining configuration_name. Thus,
# we can safely use the extended_name() of the first operation.
extended_name = operations[0].extended_name()
_LOGGER.debug('*** extended_name (for all ops): ' + extended_name)
# Create a directory for operations with this subclass if it does not exist
if extended_name not in self.subclass_files:
subclass_path = os.path.join(self.operation_path, extended_name)
_LOGGER.debug(f"*** subclass_path: {str(subclass_path)}")
os.mkdir(subclass_path)
self.subclass_configurations[extended_name] = []
@ -233,16 +310,23 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
# Open a new top-level file for this sub class
subclass_top_level_path = os.path.join(
subclass_path, f"all_sm{self.min_cc}_{extended_name}_{OperationKindNames[self.kind]}_operations.cu")
_LOGGER.debug('*** subclass_top_level_path (min_cc, extended_name, ' +
'OperationKind): ' + str(subclass_top_level_path))
self.subclass_files[extended_name] = open(subclass_top_level_path, "w")
self.subclass_files[extended_name].write(self.header_template)
self.source_files[extended_name] = [subclass_top_level_path]
subclass_dir = os.path.dirname(self.subclass_files[extended_name].name)
_LOGGER.debug('*** subclass_dir: ' + str(subclass_dir))
with self.emitters[self.kind](subclass_dir, configuration_name) as configuration_emitter:
for operation in operations:
configuration_emitter.emit(operation)
_LOGGER.debug('*** configuration_emitter.configuration_path: ' +
str(configuration_emitter.configuration_path))
self.source_files[extended_name].append(configuration_emitter.configuration_path)
self.subclass_configurations[extended_name].append(configuration_name)
@ -250,6 +334,7 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
#
def __exit__(self, exception_type, exception_value, traceback):
_LOGGER.debug("*** EmitOperationKindLibrary::__exit__")
for subclass_name, subclass_file in sorted(self.subclass_files.items()):
subclass_cfg = {
'min_cc': str(self.min_cc),
@ -290,6 +375,29 @@ void initialize_all_sm${min_cc}_${subclass_name}_${operation_name}_operations(Ma
self.top_level_file.close()
class EmitInterfaceLibrary:
"""
Emit the topmost-level CUTLASS library initialization code.
The code is generated in the generated_path directory
(e.g., tools/library/generated in the build directory),
in the initialize_all.cpp file.
That file declares several functions in namespace cutlass::library.
The functions all have this form,
void initialize_all_{operation_kind}_operations(Manifest& manifest);
where {operation_kind} abbreviates the "kind" of operation
(e.g., gemm for matrix-matrix multiply, conv2d for 2-d convolution,
or trmm for triangular solve with multiple right-hand sides).
The definitions of these functions live in subdirectories.
The file also _defines_ the following function in that namespace.
void initialize_all(Manifest& manifest);
That function first prepares the manifest, and then
calls all of the functions declared in this file.
"""
def __init__(self, generated_path, operation_count, args):
self.generated_path = generated_path
self.args = args
@ -335,7 +443,10 @@ ${fn_calls}
#
def __enter__(self):
_LOGGER.debug("*** EmitInterfaceLibrary::__enter__")
self.top_level_path = os.path.join(self.generated_path, 'initialize_all.cpp')
_LOGGER.debug("*** top_level_path: " + str(self.top_level_path))
self.top_level_file = open(self.top_level_path, "w")
self.top_level_file.write(self.top_level_hdr_template)
@ -346,6 +457,9 @@ ${fn_calls}
#
def emit(self, operation_name):
_LOGGER.debug("*** EmitInterfaceLibrary::emit")
_LOGGER.debug("*** operation_name: " + operation_name)
self.prototypes.append(SubstituteTemplate(
"\t\tvoid initialize_all_${operation_kind}_operations(Manifest &manifest);",
{'operation_kind': operation_name}))
@ -356,6 +470,8 @@ ${fn_calls}
#
def __exit__(self, exception_type, exception_value, traceback):
_LOGGER.debug("*** EmitInterfaceLibrary::__exit__")
self.top_level_file.write(SubstituteTemplate(self.top_level_prologue, {'prototypes':"\n".join(self.prototypes)}))
# Write out initialize_all method
@ -398,8 +514,14 @@ class Manifest:
self.kernel_filter = self.args.kernels
self.curr_build_dir = args.curr_build_dir
+# A common user error is to use commas instead of semicolons.
+if ',' in args.architectures:
+  raise RuntimeError("The list of architectures (CMake option CUTLASS_NVCC_ARCHS) must be semicolon-delimited.\nDon't use commas to separate the architectures; use semicolons.\nYou specified the list as: " + args.architectures)
architectures = args.architectures.split(';') if len(args.architectures) else ['50',]
-architectures = [x if x != '90a' else '90' for x in architectures]
+arch_conditional_cc = ['90a']
+architectures = [x if x not in arch_conditional_cc else x.split('a')[0] for x in architectures]
self.compute_capabilities = [int(x) for x in architectures]
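The architecture-list handling above can be condensed into a standalone sketch (the function name `parse_architectures` is hypothetical; the parsing rules mirror the manifest code):

```python
def parse_architectures(arch_list: str) -> list:
    """Parse a semicolon-delimited CMake-style architecture list into
    integer compute capabilities, mirroring Manifest.__init__."""
    if ',' in arch_list:
        raise RuntimeError(
            "The list of architectures must be semicolon-delimited, not comma-delimited.")
    # Empty input falls back to SM50; architecture-conditional names such as
    # '90a' are mapped to their base compute capability (90).
    architectures = arch_list.split(';') if len(arch_list) else ['50']
    arch_conditional_cc = ['90a']
    architectures = [x if x not in arch_conditional_cc else x.split('a')[0]
                     for x in architectures]
    return [int(x) for x in architectures]

print(parse_architectures('80;90a'))  # expected: [80, 90]
```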
if args.filter_by_cc in ['false', 'False', '0']:
@ -681,8 +803,7 @@ class Manifest:
for min_cc, configurations in sorted(ops.items()):
with operation_emitters[target](generated_path, min_cc, operation_kind, self.args) as operation_kind_emitter:
for configuration_name, operations in configurations.items():
-_LOGGER.info("Emitting {config} with {num_ops} operations.".format(
-  config = configuration_name, num_ops = len(operations)))
+_LOGGER.info(f"Emitting {configuration_name} with {len(operations)} operation{'' if len(operations) == 1 else 's'}.")
operation_kind_emitter.emit(configuration_name, operations)
for subclass, files in operation_kind_emitter.source_files.items():


@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
setup(
name='cutlass_library',
-version='3.4.1',
+version='3.5.0',
description='CUTLASS library generation scripts',
packages=['cutlass_library']
)


@ -36,7 +36,7 @@ from setuptools import setup
def perform_setup():
setup(
name='pycute',
-version='3.4.1',
+version='3.5.0',
description='Python implementation of CuTe',
packages=['pycute'],
)