CUTLASS 3.0.0 (#786)

* CUTLASS 3.0.0
2023-01-23 17:55:28 -08:00
parent 66d9cddc83
commit 277bd6e537
377 changed files with 76396 additions and 1186 deletions
--- a/tools/library/scripts/pycutlass/README.md
+++ b/tools/library/scripts/pycutlass/README.md
@ -81,13 +81,24 @@ The tiling size of above operations can also be customized.
 ## Installation

 ### Using Docker
-You can run the PyCUTLASS on NGC PyTorch container. 
+We recommend running PyCUTLASS in a Docker image built from one of the provided Dockerfiles.
+
+**To run CUTLASS 3 GEMM kernels targeting the NVIDIA Hopper architecture via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda12.0) based on the NGC CUDA 12.0 container:
 ```shell
-docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.09-py3
+docker build -t pycutlass-cuda12.0:latest -f docker/Dockerfile-cuda12.0 .
+docker run --gpus all -it --rm pycutlass-cuda12.0:latest
+```
+Note that this Docker container does not include CuPy or PyTorch, so it cannot run PyCUTLASS examples that
+depend on these packages.
+
+**To run CUTLASS 2.x kernels targeting pre-SM90 architectures via PyCUTLASS,** you can use an included [Dockerfile](docker/Dockerfile-cuda11.8-pytorch) based on an NGC PyTorch container:
+```shell
+docker build -t pycutlass-cuda11.8-pytorch:latest -f docker/Dockerfile-cuda11.8-pytorch .
+docker run --gpus all -it --rm pycutlass-cuda11.8-pytorch:latest
 ```

 ### Environment variables
-PyCUTLASSS requires two environment variables:
+PyCUTLASS requires two environment variables:
 * `CUTLASS_PATH`: the root directory of CUTLASS. You can set this from the location at which you cloned CUTLASS via: `export CUTLASS_PATH=$(pwd)`.
 * `CUDA_INSTALL_PATH`: the directory where the CUDA Toolkit is installed. If running in bash with `nvcc` installed under a CUDA Toolkit, you can set this to the location of your `nvcc` installation via: `export CUDA_INSTALL_PATH=$(which nvcc | awk -F'/bin/nvcc' '{print $1}')`
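
The two variables described above can also be derived from Python before importing PyCUTLASS. The following is a minimal sketch, assuming `nvcc` is installed at `<toolkit>/bin/nvcc` and that the script is run from the CUTLASS checkout root. The helper name `cuda_install_path_from_nvcc` is hypothetical and not part of PyCUTLASS.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def cuda_install_path_from_nvcc(nvcc_path: Optional[str] = None) -> Optional[str]:
    """Return the toolkit root for an nvcc binary located at <root>/bin/nvcc.

    Hypothetical helper for illustration; it mirrors the awk one-liner in
    the README by stripping the trailing /bin/nvcc component.
    """
    nvcc = nvcc_path or shutil.which("nvcc")
    if nvcc is None:
        return None  # nvcc is not on PATH
    path = Path(nvcc)
    if path.name != "nvcc" or path.parent.name != "bin":
        return None  # unexpected layout; do not guess
    return str(path.parent.parent)


if __name__ == "__main__":
    root = cuda_install_path_from_nvcc()
    if root is not None:
        # Only set the variable if the user has not already exported it.
        os.environ.setdefault("CUDA_INSTALL_PATH", root)
    # Assumption: the current directory is the root of the CUTLASS clone.
    os.environ.setdefault("CUTLASS_PATH", os.getcwd())
    print(os.environ.get("CUDA_INSTALL_PATH"), os.environ["CUTLASS_PATH"])
```

Using `setdefault` keeps any value already exported in the shell, so the sketch only fills in missing variables.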

--- a/tools/library/scripts/pycutlass/build.sh
+++ b/tools/library/scripts/pycutlass/build.sh
@ -1,4 +1,36 @@
-pip install pybind11
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+pip install -U pybind11
 git clone https://github.com/google/googletest.git
-python setup.py install
+python setup.py develop --user
 python setup.py rmm
--- a/tools/library/scripts/pycutlass/build_doc.sh
+++ b/tools/library/scripts/pycutlass/build_doc.sh
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 pip install enum-tools
 pip install sphinx-toolbox
 pip install m2r2
--- a/tools/library/scripts/pycutlass/docker/Dockerfile-cuda11.8-pytorch
+++ b/tools/library/scripts/pycutlass/docker/Dockerfile-cuda11.8-pytorch
@ -0,0 +1,40 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+FROM nvcr.io/nvidia/pytorch:22.11-py3
+
+RUN chmod ugo+rwx /home
+RUN pip uninstall -y rmm
+RUN pip install rmm-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
+ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
+ENV CUDA_INSTALL_PATH=/usr/local/cuda
--- a/tools/library/scripts/pycutlass/docker/Dockerfile-cuda12.0
+++ b/tools/library/scripts/pycutlass/docker/Dockerfile-cuda12.0
@ -0,0 +1,46 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+FROM nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu20.04
+
+RUN apt-get update
+RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
+RUN apt-get install -y git cmake vim python3 python3-pip
+RUN ln -s /usr/bin/python3 /usr/bin/python
+RUN chmod ugo+rwx /home
+RUN pip install numpy==1.23
+RUN pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+RUN pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+RUN pip install cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
+ENV LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu/:$LIBRARY_PATH
+ENV CUDA_INSTALL_PATH=/usr/local/cuda
--- a/tools/library/scripts/pycutlass/setup.py
+++ b/tools/library/scripts/pycutlass/setup.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import distutils.cmd
 from setuptools import setup
 import setuptools.command.build_py
@ -15,7 +47,7 @@ class BuildRMM(distutils.cmd.Command):
            import rmm
        except ImportError:
            print("installing rmm")
-            os.system("git clone -b branch-22.08 --recurse-submodules https://github.com/rapidsai/rmm.git")
+            os.system("git clone -b branch-22.10 --recurse-submodules https://github.com/rapidsai/rmm.git")
            os.chdir("./rmm")
            os.system("./build.sh librmm rmm")
            os.chdir("./python")
@ -43,7 +75,11 @@ try:
        Pybind11Extension("cutlass",
                          ["src/cpp/cutlass.cpp"],
                          include_dirs=include_dirs,
-                          extra_compile_args=["-fpermissive", "-w"])
+                          extra_compile_args=["-fpermissive", "-w", "-std=c++17"]),
+        Pybind11Extension("cute",
+                          ["src/cpp/cute.cpp"],
+                          include_dirs=include_dirs,
+                          extra_compile_args=["-fpermissive", "-w", "-std=c++17"])
    ]
 except ImportError:
    pass
@ -65,7 +101,7 @@ setup(
    install_requires=[
        "numpy<1.23",
        'pybind11',
-        'cuda-python<11.7.0',
+        'cuda-python>=11.8.0',
        'typeguard',
        'bfloat16',
        'typing',
--- a/tools/library/scripts/pycutlass/src/cpp/cute.cpp
+++ b/tools/library/scripts/pycutlass/src/cpp/cute.cpp
@ -0,0 +1,54 @@
+/***************************************************************************************************
+ * Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+/* \file
+   \brief binding CuTe C++ APIs to Python
+*/
+
+#include <pybind11/pybind11.h>
+#include <pybind11/stl_bind.h>
+
+#include "cute/arch/mma_sm90_gmma.hpp"
+
+namespace py = pybind11;
+
+
+PYBIND11_MODULE(cute, m) {
+
+    // module doc
+    m.doc() = "CuTe C++ bindings";
+
+    py::enum_<cute::GMMA::Major>(m, "GMMAMajor",
+        R"pbdoc(classification of CuTe GMMA tensor major specification)pbdoc")
+        .value("K", cute::GMMA::Major::K,
+            R"pbdoc(Tensor is contiguous in reduction dimension)pbdoc")
+        .value("MN", cute::GMMA::Major::MN,
+            R"pbdoc(Tensor is contiguous in non-reduction dimension)pbdoc");
+}
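
To give a rough idea of what this binding looks like from the Python side, here is a pure-Python stand-in built with the standard `enum` module. It does not import the compiled `cute` extension; the numeric values and the `describe` helper are assumptions made for illustration, and only the member names and docstrings are taken from the binding above.

```python
import enum


class GMMAMajor(enum.Enum):
    """Pure-Python stand-in for the `cute.GMMAMajor` binding (illustrative only)."""

    K = 0   # assumed value: tensor is contiguous in the reduction dimension
    MN = 1  # assumed value: tensor is contiguous in a non-reduction dimension


def describe(major: GMMAMajor) -> str:
    """Return the docstring text attached to each member in the binding."""
    docs = {
        GMMAMajor.K: "Tensor is contiguous in reduction dimension",
        GMMAMajor.MN: "Tensor is contiguous in non-reduction dimension",
    }
    return docs[major]


for member in GMMAMajor:
    print(member.name, "->", describe(member))
```

With the extension built, the same member names would presumably be reached as `cute.GMMAMajor.K` and `cute.GMMAMajor.MN`.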
--- a/tools/library/scripts/pycutlass/src/cpp/cutlass.cpp
+++ b/tools/library/scripts/pycutlass/src/cpp/cutlass.cpp
@ -29,8 +29,9 @@
 *
 **************************************************************************************************/
 /* \file
-   \brief binding cutlass C++ APIs to python
+   \brief binding CUTLASS C++ APIs to Python
 */
+
 #include <pybind11/pybind11.h>
 #include <pybind11/stl_bind.h>

--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_generic.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_generic.h
@ -34,6 +34,7 @@
  \brief A generic wrapper around an epilogue visitor operation
 */

+
 #pragma once

 #include "cutlass/cutlass.h"
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/binary_ops.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/binary_ops.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Binary operations to be used within the epilogue visitor model.
+  
+  \brief A file contains the binary ops
 */

 #pragma once
@ -44,7 +44,7 @@ namespace cutlass {
 /////////////////////////////////////////////////////////////////////////////////////////////////


-/// Elementwise addition of two arrays
+/// Scalar multiplication
 template <typename T, int N>
 struct VectorAdd {

--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Unary operations to be used within the epilogue visitor model.
+  
+  \brief A file contains the unary ops
 */

 #pragma once
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_accumulator.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_accumulator.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that simply returns the accumulator
+  
+  \brief A file contains the epilogue visitor Op with accumulator
 */

 #pragma once
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_binary.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_binary.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operator performing a binary operation between two visitor nodes
+  
+  \brief A file contains the epilogue visitor Op with Binary op
 */

 #pragma once
@ -84,7 +84,6 @@ public:
    /// Fragment type of accumulator
    using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-    /// Combination Op TODO: generalize this
    using BinaryOp = BinaryOp_<ElementCompute, kElementsPerAccess>;

    static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_column_broadcast.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_column_broadcast.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that broadcasts a vector to all columns
+  
+  \brief A file contains the epilogue visitor Op with broadcasting vector to all columns
 */

 #pragma once
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_column_reduction.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_column_reduction.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that performs a column-wise reduction within a threadblock
+  
+  \brief A file contains the epilogue visitor Op with reduction over columns in CTA
 */

 #pragma once
@ -68,7 +68,6 @@ public:

    static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;

-    // TODO: generalize the reduction op
    using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
    using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
    using ElementOutput = typename OutputTileIterator::Element;
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_linear_combination.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_linear_combination.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that performs a linear combination of two visitor nodes
+  
+  \brief A file contains the epilogue visitor Op with Linear Combination
 */

 #pragma once
@ -82,7 +82,7 @@ public:
    /// Fragment type of accumulator
    using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-    /// Combination Op TODO: generalize this
+    /// Combination Op
    using CombinationOp = cutlass::plus<VisitAccessType>;

    static_assert(kElementsPerAccess==VisitAccessTypeA::kElements, "kElementsPerAccess mismatches with Visitor A");
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_row_broadcast.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_row_broadcast.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that broadcasts a vector to all rows
+  
+  \brief A file contains the epilogue visitor Op with broadcasting vector to all rows
 */

 #pragma once
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_row_reduction.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_row_reduction.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operation that performs a column-wise reduction within a threadblock
+  
+  \brief A file contains the epilogue visitor Op with reduction over rows in CTA
 */

 #pragma once
@ -69,7 +69,6 @@ public:

    static int const kElementsPerAccess = OutputTileIterator::kElementsPerAccess;

-    // TODO: generalize the reduction op
    using ReductionOp = cutlass::plus<Array<ElementReductionAccumulator, kElementsPerAccess>>;
    using ReductionOpScalar = cutlass::plus<ElementReductionAccumulator>;
    using ElementOutput = typename OutputTileIterator::Element;
--- a/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h
@ -30,8 +30,8 @@
 **************************************************************************************************/

 /*! \file
-
-  \brief Epilogue visitor operator performing a unary operation atop a visitor node
+  
+  \brief A file contains the epilogue visitor Op with Unary operation
 */

 #pragma once
@ -79,7 +79,7 @@ public:
    /// Fragment type of accumulator
    using AccumulatorAccessType = Array<ElementAccumulator, kElementsPerAccess>;

-    /// Combination Op TODO: generalize this
+    /// Combination Op
    using UnaryOp = UnaryOp_<ElementCompute, kElementsPerAccess>;

    static_assert(kElementsPerAccess==VisitAccessTypeVisitor::kElements, "kElementsPerAccess mismatches with Visitor");
--- a/tools/library/scripts/pycutlass/src/cpp/include/gemm/gemm_universal_with_visitor.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/gemm/gemm_universal_with_visitor.h
@ -30,7 +30,7 @@
 **************************************************************************************************/

 /*! \file
-    \brief 
+    \brief
 */

 #pragma once
@ -139,8 +139,8 @@ public:
    //
    // Methods
    //
-    
-    Arguments(): 
+
+    Arguments():
      ptr_A(nullptr), ptr_B(nullptr), ptr_C(nullptr), ptr_D(nullptr),
      ptr_gather_A_indices(nullptr),
      ptr_gather_B_indices(nullptr),
@ -169,8 +169,8 @@ public:
      int const *ptr_scatter_D_indices = nullptr
    ):
      UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
-      epilogue_visitor(epilogue_visitor), 
-      ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D), 
+      epilogue_visitor(epilogue_visitor),
+      ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
      batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
      stride_a(stride_a), stride_b(stride_b), stride_c(stride_c), stride_d(stride_d),
      ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@ -205,8 +205,8 @@ public:
      int const *ptr_scatter_D_indices = nullptr
    ):
      UniversalArgumentsBase(mode, problem_size, batch_count, batch_stride_D),
-      epilogue_visitor(epilogue_visitor), 
-      ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D), 
+      epilogue_visitor(epilogue_visitor),
+      ptr_A(ptr_A), ptr_B(ptr_B), ptr_C(ptr_C), ptr_D(ptr_D),
      batch_stride_A(batch_stride_A), batch_stride_B(batch_stride_B), batch_stride_C(batch_stride_C),
      lda(lda), ldb(ldb), ldc(ldc), ldd(ldd),
      ptr_gather_A_indices(ptr_gather_A_indices), ptr_gather_B_indices(ptr_gather_B_indices),
@ -221,7 +221,7 @@ public:
    /// Returns arguments for the transposed problem
    Arguments transposed_problem() const {
      Arguments args(*this);
-      
+
      std::swap(args.problem_size.m(), args.problem_size.n());
      std::swap(args.ptr_A, args.ptr_B);
      std::swap(args.lda, args.ldb);
@ -256,7 +256,7 @@ public:
    typename Mma::IteratorB::Params params_B;
    typename EpilogueVisitor::OutputTileIterator::Params params_C;
    typename EpilogueVisitor::OutputTileIterator::Params params_D;
-    
+
    typename EpilogueVisitor::Params epilogue_visitor;

    void * ptr_A;
@ -325,7 +325,7 @@ public:
      batch_stride_C = args.batch_stride_C;

      epilogue_visitor = args.epilogue_visitor;
-      
+
      semaphore = static_cast<int *>(workspace);
      CUTLASS_TRACE_HOST("GemmUniversal::Params::update()");
    }
@ -345,7 +345,7 @@ public:
  //

  CUTLASS_DEVICE
-  GemmUniversalwithEpilogueVisitor() { } 
+  GemmUniversalwithEpilogueVisitor() { }

  /// Determines whether kernel satisfies alignment
  static Status can_implement(
@ -455,12 +455,12 @@ public:
    //
    // Fetch pointers based on mode.
    //
-    if (params.mode == GemmUniversalMode::kGemm || 
+    if (params.mode == GemmUniversalMode::kGemm ||
      params.mode == GemmUniversalMode::kGemmSplitKParallel) {

      if (threadblock_tile_offset.k() + 1 < params.grid_tiled_shape.k()) {

-        problem_size_k = (threadblock_tile_offset.k() + 1) * params.gemm_k_size; 
+        problem_size_k = (threadblock_tile_offset.k() + 1) * params.gemm_k_size;
      }

      offset_k = threadblock_tile_offset.k() * params.gemm_k_size;
@ -529,10 +529,10 @@ public:

    // Compute threadblock-scoped matrix multiply-add
    mma(
-      gemm_k_iterations, 
-      accumulators, 
-      iterator_A, 
-      iterator_B, 
+      gemm_k_iterations,
+      accumulators,
+      iterator_A,
+      iterator_B,
      accumulators);

    //
@ -555,30 +555,16 @@ public:

    int block_idx = threadblock_tile_offset.m() + threadblock_tile_offset.n() * params.grid_tiled_shape.m();

-    ElementC *ptr_C = static_cast<ElementC *>(params.ptr_C); 
+    ElementC *ptr_C = static_cast<ElementC *>(params.ptr_C);
    ElementC *ptr_D = static_cast<ElementC *>(params.ptr_D);

    //
    // Fetch pointers based on mode.
    //
-    
+
    // Construct the semaphore.
    Semaphore semaphore(params.semaphore + block_idx, thread_idx);

-    // if (params.mode == GemmUniversalMode::kGemm) {
-
-    //   // TODO: fix this order
-    //   // If performing a reduction via split-K, fetch the initial synchronization
-    //   if (params.grid_tiled_shape.k() > 1) {
-        
-    //     // Fetch the synchronization lock initially but do not block.
-    //     semaphore.fetch();
-
-    //     // Indicate which position in a serial reduction the output operator is currently updating
-    //     output_op.set_k_partition(threadblock_tile_offset.k(), params.grid_tiled_shape.k());
-    //   }
-    // }
-    
    // Tile iterator loading from source tensor.

    EpilogueVisitor epilogue_visitor(
@ -590,9 +576,6 @@ public:
        params.problem_size.mn()
    );

-    // if (params.mode == GemmUniversalMode::kGemmSplitKParallel) {
-    //   ptr_D += threadblock_tile_offset.k() * params.batch_stride_D;
-    // }
    if (params.mode == GemmUniversalMode::kBatched || params.mode == GemmUniversalMode::kArray) {
      epilogue_visitor.set_batch_index(threadblock_tile_offset.k());
    }
@ -605,25 +588,20 @@ public:

    // Wait on the semaphore - this latency may have been covered by iterator construction
    if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {
-        
-      // For subsequent threadblocks, the source matrix is held in the 'D' tensor.
-      // TODO: ???
-      // if (threadblock_tile_offset.k()) {
-      //   iterator_C = iterator_D;
-      // }

+      // For subsequent threadblocks, the source matrix is held in the 'D' tensor.
      semaphore.wait(threadblock_tile_offset.k());
    }


    // Execute the epilogue operator to update the destination tensor.
-    epilogue(epilogue_visitor, accumulators); 
-    
+    epilogue(epilogue_visitor, accumulators);
+
    //
    // Release the semaphore
    //

-    if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) { 
+    if (params.mode == GemmUniversalMode::kGemm && params.grid_tiled_shape.k() > 1) {

      int lock = 0;
      if (params.grid_tiled_shape.k() == threadblock_tile_offset.k() + 1) {
@ -635,7 +613,7 @@ public:
        // Otherwise, the semaphore is incremented
        lock = threadblock_tile_offset.k() + 1;
      }
-      
+
      semaphore.release(lock);
    }
  }
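
The wait/release sequence in this kernel serializes the epilogues of the split-K partitions of one output tile. The sketch below simulates that ordering on the host with Python threads. It assumes that the final partition resets the lock to 0 (that branch falls in the part of the function elided from this diff), and the `Semaphore` class here is a hypothetical host-side stand-in, not the CUTLASS device `Semaphore`.

```python
import threading


class Semaphore:
    """Host-side stand-in: wait until the state equals a value, then release a new one."""

    def __init__(self):
        self.state = 0
        self._cond = threading.Condition()

    def wait(self, status):
        with self._cond:
            self._cond.wait_for(lambda: self.state == status)

    def release(self, status):
        with self._cond:
            self.state = status
            self._cond.notify_all()


def run_serial_split_k(k_partitions):
    """Run one simulated threadblock per k-partition; return the epilogue order."""
    semaphore = Semaphore()
    order = []

    def threadblock(k):
        semaphore.wait(k)   # block until all earlier partitions have finished
        order.append(k)     # stands in for epilogue(epilogue_visitor, accumulators)
        if k + 1 == k_partitions:
            lock = 0        # assumption: the last partition resets the lock
        else:
            lock = k + 1    # otherwise hand over to the next partition
        semaphore.release(lock)

    # Start threads in reverse order to show that the semaphore, not start order, decides.
    threads = [threading.Thread(target=threadblock, args=(k,))
               for k in reversed(range(k_partitions))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order, semaphore.state


print(run_serial_split_k(4))
```

Each partition only proceeds when the lock equals its own index, so the accumulation into the output tile happens in partition order regardless of scheduling.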
--- a/tools/library/scripts/pycutlass/src/cpp/include/swizzling.h
+++ b/tools/library/scripts/pycutlass/src/cpp/include/swizzling.h
@ -83,7 +83,6 @@ void bind_identity_swizzle(py::module & m, std::string name) {
            :param problem_size: Implicit gemm problem size conv_operator(NZPQK, NDHWC, KTRSC)
            :type problem_size: :class:`cutlass.gemm.GemmCoord`)
            )pbdoc")
-        // TODO: the returned dim3 is not usable in python
        .def("get_grid_shape", &T::get_grid_shape,
            py::arg("tiled_shape"), 
            R"pbdoc(Computes CUDA grid dimensions given a size in units of logical tiles)pbdoc")
--- a/tools/library/scripts/pycutlass/src/pycutlass/init.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/init.py
@ -31,6 +31,7 @@ from pycutlass.utils import *
 from pycutlass.frontend import *
 from pycutlass.reduction_operation import *
 from pycutlass.compiler import *
+from pycutlass.utils.device import device_cc

 # module-wide variables

@ -40,6 +41,12 @@ this = sys.modules[__name__]
 # artifact manager
 this.compiler = ArtifactManager()

+try:
+    if not hasattr(this, 'DEVICE_CC') or this.DEVICE_CC is None:
+        this.DEVICE_CC = device_cc()
+except Exception:
+    this.DEVICE_CC = None
+
 def get_memory_pool(init_pool_size=0, max_pool_size=2**34):
    this.memory_pool = PoolMemoryManager(
        init_pool_size=init_pool_size,
--- a/tools/library/scripts/pycutlass/src/pycutlass/builder/collective_op_builder.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/builder/collective_op_builder.py
@ -0,0 +1,395 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+"""
+Utilities for stamping out collective mainloops for SM90 kernels
+"""
+
+import cute
+import cutlass
+from pycutlass import SubstituteTemplate
+import pycutlass.library as library
+
+
+tma_alignment_bytes = 16
+cp_async_min_alignment_bytes = 4
+
+
+class RowColMajorToGMMAMajor:
+    @staticmethod
+    def A(layout, element):
+        """
+        Converts operand A's layout from row/column major format into CuTe's GMMA major format
+
+        :param layout: layout of the A operand
+        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
+        :param element: data type of the A operand
+
+        :return: C++ CuTe GMMA major format
+        :rtype: cute.GMMAMajor
+        """
+        type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
+        if layout == cutlass.ColumnMajor and not type_requires_k_major:
+            return cute.GMMAMajor.MN
+        else:
+            return cute.GMMAMajor.K
+
+    @staticmethod
+    def B(layout, element):
+        """
+        Converts operand B's layout from row/column major format into CuTe's GMMA major format
+
+        :param layout: layout of the B operand
+        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
+        :param element: data type of the B operand
+
+        :return: C++ CuTe GMMA major format
+        :rtype: cute.GMMAMajor
+        """
+        type_requires_k_major = (element == cutlass.tfloat32) or (element == cutlass.int8)
+        if layout == cutlass.RowMajor and not type_requires_k_major:
+            return cute.GMMAMajor.MN
+        else:
+            return cute.GMMAMajor.K
+
+
+def cluster_shape_to_tma(dim):
+    """
+    Returns the TMA copy type for a given cluster dimension
+
+    :param dim: a given dimension of a cluster
+    :type dim: int
+
+    :return: C++ TMA copy type
+    :rtype: str
+    """
+    return 'cute::SM90_TMA_LOAD' if dim == 1 else 'cute::SM90_TMA_LOAD_MULTICAST'
+
+
+def make_cpasync_gmem_tiled_copy(thread_count, element, alignment, gmma_layout, dim_mn, dim_k):
+    """
+    Returns a `make_tiled_copy` call for a given configuration
+
+    :param thread_count: number of threads in the threadblock
+    :type thread_count: int
+    :param element: datatype of the operand in question
+    :param alignment: byte alignment of the operand in question
+    :type alignment: int
+    :param gmma_layout: GMMA layout of the operand in question
+    :type gmma_layout: cute.GMMAMajor
+    :param dim_mn: extent of the M/N dimension of the tile
+    :type dim_mn: int
+    :param dim_k: extent of the reduction dimension of the tile
+    :type dim_k: int
+
+    :return: C++ call to `make_tiled_copy`
+    :rtype: str
+    """
+
+    emission_str = """decltype(cute::make_tiled_copy(
+        cute::Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<cute::uint_byte_t<static_cast<int>(sizeof(${element})) * ${alignment}>>, ${element}>{},
+        cute::Layout<cute::Shape<_${shape0_x}, _${shape0_y}>,
+                    cute::Stride<_${stride_x}, _${stride_y}>>{},
+        cute::Layout<cute::Shape<_${shape1_x}, _${shape1_y}>>{}))"""
+    if gmma_layout == cute.GMMAMajor.K:
+        threads_major = dim_k // alignment
+        threads_minor = thread_count // threads_major
+        values = {
+            'shape0_x': str(threads_minor),
+            'shape0_y': str(threads_major),
+            'stride_x': str(threads_major),
+            'stride_y': '1',
+            'shape1_x': '1',
+            'shape1_y': str(alignment)
+        }
+    elif gmma_layout == cute.GMMAMajor.MN:
+        threads_major = dim_mn // alignment
+        threads_minor = thread_count // threads_major
+        values = {
+            'shape0_x': str(threads_major),
+            'shape0_y': str(threads_minor),
+            'stride_x': '1',
+            'stride_y': str(threads_major),
+            'shape1_x': str(alignment),
+            'shape1_y': '1'
+        }
+    else:
+        raise Exception('Unexpected GMMA layout {}'.format(gmma_layout))
+
+    # Add common values
+    values['element'] = library.DataTypeTag[element]
+    values['alignment'] = str(alignment)
+    return SubstituteTemplate(emission_str, values)
+
+
+def max_stages(op, arch):
+    """
+    Returns the maximum number of pipeline stages that can be used for an operation.
+
+    :param op: operation for which the maximum stages should be computed. If stages are
+               set via the `op.tile_description.stages` parameter, this setting is ignored
+               in the present calculation
+    :type op: pycutlass.GemmOperation
+    :param arch: compute capability of the device on which the operation will be run
+    :type arch: int
+
+    :return: maximum number of pipeline stages that can be used for an operation
+    :rtype: int
+    """
+    smem_per_stage = library.CalculateSmemUsagePerStage(op)
+    smem_capacity = library.SharedMemPerCC[arch]
+    return int(smem_capacity // smem_per_stage)
+
+
+class LayoutToStride:
+    _variable_first = 'cute::Stride<int64_t, cute::Int<1>, int64_t>'
+    _variable_last  = 'cute::Stride<cute::Int<1>, int64_t, int64_t>'
+
+    @staticmethod
+    def A(layout):
+        """
+        Returns the CuTe stride type corresponding to the layout of operand A
+
+        :param layout: layout of the A operand
+        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
+
+        :return: C++ declaration of CuTe stride
+        :rtype: str
+        """
+        if layout == cutlass.RowMajor:
+            return LayoutToStride._variable_first
+        elif layout == cutlass.ColumnMajor:
+            return LayoutToStride._variable_last
+        else:
+            raise Exception('Unsupported layout {}'.format(layout))
+
+    @staticmethod
+    def B(layout):
+        """
+        Returns the CuTe stride type corresponding to the layout of operand B
+
+        :param layout: layout of the B operand
+        :type layout: cutlass.RowMajor or cutlass.ColumnMajor
+
+        :return: C++ declaration of CuTe stride
+        :rtype: str
+        """
+        if layout == cutlass.RowMajor:
+            return LayoutToStride._variable_last
+        elif layout == cutlass.ColumnMajor:
+            return LayoutToStride._variable_first
+        else:
+            raise Exception('Unsupported layout {}'.format(layout))
+
+
+EMISSION_STR = """
+using TileShape_MNK = cute::Shape<_${threadblock_shape_m}, _${threadblock_shape_n}, _${threadblock_shape_k}>;
+using ClusterShape_MNK = cute::Shape<_${cluster_shape_m}, _${cluster_shape_n}, _${cluster_shape_k}>;
+using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector<
+      ${internal_element_A}, ${internal_element_B}, ${element_accumulator}, TileShape_MNK, ${gmma_layout_A}, ${gmma_layout_B}>()));
+
+using SmemLayoutAtomA = decltype(cute::GMMA::smem_selector<${gmma_layout_A}, ${internal_element_A}, _${threadblock_shape_m}, _${threadblock_shape_k}>());
+using SmemLayoutAtomB = decltype(cute::GMMA::smem_selector<${gmma_layout_B}, ${internal_element_B}, _${threadblock_shape_n}, _${threadblock_shape_k}>());
+
+using CollectiveOp = typename cutlass::gemm::collective::CollectiveMma<
+    ${mainloop_type}<${stage_count}, ClusterShape_MNK${kernel_schedule}>,
+    TileShape_MNK,
+    ${element_A},
+    ${stride_A},
+    ${element_B},
+    ${stride_B},
+    TiledMma,
+    ${gmem_tiled_copy_A},
+    SmemLayoutAtomA,
+    void, // GMMA_SS does not need an SmemCopyAtom
+    ${transform_A},
+    ${gmem_tiled_copy_B},
+    SmemLayoutAtomB,
+    void, // GMMA_SS does not need an SmemCopyAtom
+    ${transform_B}
+>;
+"""
+
+
+def internal_element(element):
+    """
+    Returns the data type internally used for `element`.
+
+    :param element: data type
+
+    :return: data type used internally
+    """
+    return cutlass.tfloat32 if element == cutlass.float32 else element
+
+
+def common_values(op, stage_count, transform_A, transform_B):
+    """
+    Returns a dictionary containing common values to be substituted in the emission of the
+    collective operation declaration. Values specific to a particular collective operation
+    should be added to these.
+
+    :param op: GEMM operation for which to build a collective operation
+    :type op: pycutlass.GemmOperation
+    :param stage_count: number of pipeline stages to use in the operation
+    :type stage_count: int
+    :param transform_A: transformation to perform on the A operand
+    :type transform_A: str
+    :param transform_B: transformation to perform on the B operand
+    :type transform_B: str
+
+    :return: dictionary containing values to substitute in emission string
+    :rtype: dict
+    """
+    internal_element_a = internal_element(op.A.element)
+    internal_element_b = internal_element(op.B.element)
+
+    return {
+        'threadblock_shape_m': str(op.tile_description.threadblock_shape[0]),
+        'threadblock_shape_n': str(op.tile_description.threadblock_shape[1]),
+        'threadblock_shape_k': str(op.tile_description.threadblock_shape[2]),
+        'cluster_shape_m': str(op.tile_description.cluster_shape[0]),
+        'cluster_shape_n': str(op.tile_description.cluster_shape[1]),
+        'cluster_shape_k': str(op.tile_description.cluster_shape[2]),
+        'element_A': library.DataTypeTag[op.A.element],
+        'element_B': library.DataTypeTag[op.B.element],
+        'internal_element_A': library.DataTypeTag[internal_element_a],
+        'internal_element_B': library.DataTypeTag[internal_element_b],
+        'element_accumulator': library.DataTypeTag[op.accumulator_type()],
+        'gmma_layout_A': library.CuTeLayoutTag[RowColMajorToGMMAMajor.A(op.A.layout, internal_element_a)],
+        'gmma_layout_B': library.CuTeLayoutTag[RowColMajorToGMMAMajor.B(op.B.layout, internal_element_b)],
+        'stride_A': LayoutToStride.A(op.A.layout),
+        'stride_B': LayoutToStride.B(op.B.layout),
+        'stage_count': str(stage_count),
+        'transform_A': transform_A,
+        'transform_B': transform_B
+    }
+
+
+def build_gmma_tma(op):
+    """
+    Builds a collective operation declaration targeting TMA GMMA kernels
+
+    :param op: GEMM operation for which to build a collective operation
+    :type op: pycutlass.GemmOperation
+
+    :return: string containing the C++ declaration of collective operation
+    :rtype: str
+    """
+    A_tma_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % tma_alignment_bytes == 0
+    B_tma_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % tma_alignment_bytes == 0
+    if not A_tma_aligned or not B_tma_aligned:
+        raise Exception('Both the A and B operands must be aligned to {} bytes to use TMA'.format(tma_alignment_bytes))
+
+    max_stage_count = max_stages(op, arch=90)
+    if op.tile_description.stages is None:
+        op.tile_description.stages = max_stage_count
+    elif op.tile_description.stages > max_stage_count:
+        raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')
+
+    kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecialized'
+    if op.tile_description.persistent:
+        kernel_schedule = 'cutlass::gemm::KernelTmaWarpSpecializedPersistent'
+
+    transform_A = 'cute::identity'
+    transform_B = 'cute::identity'
+    values = common_values(op, op.tile_description.stages, transform_A, transform_B)
+    specific_values = {
+        'mainloop_type': 'cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized',
+        'kernel_schedule': ', ' + kernel_schedule,
+        'gmem_tiled_copy_A': cluster_shape_to_tma(op.tile_description.cluster_shape[1]),
+        'gmem_tiled_copy_B': cluster_shape_to_tma(op.tile_description.cluster_shape[0])
+    }
+    values.update(specific_values)
+
+    return SubstituteTemplate(EMISSION_STR, values)
+
+
+def build_gmma_cpasync(op):
+    """
+    Builds a collective operation declaration targeting cp.async GMMA kernels
+
+    :param op: GEMM operation for which to build a collective operation
+    :type op: pycutlass.GemmOperation
+
+    :return: string containing the C++ declaration of collective operation
+    :rtype: str
+    """
+    A_cp_async_aligned = (library.DataTypeSizeBytes[op.A.element] * op.A.alignment) % cp_async_min_alignment_bytes == 0
+    B_cp_async_aligned = (library.DataTypeSizeBytes[op.B.element] * op.B.alignment) % cp_async_min_alignment_bytes == 0
+    if not A_cp_async_aligned or not B_cp_async_aligned:
+        raise Exception('Both the A and B operands must be aligned to {} bytes to use cp.async'.format(cp_async_min_alignment_bytes))
+
+    max_stage_count = max_stages(op, arch=90)
+    if op.tile_description.stages is None:
+        op.tile_description.stages = max_stage_count
+    elif op.tile_description.stages > max_stage_count:
+        raise Exception('Combination of threadblock shape, data types, and number of stages exceeds shared memory capacity.')
+
+    transform_A = 'cute::identity'
+    transform_B = 'cute::identity'
+
+    thread_count = 128
+    cpasync_copy_A = make_cpasync_gmem_tiled_copy(thread_count, op.A.element, op.A.alignment, RowColMajorToGMMAMajor.A(op.A.layout, op.A.element),
+                                                  op.tile_description.threadblock_shape[0], op.tile_description.threadblock_shape[2])
+    cpasync_copy_B = make_cpasync_gmem_tiled_copy(thread_count, op.B.element, op.B.alignment, RowColMajorToGMMAMajor.B(op.B.layout, op.B.element),
+                                                  op.tile_description.threadblock_shape[1], op.tile_description.threadblock_shape[2])
+
+    values = common_values(op, op.tile_description.stages, transform_A, transform_B)
+    specific_values = {
+        'mainloop_type': 'cutlass::gemm::MainloopSm90CpAsyncGmma',
+        'kernel_schedule': '',
+        'gmem_tiled_copy_A': cpasync_copy_A,
+        'gmem_tiled_copy_B': cpasync_copy_B
+    }
+    values.update(specific_values)
+
+    return SubstituteTemplate(EMISSION_STR, values)
+
+
+def build(operation):
+    """
+    Builds a collective operation declaration targeting cp.async or TMA for GMMA kernels
+
+    :param operation: GEMM operation for which to build a collective operation
+    :type operation: pycutlass.GemmOperation
+
+    :return: string containing the C++ declaration of collective operation
+    :rtype: str
+    """
+    A_tma_aligned = (library.DataTypeSizeBytes[operation.A.element] * operation.A.alignment) % tma_alignment_bytes == 0
+    B_tma_aligned = (library.DataTypeSizeBytes[operation.B.element] * operation.B.alignment) % tma_alignment_bytes == 0
+    tma_correct_size = (library.DataTypeSizeBytes[operation.A.element] == 2 and library.DataTypeSizeBytes[operation.B.element] == 2)
+    tma_correct_layout = (operation.A.layout == cutlass.RowMajor or operation.B.layout == cutlass.ColumnMajor)
+    if A_tma_aligned and B_tma_aligned and (tma_correct_size or tma_correct_layout):
+        return build_gmma_tma(operation)
+    else:
+        return build_gmma_cpasync(operation)
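
The choice between the TMA and cp.async mainloops in `build` comes down to a few byte-level checks. The sketch below restates that predicate in standalone Python so it can be tried without `cutlass` or `pycutlass` installed. The function name `pick_mainloop`, the string layout tags, and passing element sizes as plain integers are all assumptions made for illustration.

```python
TMA_ALIGNMENT_BYTES = 16  # mirrors tma_alignment_bytes in the file above


def pick_mainloop(size_a, align_a, layout_a, size_b, align_b, layout_b):
    """Return 'tma' or 'cp.async' for a pair of operand descriptions.

    size_*   : element size in bytes
    align_*  : operand alignment in elements
    layout_* : 'row' or 'column' (stand-ins for cutlass.RowMajor/ColumnMajor)
    """
    # Alignment in bytes is element size times alignment in elements.
    a_aligned = (size_a * align_a) % TMA_ALIGNMENT_BYTES == 0
    b_aligned = (size_b * align_b) % TMA_ALIGNMENT_BYTES == 0
    # TMA is chosen when both operands are 2-byte types, or when A is
    # row-major or B is column-major.
    correct_size = size_a == 2 and size_b == 2
    correct_layout = layout_a == "row" or layout_b == "column"
    if a_aligned and b_aligned and (correct_size or correct_layout):
        return "tma"
    return "cp.async"


# 2-byte elements with 8-element alignment give 16 bytes, so TMA is picked.
print(pick_mainloop(2, 8, "column", 2, 8, "row"))
# 4-byte elements, 16-byte aligned, but A column-major and B row-major.
print(pick_mainloop(4, 4, "column", 4, 4, "row"))
```

Note that the layout condition is an `or`, so a row-major A operand alone is enough to satisfy it once both alignment checks pass.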
--- a/tools/library/scripts/pycutlass/src/pycutlass/c_types.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/c_types.py
@ -33,8 +33,6 @@
 import ctypes
 from pycutlass.library import *

-# 12B
-

 class GemmCoord_(ctypes.Structure):
    _fields_ = [
@ -48,6 +46,24 @@ class GemmCoord_(ctypes.Structure):
            setattr(self, field_name, getattr(gemm_coord, field_name)())


+class GemmCoordBatched_(ctypes.Structure):
+    """
+    Wrapper around a GemmCoord that also contains batch count. This is used for encoding
+    batched GEMM inputs to CUTLASS 3 GEMMs.
+    """
+    _fields_ = [
+        ("m", ctypes.c_int),
+        ("n", ctypes.c_int),
+        ("k", ctypes.c_int),
+        ("batch_count", ctypes.c_int)
+    ]
+
+    def __init__(self, gemm_coord, batch_count) -> None:
+        for field_name, _ in self._fields_[:-1]:
+            setattr(self, field_name, getattr(gemm_coord, field_name)())
+        setattr(self, "batch_count", batch_count)
+
+
 class MatrixCoord_(ctypes.Structure):
    _fields_ = [
        ("row", ctypes.c_int),
@ -55,6 +71,26 @@ class MatrixCoord_(ctypes.Structure):
    ]


+class dim3_(ctypes.Structure):
+    _fields_ = [
+        ("x", ctypes.c_int),
+        ("y", ctypes.c_int),
+        ("z", ctypes.c_int)
+    ]
+
+
+class StrideBatched_(ctypes.Structure):
+    """
+    CUTLASS 3.0 strides for operands contain one static dimension and two variable dimensions. The
+    variable dimensions represent the stride along the non-unit-stride dimension of the row/column major
+    layout, and the batch stride. This structure encodes the two variable dimensions.
+    """
+    _fields_ = [
+        ("major_stride", ctypes.c_int64),
+        ("batch_stride", ctypes.c_int64)
+    ]
+
+
 dtype2ctype = {
    cutlass.float16: ctypes.c_uint16,
    cutlass.float32: ctypes.c_float,
@ -63,6 +99,28 @@ dtype2ctype = {
 }


+def get_gemm_arguments_3x(epilogue_functor):
+
+    _EpilogueOutputOpParams = epilogue_functor.epilogue_type
+
+    class _GemmArguments(ctypes.Structure):
+        _fields_ = [
+            ("mode", ctypes.c_int),
+            ("problem_size", GemmCoordBatched_),
+            ("ptr_A", ctypes.c_void_p),
+            ("stride_A", StrideBatched_),
+            ("ptr_B", ctypes.c_void_p),
+            ("stride_B", StrideBatched_),
+            ("ptr_C", ctypes.c_void_p),
+            ("stride_C", StrideBatched_),
+            ("ptr_D", ctypes.c_void_p),
+            ("stride_D", StrideBatched_),
+            ("epilogue", _EpilogueOutputOpParams),
+        ]
+
+    return _GemmArguments, _EpilogueOutputOpParams
+
+
 def get_gemm_arguments(epilogue_functor):

    _EpilogueOutputOpParams = epilogue_functor.epilogue_type
@ -103,8 +161,6 @@ def get_gemm_arguments(epilogue_functor):
 # GEMM Grouped
 ###########################################################################################

-# include/cutlass/gemm/kernel/gemm_grouped.h
-
 def get_gemm_grouped_arguments(epilogue_functor):
    _EpilogueOutputOpParams = epilogue_functor.epilogue_type

@ -131,12 +187,6 @@ def get_gemm_grouped_arguments(epilogue_functor):
 # Convolution2D
 ############################################################################################

-
-# We use the arguments as the interface
-
-
-# include/cutlass/conv/conv2d_problem_size.h
-# 64B
 class Conv2DProblemSize(ctypes.Structure):
    _fields_ = [
        ("N", ctypes.c_int),
@ -164,8 +214,6 @@ class Conv2DProblemSize(ctypes.Structure):
            setattr(self, field_name, getattr(problem_size, field_name))


-# include/cutlass/layout/tensor.h
-# 12B
 class Layout4D(ctypes.Structure):
    _fields_ = [
        ("stride", ctypes.c_int * 3)
@ -175,13 +223,7 @@ class Layout4D(ctypes.Structure):
        stride = tensor_ref.stride()
        setattr(self, "stride", (stride.at(0), stride.at(1), stride.at(2)))

-# TODO: Tensor 5-D takes ("stride", ctypes.c_int * 4)

-
-# include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h
-# TensorRef is basically cutlass::TensorRef<Element, Layout>;
-# include/cutlass/tensor_ref.h
-# 24B
 class TensorRef_(ctypes.Structure):
    _fields_ = [
        ("ptr", ctypes.c_void_p),
@ -200,9 +242,6 @@ class TensorRef2D_(ctypes.Structure):
    ]


-# include/cutlass/conv/kernel/implicit_gemm_convolution.h
-# split_k_mode: kNone: 0, kSerial: 1, kParallel: 2, kParallelSerial: 3, kInvalid: 4
-
 def get_conv2d_arguments(epilogue_functor):
    _EpilogueOutputOpParams = epilogue_functor.epilogue_type

@ -224,7 +263,6 @@ def get_conv2d_arguments(epilogue_functor):
 # Reduction
 ############################################################################################

-
 def get_reduction_params(epilogue_functor):
    _EpilogueOutputParams = epilogue_functor.epilogue_type
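
The `GemmCoordBatched_` and `StrideBatched_` structures added in this file are plain `ctypes` layouts. The sketch below is a standalone illustration of how such structures might be populated for a row-major operand. The class names without the trailing underscore, the helper `row_major_stride`, and the assumption that batches are packed contiguously are hypothetical and not part of PyCUTLASS.

```python
import ctypes


class GemmCoordBatched(ctypes.Structure):
    # Same field order as GemmCoordBatched_ in the diff above.
    _fields_ = [
        ("m", ctypes.c_int),
        ("n", ctypes.c_int),
        ("k", ctypes.c_int),
        ("batch_count", ctypes.c_int),
    ]


class StrideBatched(ctypes.Structure):
    # Same field order as StrideBatched_ in the diff above.
    _fields_ = [
        ("major_stride", ctypes.c_int64),
        ("batch_stride", ctypes.c_int64),
    ]


def row_major_stride(rows, cols):
    """Hypothetical helper: strides of a densely packed row-major operand."""
    # The leading dimension of a row-major matrix is its column count.
    # Assumption: consecutive batches are stored back to back.
    return StrideBatched(major_stride=cols, batch_stride=rows * cols)


problem = GemmCoordBatched(m=128, n=256, k=64, batch_count=4)
stride_a = row_major_stride(problem.m, problem.k)
print(stride_a.major_stride, stride_a.batch_stride)
print(ctypes.sizeof(GemmCoordBatched), ctypes.sizeof(StrideBatched))
```

Only the two variable strides are carried in the structure because the unit stride is encoded statically in the C++ type as `cute::Int<1>`.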

--- a/tools/library/scripts/pycutlass/src/pycutlass/compiler.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/compiler.py
@ -29,6 +29,7 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 #
 #################################################################################################
+import pycutlass
 from pycutlass import *
 import cutlass
 from cuda import cuda
@ -54,11 +55,11 @@ class CompilationOptions:
    '''

    #
-    def __init__(self, flags, architectures=[80], include_paths=[]):
+    def __init__(self, flags, arch, include_paths=[]):
        self.includes = []
        self.include_paths = include_paths
        self.flags = flags
-        self.architectures = architectures
+        self.arch = arch

    def get_str(self):
        options = ""
@ -69,13 +70,11 @@ class CompilationOptions:
        for incl in self.include_paths:
            options += ' --include-path=%s' % incl

-        arch_list = "-arch="
-        for idx, arch in enumerate(self.architectures):
-            if idx:
-                arch_list += ","
-            arch_list += "sm_%d" % arch
+        arch_flag = " -arch=sm_%d" % self.arch
+        if self.arch == 90:
+            arch_flag += 'a'
+        options += arch_flag

-        options += " " + arch_list
        return options

    #
@ -88,13 +87,11 @@ class CompilationOptions:
        for incl in self.include_paths:
            options.append(bytes(str.encode('--include-path=%s' % incl)))

-        arch_list = "-arch="
-        for idx, arch in enumerate(self.architectures):
-            if idx:
-                arch_list += ","
-            arch_list += "sm_%d" % arch
+        arch_flag = " -arch=sm_%d" % self.arch
+        if self.arch == 90:
+            arch_flag += 'a'

-        options.append(bytes(str.encode(arch_list)))
+        options.append(bytes(str.encode(arch_flag)))

        return options

@ -138,12 +135,12 @@ class ArtifactManager:
    def nvrtc(self):
        self.backend = "nvrtc"
        self.default_compile_options = [
-            '-std=c++11', '-default-device',
+            '-std=c++17', '-default-device'
        ]
    def nvcc(self):
        self.backend = "nvcc"
        self.default_compile_options = [
-            '-std=c++11',
+            '-std=c++17', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored'
        ]
    def insert_operation(self, op_key, cubin, hostfile, op_name, op_attrs):
        connection = sqlite3.connect("./compiled_cache.db")
@ -158,7 +155,7 @@ class ArtifactManager:
        connection.commit()
        cursor.close()

-    def load_operation(self, op_key):
+    def load_operation(self, op_key, extra_funcs):
        connection = sqlite3.connect("./compiled_cache.db")
        cursor = connection.cursor()
        sqlite_fetch_blob_query = """SELECT * from compiled_operations where op_key = ?"""
@ -194,12 +191,17 @@ class ArtifactManager:
                if isinstance(attr, str):
                    func_name = operation_name + '_' + attr
                    func = getattr(host_lib, func_name)
+
+                    # Set the return type of the function
+                    if attr in extra_funcs and extra_funcs[attr] is not None:
+                        func.restype = extra_funcs[attr]
+
                    compiled_host_fns[attr] = func

            self.compiled_cache_host.insert(key, compiled_host_fns)
        return True

-    def emit_compile_(self, operation_list, compilation_options):
+    def emit_compile_(self, operation_list, compilation_options, requires_nvcc_hostlib_compilation):
        """
        Compile a list of kernels and store them into database
        """
@ -276,6 +278,7 @@ class ArtifactManager:
            err, = nvrtc.nvrtcGetCUBIN(program, cubin_image)
            if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
                raise RuntimeError('NVRTC Error: {}'.format(err))
+
        else:  # with nvcc backend
            # emit code
            tempfile.tempdir = "./"
@ -303,22 +306,34 @@ class ArtifactManager:
            with open(temp_cubin.name, 'rb') as file:
                cubin_image = file.read()

-        # compile the host code
-        options = compilation_options.get()
-        cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
-        for opt in options:
-            opt = opt.decode("utf-8")
-            if opt not in ['-default-device', '-std=c++11', '-Xcicc', '-Xllc'] and '-arch=sm_' not in opt:
-                if '--include-path=' in opt:
-                    cmd += " " + opt.replace('--include-path=', '-I')
-                else:
-                    cmd += " " + opt
+        # Set up the host-side library code
+        if requires_nvcc_hostlib_compilation:
+            cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
+            assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
+            cmd_template = "echo '%s'|${cuda_install_path}/bin/nvcc -x cu -Xcompiler=\"-fpermissive -w -fPIC\" ${options}" % source_buffer_host
+            cmd = SubstituteTemplate(
+                cmd_template,
+                {
+                    "cuda_install_path": cuda_install_path,
+                    "options": compilation_options.get_str()
+                })
+        else:
+            options = compilation_options.get()
+            cmd = "echo '%s'|g++ -x c++ -fpermissive -w -fPIC" % source_buffer_host
+            filtered_opts = ['-default-device', '-Xcicc', '-Xllc', '--expt-relaxed-constexpr', '-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored']
+            for opt in options:
+                opt = opt.decode("utf-8")
+                if opt not in filtered_opts and '-arch=sm_' not in opt:
+                    if '--include-path=' in opt:
+                        cmd += " " + opt.replace('--include-path=', '-I')
+                    else:
+                        cmd += " " + opt

        tempfile.tempdir = "./"
        temp = tempfile.NamedTemporaryFile(
            prefix='host_func', suffix='.so', delete=True)

-        cmd += ' - -shared -o %s' % temp.name
+        cmd += ' - -shared -o %s -lcudart -lcuda' % temp.name
        os.system(cmd)
        host_lib = ctypes.CDLL(temp.name)

@ -333,23 +348,25 @@ class ArtifactManager:
            assert cutlass_path is not None, "Environment variable 'CUTLASS_PATH' is not defined."
            cuda_install_path = os.getenv('CUDA_INSTALL_PATH')
            assert cuda_install_path is not None, "Environment variable 'CUDA_INSTALL_PATH' is not defined."
-            architectures = []
-            for operation in operations:
-                if hasattr(operation, "tile_description"):
-                    cc = operation.arch
-                    if cc not in architectures:
-                        architectures.append(cc)
            include_paths = [
                cuda_install_path + '/include',
                cutlass_path + '/include',
                cutlass_path + '/tools/util/include',
                cutlass_path + '/tools/library/scripts/pycutlass/src/cpp/include'
            ]
+
+            if pycutlass.DEVICE_CC is not None:
+                arch = pycutlass.DEVICE_CC
+            else:
+                # Find the maximum arch tag among the provided operations and compile for that target.
+                # Since we are compiling to .cubin files, only one architecture may be specified.
+                arch = max([op.arch for op in operations])
            compile_options = CompilationOptions(
-                self.default_compile_options, architectures, include_paths)
+                self.default_compile_options, arch, include_paths)
        # save the cubin
        operation_key = []
        operation_list = []
+        requires_nvcc_hostlib_compilation = False
        for operation in operations:
            # step 1: get kernel string as key
            key = operation.rt_module.emit() + operation.procedural_name() + self.backend
@ -357,7 +374,7 @@ class ArtifactManager:
            compiled_kernel = self.compiled_cache_device.at(key)

            if compiled_kernel is None:
-                hit = self.load_operation(key)
+                hit = self.load_operation(key, getattr(operation.rt_module, 'extra_funcs', {}))
                if hit:
                    compiled_kernel = self.compiled_cache_device.at(key)
                    assert compiled_kernel is not None
@ -371,9 +388,18 @@ class ArtifactManager:
            else:
                operation_list.append(operation.rt_module)
                operation_key.append(key)
+
+            # Creating the Params structures for certain 3.0 kernels currently requires CUDA. For these cases, use NVCC to generate
+            # the PyCUTLASS host-side library. Otherwise, g++ will be used.
+            if isinstance(operation, pycutlass.gemm_operation.GemmOperationUniversal) and operation.api == pycutlass.library.ApiVersion.v3x:
+                if self.backend == "nvrtc":
+                    raise RuntimeError('CUTLASS 3 kernels currently require NVCC for compilation.')
+
+                requires_nvcc_hostlib_compilation = True
+
        if len(operation_list) > 0:
            cubin_image, host_lib, host_file = self.emit_compile_(
-                operation_list, compile_options)
+                operation_list, compile_options, requires_nvcc_hostlib_compilation)

            err, module = cuda.cuModuleLoadData(cubin_image)
            if err != cuda.CUresult.CUDA_SUCCESS:
@ -417,9 +443,11 @@ class ArtifactManager:
                op_attr.append(param_size)

                if hasattr(operation, "extra_funcs"):
-                    for suffix in operation.extra_funcs:
+                    for suffix, ret_type in operation.extra_funcs.items():
                        func_name = operation.name() + '_' + suffix
                        func = getattr(host_lib, func_name)
+                        if ret_type is not None:
+                            func.restype = ret_type
                        setattr(operation, suffix, func)
                        compiled_host_fns[suffix] = func
                        op_attr.append(suffix)
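
Note on the `extra_funcs` change above: ctypes treats a foreign function as returning a C `int` unless `restype` is set, so host functions that return a structure by value (such as the `dim3`-returning grid and block shape helpers) need an explicit return type. The sketch below mimics that binding loop without loading a real library. `Dim3`, `FakeFunc`, `FakeHostLib`, and `bind_extra_funcs` are hypothetical stand-ins, not PyCUTLASS names.

```python
import ctypes


class Dim3(ctypes.Structure):
    """Hypothetical stand-in for the dim3_ structure used by PyCUTLASS."""
    _fields_ = [("x", ctypes.c_uint), ("y", ctypes.c_uint), ("z", ctypes.c_uint)]


class FakeFunc:
    """Mimics a ctypes function pointer, whose restype defaults to c_int."""
    def __init__(self):
        self.restype = ctypes.c_int


class FakeHostLib:
    """Hypothetical host library exposing one attribute per exported function."""
    def __init__(self, names):
        for name in names:
            setattr(self, name, FakeFunc())


def bind_extra_funcs(host_lib, kernel_name, extra_funcs):
    """Look up each extra function; override restype only when one is given."""
    bound = {}
    for suffix, ret_type in extra_funcs.items():
        func = getattr(host_lib, kernel_name + "_" + suffix)
        if ret_type is not None:
            func.restype = ret_type
        bound[suffix] = func
    return bound


lib = FakeHostLib(["gemm_get_grid_shape", "gemm_precompute"])
bound = bind_extra_funcs(lib, "gemm", {"get_grid_shape": Dim3, "precompute": None})
```

Mapping a suffix to `None` keeps the ctypes default, which is how the grouped GEMM `precompute` entry is declared in this change.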
--- a/tools/library/scripts/pycutlass/src/pycutlass/conv2d_operation.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/conv2d_operation.py
@ -463,13 +463,14 @@ class Conv2dOperation:
        )

        if self.stride_support == StrideSupport.Unity:
-            configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
+            configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_unity_stride_align${alignment}"
        else:
-            configuration_name = "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"
+            configuration_name = "cutlass_sm${arch}_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}"

        return SubstituteTemplate(
            configuration_name,
            {
+                'arch': str(self.arch),
                'opcode_class': opcode_class_name,
                'extended_name': self.extended_name(),
                'threadblock': threadblock,
@ -509,7 +510,7 @@ class Conv2dOperation:
        intermediate_type = ''

        if self.tile_description.math_instruction.opcode_class == cutlass.OpClass.TensorOp:
-            inst_shape = "%d%d%d" % tuple(
+            inst_shape = "%dx%dx%d" % tuple(
                self.tile_description.math_instruction.instruction_shape)
            if self.tile_description.math_instruction.element_a != self.A.element and \
                    self.tile_description.math_instruction.element_a != self.accumulator_type():
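
Note on the `inst_shape` change above: concatenating the dimensions without a separator can give two different shapes the same name. A minimal sketch of the difference follows. The helper names are hypothetical, and the second shape is chosen only to demonstrate the collision, not because it is a real instruction shape.

```python
def inst_shape_old(shape):
    # Format used before this change: dimensions are concatenated directly.
    return "%d%d%d" % tuple(shape)


def inst_shape_new(shape):
    # Format introduced by this change: dimensions are separated by 'x'.
    return "%dx%dx%d" % tuple(shape)


shape_a = (16, 8, 16)
shape_b = (1, 68, 16)  # hypothetical shape, used only to show the ambiguity

old_names = (inst_shape_old(shape_a), inst_shape_old(shape_b))
new_names = (inst_shape_new(shape_a), inst_shape_new(shape_b))
```

With the old format both shapes map to the same string, while the separated format keeps them distinct.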
--- a/tools/library/scripts/pycutlass/src/pycutlass/epilogue.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/epilogue.py
@ -111,6 +111,7 @@ class LinearCombination(EpilogueFunctorBase):
        self.element_output = element_output
        self.element_accumulator = element_accumulator
        self.element_epilogue = element_epilogue
+        self.epilogue_vector_length = epilogue_vector_length

        self.template_arguments = [
            DataTypeTag[element_output], str(epilogue_vector_length),
--- a/tools/library/scripts/pycutlass/src/pycutlass/gemm_operation.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/gemm_operation.py
@ -36,6 +36,7 @@ import numpy as np
 from typeguard import typechecked
 import cutlass
 from pycutlass import *
+import pycutlass.builder.collective_op_builder as collective_op_builder
 from cuda import cuda


@ -56,9 +57,9 @@ def transpose_layout(layout: cutlass.layout):


 # @typechecked
-class GemmArguments(ArgumentBase):
+class GemmArguments2x(ArgumentBase):
    """
-    Argument wrapper for GEMM. It encodes problem information and 
+    Argument wrapper for GEMM in CUTLASS 2. It encodes problem information and 
    user-provide tensors into the kernel's argument

    :param operation: the GEMM operation to take the argument
@ -148,7 +149,7 @@ class GemmArguments(ArgumentBase):
                self.batch_count = 1
            self.split_k_slices = self.batch_count

-        if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:    
+        if gemm_mode in [cutlass.gemm.Mode.Batched, cutlass.gemm.Mode.Array]:
            if 'batch' in kwargs.keys():
                self.batch_count = kwargs['batch']
            else:
@ -313,6 +314,154 @@ class GemmArguments(ArgumentBase):
        self.device_workspace = device_workspace
        self.launch_config = launch_config

+class GemmArguments3x(GemmArguments2x):
+    """
+    Argument wrapper for GEMM in CUTLASS 3. It encodes problem information and 
+    user-provided tensors into the kernel's arguments
+
+    :param operation: the GEMM operation to take the argument
+    :type operation: :class:`pycutlass.GemmOperationUniversal` |
+     :class:`pycutlass.GemmOperationGrouped`
+    
+    :param problem_size: GEMM problem size gemm(M, N, K)
+    :type problem_size: :class:`cutlass.gemm.GemmCoord`
+
+    :param A: tensor A
+    :type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param B: tensor B
+    :type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param C: tensor C
+    :type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param D: tensor D
+    :type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param gemm_mode: GEMM mode
+    :type gemm_mode: :class:`cutlass.gemm.Mode`
+
+    :param output_op: output operator, optional
+    :type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
+    """
+
+    def __init__(
+        self, operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
+        A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
+        gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
+        if gemm_mode not in [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.Batched]:
+            raise Exception("Unsupported GEMM mode {}.".format(gemm_mode))
+
+        super().__init__(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
+
+    def get_arguments(self):
+        problem_size_ = GemmCoordBatched_(self.problem_size, self.batch_count)
+
+        if self.batch_count > 1:
+            bsA = self.batched_stride_A
+            bsB = self.batched_stride_B
+            bsC = self.batched_stride_C
+            bsD = self.batched_stride_D
+        else:
+            bsA = 0
+            bsB = 0
+            bsC = 0
+            bsD = 0
+        stride_A = StrideBatched_(self.lda, bsA)
+        stride_B = StrideBatched_(self.ldb, bsB)
+        stride_C = StrideBatched_(self.ldc, bsC)
+        stride_D = StrideBatched_(self.ldd, bsD)
+
+        self.arguments = self.operation.argument_type(
+            self.gemm_mode,
+            problem_size_,
+            int(self.ptr_A),
+            stride_A,
+            int(self.ptr_B),
+            stride_B,
+            int(self.ptr_C),
+            stride_C,
+            int(self.ptr_D),
+            stride_D,
+            self.output_op,
+        )
+
+    def initialize(self):
+        # get the host and device workspace
+        device_workspace_size = \
+            self.operation.rt_module.get_device_workspace_size(self)
+
+        if device_workspace_size > 0:
+            self.workspace_buffer = device_mem_alloc(device_workspace_size)
+            workspace_ptr = self.workspace_buffer.ptr
+            err, = cuda.cuMemsetD32(
+                workspace_ptr, 0, device_workspace_size // 4)
+        else:
+            workspace_ptr = None
+
+        device_workspace = 0
+        if (workspace_ptr is not None and 
+            self.gemm_mode == cutlass.gemm.Mode.GemmSplitKParallel):
+            # in GEMM split-K parallel, the D pointer is redirected
+            # to the workspace
+            self.ptr_D = cuda.CUdeviceptr(workspace_ptr)
+        elif (workspace_ptr is not None and 
+            self.gemm_mode == cutlass.gemm.Mode.Gemm):
+            # in GEMM split-K serial
+            device_workspace = workspace_ptr
+
+        self.get_arguments()
+        res_arg = self.operation.rt_module.get_args(
+            ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
+        host_workspace = bytearray(res_arg.contents)
+
+        grid = self.operation.rt_module.get_grid_shape(
+            ctypes.byref(self.arguments), ctypes.c_void_p(int(device_workspace)))
+        block = self.operation.rt_module.get_block_shape()
+
+        device_workspace = None
+
+        self.host_workspace = host_workspace
+        self.device_workspace = device_workspace
+        self.launch_config = LaunchConfiguration([grid.x, grid.y, grid.z],
+                                                 [block.x, block.y, block.z],
+                                                 self.operation.rt_module.shared_memory_capacity)
+
+def GemmArguments(operation: 'GemmOperation', problem_size: 'cutlass.gemm.GemmCoord',
+        A: 'Tensor', B: 'Tensor', C: 'Tensor', D: 'Tensor',
+        gemm_mode: 'cutlass.gemm.Mode'=cutlass.gemm.Mode.Gemm, **kwargs):
+    """
+    Argument wrapper for GEMM in CUTLASS 2 or 3. It returns either 2x arguments
+    or 3x arguments depending on the `api` field of `operation`.
+
+    :param operation: the GEMM operation to take the argument
+    :type operation: :class:`pycutlass.GemmOperationUniversal` |
+     :class:`pycutlass.GemmOperationGrouped`
+    
+    :param problem_size: GEMM problem size gemm(M, N, K)
+    :type problem_size: :class:`cutlass.gemm.GemmCoord`
+
+    :param A: tensor A
+    :type A: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param B: tensor B
+    :type B: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param C: tensor C
+    :type C: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param D: tensor D
+    :type D: cuda.CUdeviceptr | numpy.ndarray | torch.Tensor | cupy.ndarray
+
+    :param gemm_mode: GEMM mode
+    :type gemm_mode: :class:`cutlass.gemm.Mode`
+
+    :param output_op: output operator, optional
+    :type output_op: :class:`pycutlass.LinearCombinationFunctorArguments`
+    """
+    ArgClass = GemmArguments3x if operation.api == ApiVersion.v3x else GemmArguments2x
+    return ArgClass(operation, problem_size, A, B, C, D, gemm_mode, **kwargs)
+

 class GemmGroupedArguments:
    """
@ -383,7 +532,7 @@ class GemmGroupedArguments:
        # process the input arguments
        for idx, problem_size in enumerate(problem_sizes):
            M, N, K = problem_size.m(), problem_size.n(), problem_size.k()
-            temp_argument = GemmArguments(
+            temp_argument = GemmArguments2x(
                operation=operation, 
                problem_size=cutlass.gemm.GemmCoord(M, N, K), 
                A=A[idx], B=B[idx], C=C[idx], D=D[idx],
@ -657,16 +806,164 @@ extern "C" {
            #
            workspace_bytes = 4 * arguments.grid_tiled_shape.x * arguments.grid_tiled_shape.y

-        # TODO: get extra workspace size
-        # see https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm_universal_base.h
        return workspace_bytes


+################################################################################
+# Runtime module for GEMM Universal within CUTLASS 3
+################################################################################
+
+class GemmRTUniversal3x(GemmRTUniversal):
+    """
+    GemmRTUniversal3x manages the CUTLASS 3 runtime components
+    """
+    KernelTemplate = r'''
+
+using Operator = ${operation_name}${operation_suffix};
+extern "C"
+__global__ __launch_bounds__(Operator::MaxThreadsPerBlock, Operator::MinBlocksPerMultiprocessor)
+void ${operation_name}(__grid_constant__ typename Operator::Params const params) {
+  // Dynamic shared memory base pointer
+  extern __shared__ char smem[];
+
+  // Construct the operator and invoke it with the params and shared memory.
+  Operator op;
+  op(params, smem);
+}
+  '''
+    HostTemplate = r'''
+extern "C" {
+  // Get the size of params in bytes
+  int ${operation_name}_get_param_size(){
+    return sizeof(${operation_name}${operation_suffix}::Params);
+  }
+
+  // Get the size of dynamic shared memory in bytes
+  int ${operation_name}_shared_memory_size() {
+    return ${operation_name}${operation_suffix}::SharedStorageSize;
+  }
+
+  using GemmType = ${operation_name}_base;
+
+  // Get the params as byte array
+  char* ${operation_name}_get_params(GemmType::Arguments* argument, int* workspace){
+    GemmType::Params params = GemmType::to_underlying_arguments(*argument, workspace);
+
+    char *bytes = ((char*)(&params));
+    char *output = new char[sizeof(GemmType::Params)];
+    for (unsigned int i = 0; i < sizeof(GemmType::Params); i ++)
+        output[i] = bytes[i];
+
+    return output;
+  }
+
+  // Get the grid shape
+  dim3 ${operation_name}_get_grid_shape(GemmType::Arguments* args, int* workspace) {
+    auto tmp_params = GemmType::to_underlying_arguments(*args, workspace);
+    return GemmType::get_grid_shape(tmp_params);
+  }
+
+  // Get the block shape
+  dim3 ${operation_name}_get_block_shape() {
+    return GemmType::get_block_shape();
+  }
+}
+  '''
+
+    def __init__(self, operation: 'GemmOperation'):
+        super(GemmRTUniversal3x, self).__init__(operation)
+        self.extra_funcs = {
+            'get_grid_shape':  dim3_,
+            'get_block_shape': dim3_
+        }
+        self.emitter = EmitGemmUniversalInstance3x('_type')
+        self.argument_type, self.epilogue_type = get_gemm_arguments_3x(operation.epilogue_functor)
+
+
+class EmitGemmUniversalInstance3x:
+    ''' Responsible for emitting a CUTLASS 3 template definition'''
+
+    def __init__(self, operation_suffix=''):
+        self.operation_suffix = operation_suffix
+        self.includes = [
+            "cutlass/cutlass.h",
+            "cute/tensor.hpp",
+            "cute/atom/mma_atom.hpp",
+            "cutlass/numeric_types.h",
+            "cutlass/gemm/kernel/gemm_universal.hpp",
+            "cutlass/gemm/collective/collective_builder.hpp",
+            "cutlass/epilogue/collective/default_epilogue.hpp",
+            "cutlass/epilogue/thread/linear_combination.h"
+        ]
+        self.gemm_template = """
+using namespace cute;
+
+${collective_op}
+
+using EpilogueOp = cutlass::epilogue::collective::DefaultEpilogue<
+    cutlass::gemm::TagToStrideC_t<${layout_c}>,
+    cutlass::gemm::TagToStrideC_t<${layout_c}>,
+    ${epilogue_functor}
+    >;
+
+// Gemm operator ${operation_name}
+using ${operation_name}_base = cutlass::gemm::kernel::GemmUniversal<
+    Shape<int,int,int,int>,
+    CollectiveOp,
+    EpilogueOp
+>;
+
+// Define named type
+struct ${operation_name}${operation_suffix} : 
+  public ${operation_name}_base { };
+"""
+
+    #
+    def emit(self, operation):
+
+        instance_layout_A, instance_layout_B, instance_layout_C = \
+            (operation.A.layout, operation.B.layout, operation.C.layout)
+
+        # Support built-in epilogue functors or user-defined functions
+        epilogue_functor = operation.epilogue_functor.emit()
+
+        collective_op = collective_op_builder.build(operation)
+
+        values = {
+            'operation_name': operation.procedural_name(),
+            'operation_suffix': self.operation_suffix,
+            'collective_op': collective_op,
+            'element_a': DataTypeTag[operation.A.element],
+            'layout_a': LayoutTag[instance_layout_A],
+            'element_b': DataTypeTag[operation.B.element],
+            'layout_b': LayoutTag[instance_layout_B],
+            'element_c': DataTypeTag[operation.C.element],
+            'layout_c': LayoutTag[instance_layout_C],
+            'epilogue_functor': epilogue_functor,
+            'element_output': DataTypeTag[operation.epilogue_functor.element_output],
+            'element_accumulator': DataTypeTag[operation.accumulator_type()],
+            'element_epilogue': DataTypeTag[operation.epilogue_functor.element_epilogue],
+            'epilogue_vector_length': str(operation.epilogue_functor.epilogue_vector_length),
+            'opcode_class': OpcodeClassTag[operation.tile_description.math_instruction.opcode_class],
+            'arch': "cutlass::arch::Sm%d" % operation.arch,
+            'threadblock_shape_m': str(operation.tile_description.threadblock_shape[0]),
+            'threadblock_shape_n': str(operation.tile_description.threadblock_shape[1]),
+            'threadblock_shape_k': str(operation.tile_description.threadblock_shape[2]),
+            'cluster_shape_m': str(operation.tile_description.cluster_shape[0]),
+            'cluster_shape_n': str(operation.tile_description.cluster_shape[1]),
+            'cluster_shape_k': str(operation.tile_description.cluster_shape[2]),
+            'align_a': str(operation.A.alignment),
+            'align_b': str(operation.B.alignment)
+        }
+
+        values['epilogue_functor'] = operation.epilogue_functor.emit()
+        return SubstituteTemplate(self.gemm_template, values)
+
+
 ###################################################################################################
 # Runtime module for GEMM Grouped
 ###################################################################################################

-
 class GemmRTGrouped(GemmRTbase):
    """
    GemmRTGrouped manages the CUTLASS runtime components
@ -713,7 +1010,7 @@ class GemmRTGrouped(GemmRTbase):

    def __init__(self, operation: 'GemmOperation'):
        super(GemmRTGrouped, self).__init__(operation)
-        self.extra_funcs = ['precompute']
+        self.extra_funcs = {'precompute': None}

        self.emitter = EmitGemmGroupedInstance('_type')
        self.argument_type, self.epilogue_type = get_gemm_grouped_arguments(operation.epilogue_functor)
@ -761,7 +1058,7 @@ class GemmOperationBase:
            self, gemm_kind, arch, tile_description: TileDescription,
            A: TensorDescription, B: TensorDescription, C: TensorDescription, 
            epilogue_functor, 
-            swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
+            swizzling_functor=cutlass.IdentitySwizzle1, api=False, **kwargs):

        #: operation kind
        self.operation_kind: OperationKind = OperationKind.Gemm
@ -772,8 +1069,11 @@ class GemmOperationBase:
        #: gemm kind
        self.gemm_kind: GemmKind = gemm_kind

+        self.api = api
+        self.prefix = "3x" if self.api == ApiVersion.v3x else ""
+
        # use deep copy to avoid overwritting the original TensorDescription
-        if C.layout == cutlass.ColumnMajor:
+        if self.api != ApiVersion.v3x and C.layout == cutlass.ColumnMajor:
            #: Operand A
            self.A: TensorDescription = copy.deepcopy(B)
            #: Operand B
@ -800,7 +1100,6 @@ class GemmOperationBase:
            self.direct_store = kwargs["direct_store"]
        else:
            self.direct_store = False
-        
        if "visitor" in kwargs:
            self.visitor = kwargs["visitor"]
        else:
@ -872,8 +1171,11 @@ class GemmOperationBase:
            math_op_string = math_operations_map[math_op] if math_op in math_operations_map.keys(
            ) else ''

-            inst_shape = "%d%d%d" % tuple(
-                self.tile_description.math_instruction.instruction_shape)
+            if self.tile_description.math_instruction.instruction_shape is not None:
+                inst_shape = "%dx%dx%d" % tuple(
+                    self.tile_description.math_instruction.instruction_shape)
+            else:
+                inst_shape = "Default"
            inst_shape += math_op_string

            if self.tile_description.math_instruction.element_a != self.A.element and \
@ -905,6 +1207,17 @@ class GemmOperationBase:

        return extended_name

+    #
+    def extended_name_3x(self):
+        '''Generates a string representing the MMA atom. Assumes accumulator type is C type.'''
+        extended_name = "{core_name}_{element_a}_{element_b}_{element_acc}_{element_c}".format(
+            element_a = DataTypeNames[self.A.element],
+            element_b = DataTypeNames[self.B.element],
+            element_acc = DataTypeNames[self.tile_description.math_instruction.element_accumulator],
+            element_c = DataTypeNames[self.C.element],
+            core_name = self.core_name())
+        return extended_name
+
    #
    def layout_name(self):
        if self.is_complex() or self.is_planar_complex():
@ -916,25 +1229,49 @@ class GemmOperationBase:
            )
        return "%s%s" % (ShortLayoutTypeNames[self.A.layout], ShortLayoutTypeNames[self.B.layout])

+    # Generates a short string representing the ABC layout tags (e.g. ntn or tnn)
+    def layout_name_3x(self):
+        if self.is_complex() or self.is_planar_complex():
+            return "{}{}{}".format(
+                ShortComplexLayoutNames[(self.A.layout, self.A.complex_transform)], 
+                ShortComplexLayoutNames[(self.B.layout, self.B.complex_transform)],
+                ShortComplexLayoutNames[(self.C.layout, self.C.complex_transform)])
+        else:
+            return "{}{}{}".format(
+                ShortLayoutTypeNames[self.A.layout],
+                ShortLayoutTypeNames[self.B.layout],
+                ShortLayoutTypeNames[self.C.layout])
+
    #
    def procedural_name(self):
        ''' The full procedural name indicates architecture, extended name, tile size, and layout. '''
-        threadblock = self.tile_description.procedural_name()
-
        opcode_class_name = OpcodeClassNames[self.tile_description.math_instruction.opcode_class]
-
-        alignment = max([self.A.alignment, self.B.alignment, self.C.alignment])
-
-        return SubstituteTemplate(
-            "cutlass_${opcode_class}_${extended_name}_${threadblock}_${layout}_align${alignment}",
-            {
-                'opcode_class': opcode_class_name,
-                'extended_name': self.extended_name(),
-                'threadblock': threadblock,
-                'layout': self.layout_name(),
-                'alignment': "%d" % self.A.alignment,
-            }
-        )
+        if self.api == ApiVersion.v3x and self.arch >= 90:
+            kernel_name_template = "cutlass{p}_sm{ar}_{op}_{ex}_{tbm}x{tbn}x{tbk}_{cm}x{cn}x{ck}_{l}_{s}_align{al}"
+            return kernel_name_template.format(
+                p = self.prefix,
+                ar = self.arch,
+                op = opcode_class_name,
+                ex = self.extended_name_3x(),
+                tbm = self.tile_description.threadblock_shape[0],
+                tbn = self.tile_description.threadblock_shape[1],
+                tbk = self.tile_description.threadblock_shape[2],
+                cm = self.tile_description.cluster_shape[0],
+                cn = self.tile_description.cluster_shape[1],
+                ck = self.tile_description.cluster_shape[2],
+                l = self.tile_description.stages,
+                s = self.layout_name_3x(),
+                al = str(self.A.alignment))
+        else:
+            threadblock = self.tile_description.procedural_name()
+            return "cutlass{p}_sm{ar}_{op}_{ex}_{tb}_{l}_align{a}".format(
+                p = self.prefix,
+                ar = self.arch,
+                op = opcode_class_name,
+                ex = self.extended_name(),
+                tb = threadblock,
+                l = self.layout_name(),
+                a = str(self.A.alignment))

    #
    def configuration_name(self):
@ -945,9 +1282,14 @@ class GemmOperationBase:
 class GemmOperationUniversal(GemmOperationBase):
    def __init__(self, arch, tile_description: TileDescription, A: TensorDescription, B, C,
                 epilogue_functor, swizzling_functor=cutlass.IdentitySwizzle1, **kwargs):
+        api = api_version(arch, tile_description.math_instruction.opcode_class, A.element)
        super(GemmOperationUniversal, self).__init__(GemmKind.Universal, arch, tile_description,
-                                                     A, B, C, epilogue_functor, swizzling_functor, **kwargs)
-        self.rt_module = GemmRTUniversal(self)
+                                                     A, B, C, epilogue_functor, swizzling_functor,
+                                                     api=api, **kwargs)
+        if api == ApiVersion.v3x:
+            self.rt_module = GemmRTUniversal3x(self)
+        else:
+            self.rt_module = GemmRTUniversal(self)
        self.argument_type = self.rt_module.argument_type
        self.epilogue_type = self.rt_module.epilogue_type
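
Note on the `GemmArguments` change above: the former class name becomes a factory function that picks the 2.x or 3.x argument wrapper from the operation's API version, so existing call sites keep working. A minimal sketch of that dispatch pattern follows. All names here (`ApiVersion` members, `Arguments2x`, `Arguments3x`, `make_arguments`) are simplified stand-ins rather than the PyCUTLASS classes, and GEMM modes are plain strings for illustration.

```python
import enum
from types import SimpleNamespace


class ApiVersion(enum.Enum):
    # Stand-in enum; only the v3x member is visible in this diff.
    v2x = enum.auto()
    v3x = enum.auto()


class Arguments2x:
    def __init__(self, operation, gemm_mode="gemm"):
        self.operation = operation
        self.gemm_mode = gemm_mode


class Arguments3x(Arguments2x):
    # The 3.x wrapper accepts only a subset of modes and rejects the rest early.
    SUPPORTED_MODES = ("gemm", "batched")

    def __init__(self, operation, gemm_mode="gemm"):
        if gemm_mode not in self.SUPPORTED_MODES:
            raise ValueError("Unsupported GEMM mode {}.".format(gemm_mode))
        super().__init__(operation, gemm_mode)


def make_arguments(operation, gemm_mode="gemm"):
    # Choose the wrapper class from the operation's API version.
    arg_class = Arguments3x if operation.api == ApiVersion.v3x else Arguments2x
    return arg_class(operation, gemm_mode)


op_3x = SimpleNamespace(api=ApiVersion.v3x)
op_2x = SimpleNamespace(api=ApiVersion.v2x)
args_3x = make_arguments(op_3x, "batched")
args_2x = make_arguments(op_2x, "array")
```

One consequence of this design is that code needing a specific wrapper, such as the grouped GEMM path in this change, has to name the 2.x class directly instead of going through the factory.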

--- a/tools/library/scripts/pycutlass/src/pycutlass/library.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/library.py
@ -36,6 +36,7 @@ import re

 import enum
 import cutlass
+import cute

 # The following block implements enum.auto() for Python 3.5 variants that don't include it such
 # as the default 3.5.2 on Ubuntu 16.04.
@ -182,6 +183,30 @@ DataTypeSize = {
    cutlass.dtype.cs64: 128,
 }

+
+class DataTypeSizeBytes:
+    """
+    Static class to mimic the `DataTypeSize` dictionary, but with checks for whether the
+    data type key is less than a full byte or a non-integer number of bytes.
+    """
+    @staticmethod
+    def __class_getitem__(datatype):
+        """
+        Returns the number of bytes in size the data type is. Raises an exception if the data type
+        is either less than a full byte or a non-integer number of bytes in size.
+
+        :param datatype: data type to query
+
+        :return: number of bytes the data type occupies
+        :rtype: int
+        """
+        bits = DataTypeSize[datatype]
+        if bits < 8:
+            raise Exception('Data type {} is less than one byte in size.'.format(datatype))
+        elif bits % 8 != 0:
+            raise Exception('Data type {} is not an integer number of bytes.'.format(datatype))
+        return bits // 8
+
 ###################################################################################################
 #

@ -350,6 +375,12 @@ ShortComplexLayoutNames = {
    (cutlass.RowMajor, cutlass.complex_transform.conj): 'h'
 }

+#
+CuTeLayoutTag = {
+    cute.GMMAMajor.K: 'cute::GMMA::Major::K',
+    cute.GMMAMajor.MN: 'cute::GMMA::Major::MN'
+}
+
 ###################################################################################################

 #
@ -436,7 +467,6 @@ OpcodeClassTag = {

 #

-
 class OperationKind(enum.Enum):
    Gemm = enum_auto()
    RankK = enum_auto()
@ -460,16 +490,19 @@ ArchitectureNames = {
    70: 'volta',
    75: 'turing',
    80: 'ampere',
+    90: 'hopper'
 }

 #
 SharedMemPerCC = {
-    70: 96,  # 96KB of SMEM
-    72: 96,  # 96KB of SMEM
-    75: 64,  # 64KB of SMEM
-    80: 160,  # 164KB of SMEM - 4KB reserved for the driver
-    86: 100,  # 100KB of SMEM
-    87: 160,  # 164KB of SMEM - 4KB reserved for the driver
+    70: 96 << 10,   # 96KB of SMEM
+    72: 96 << 10,   # 96KB of SMEM
+    75: 64 << 10,   # 64KB of SMEM
+    80: 160 << 10,  # 164KB of SMEM - 4KB reserved for the driver
+    86: 100 << 10,  # 100KB of SMEM
+    87: 160 << 10,  # 164KB of SMEM - 4KB reserved for the driver
+    89: 100 << 10,  # 100KB of SMEM
+    90: 227 << 10,  # 228KB of SMEM - 1KB reserved for the driver
 }

 ###################################################################################################
@ -646,7 +679,21 @@ ConvModeTag = {


 class MathInstruction:
+    """
+    Description of the lowest-level matrix-multiply-accumulate operation to be used in a kernel
+    """
    def __init__(self, instruction_shape, element_a, element_b, element_accumulator, opcode_class=cutlass.OpClass.Simt, math_operation=MathOperation.multiply_add):
+        """
+        :param instruction_shape: size of the [M, N, K] dimensions of the instruction
+        :type instruction_shape: list or tuple
+        :param element_a: data type of operand A
+        :param element_b: data type of operand B
+        :param element_accumulator: data type used in accumulation
+        :param opcode_class: higher-level class of the instruction (e.g., SIMT or Tensor Core)
+        :type opcode_class: cutlass.OpClass
+        :param math_operation: the type of low-level operation to be performed (e.g., multiply accumulate)
+        :type math_operation: MathOperation
+        """
        self.instruction_shape = instruction_shape
        self.element_a = element_a
        self.element_b = element_b
@ -658,24 +705,65 @@ class MathInstruction:


 class TileDescription:
-
-    def __init__(self, threadblock_shape, stages, warp_count, math_instruction):
+    """
+    Description of a tile of computation to be performed in the kernel, encompassing threadblock, cluster, and warp shapes,
+    stage count, and math instruction specification
+    """
+    def __init__(self, threadblock_shape, stages, warp_count, math_instruction, cluster_shape=[1, 1, 1], persistent=False):
+        """
+        :param threadblock_shape: shape of a threadblock tile
+        :type threadblock_shape: list or tuple
+        :param stages: number of pipeline stages in the operation. For SM90 kernels, this can be set to `None` and the maximum
+                       number of stages that can be supported for an operation on a given architecture will be computed at a later time
+        :type stages: int or None
+        :param warp_count: number of warps in each [M, N, K] dimension of a threadblock tile
+        :type warp_count: list, tuple, or None
+        :param math_instruction: specification of the instruction type and shape to be performed and the types of its operands
+        :type math_instruction: MathInstruction
+        :param cluster_shape: number of threadblocks in the [X, Y, Z] dimensions of a threadblock cluster
+        :param persistent: whether the kernel uses persistent warp-specialized threadblocks (only available for SM90+)
+        :type persistent: bool
+        """
        self.threadblock_shape = threadblock_shape
-
-        #: number of pipeline stages
+        self.cluster_shape = cluster_shape
+        self.persistent: bool = persistent
        self.stages: int = stages

-        #: number of warps along x, y, z directions
-        self.warp_count: list[int] = warp_count
        self.math_instruction = math_instruction

-        #: number threads per threadblock
-        self.num_threads: int = 32
-        for cnt in self.warp_count:
-            self.num_threads *= cnt
+        # Number of warps along x, y, z directions
+        self.warp_count = warp_count
+
+    @property
+    def num_threads(self):
+        """
+        Returns the number of threads in the threadblock
+
+        :return: number of threads in the threadblock
+        :rtype: int or None (if warp count is None)
+        """
+        if self.warp_count is not None:
+            threads = 32
+            for cnt in self.warp_count:
+                threads *= cnt
+            return threads
+        return None

    def procedural_name(self):
-        return "%dx%d_%dx%d" % (self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], self.stages)
+        """
+        Returns a name identifying the tile description
+
+        :return: name identifying the tile description
+        :rtype: str
+        """
+        emit_stages = 0 if self.stages is None else self.stages
+        name = "%dx%dx%d_%dx%d_%dx%d" % (
+            self.cluster_shape[0], self.cluster_shape[1], self.cluster_shape[2],
+            self.threadblock_shape[0], self.threadblock_shape[1], self.threadblock_shape[2], emit_stages)
+
+        if self.persistent:
+            name += '_persistent'
+        return name

 #

@ -715,30 +803,68 @@ class TriangularTensorDescription:
 ###################################################################################################

 #
+def CalculateSmemUsagePerStage(operation):
+    """
+    Returns the amount of shared memory in bytes consumed in a single stage of a kernel.

+    :param operation: operation for which the per-stage shared memory usage should be computed. If stages
+                      are set via the `operation.tile_description.stages` parameter, this setting is ignored
+                      in the present calculation
+    :type operation: pycutlass.Operation

-def CalculateSmemUsage(operation):
-    cta_shape = operation.tile_description.threadblock_shape
-    stages = operation.tile_description.stages
+    :return: number of bytes of shared memory consumed by a single stage
+    :rtype: int
+    """
+    m, n, k = operation.tile_description.threadblock_shape

-    if operation.operation_kind == OperationKind.Gemm and operation.gemm_kind == GemmKind.Sparse:
-        # Elements represented by 8 bits of metadata (based on 4:8, 2:4 or 1:2 sparsity)
-        if DataTypeSize[operation.A.element] == 32:
-            elements_per_8b_md = 2
-        elif DataTypeSize[operation.A.element] == 4:
-            elements_per_8b_md = 8
-        else:
-            elements_per_8b_md = 4
-
-        smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * (cta_shape[2] // 2) // 8 + \
-            DataTypeSize[operation.B.element] * cta_shape[1] * cta_shape[2] // 8 + \
-            cta_shape[0] * (cta_shape[2] // 2) // elements_per_8b_md
+    if operation.operation_kind == OperationKind.Gemm:
+        stage_barrier_bytes = 32
+        return (DataTypeSize[operation.A.element] * m * k // 8) + \
+                         (DataTypeSize[operation.B.element] * k * n // 8) + stage_barrier_bytes
    else:
-        # Few BLAS3 operations only have A tensor
-        smem_per_stage = DataTypeSize[operation.A.element] * cta_shape[0] * cta_shape[2] // 8 + \
-            DataTypeSize[operation.A.element] * \
-            cta_shape[1] * cta_shape[2] // 8
+        raise Exception('Unsupported operation kind {}.'.format(operation.operation_kind))
+
+
+#
+def CalculateSmemUsage(operation):
+    """
+    Returns the amount of shared memory in bytes consumed by a kernel.
+
+    :param operation: operation for which the total shared memory usage should be computed.
+                      The per-stage usage is multiplied by `operation.tile_description.stages`
+    :type operation: pycutlass.Operation
+
+    :return: number of bytes of shared memory consumed by the kernel
+    :rtype: int
+    """
+    return operation.tile_description.stages * CalculateSmemUsagePerStage(operation)
+
+
+class ApiVersion(enum.Enum):
+    """
+    Differentiate between CUTLASS 2.x and 3.x API versions
+    """
+    v2x = enum_auto()
+    v3x = enum_auto()
+
+
+def api_version(arch, opclass, datatype):
+    """
+    Returns whether the architecture, opcode class, and datatype in question require using CUTLASS 2.x
+    or 3.x for code emission.
+
+    :param arch: compute capability of device on which to run
+    :type arch: int
+    :param opclass: class of the operation being performed
+    :type opclass: cutlass.OpClass
+    :param datatype: data type to be used in operation (assumes that ElementA and ElementB are the same)
+
+    :return: API version to be used in code emission
+    :rtype: ApiVersion
+    """
+    if arch >= 90 and opclass == cutlass.OpClass.TensorOp and (datatype != cutlass.float64):
+        return ApiVersion.v3x
+    else:
+        return ApiVersion.v2x

-    smem_usage = smem_per_stage * stages
-    return (smem_usage >> 10)
 ###################################################################################################
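The shared-memory arithmetic introduced in `CalculateSmemUsagePerStage` and `CalculateSmemUsage` can be illustrated without importing PyCUTLASS. The following is a minimal sketch, not the library's code: the function names and the `bits_a`/`bits_b` parameters are hypothetical stand-ins for the `DataTypeSize[...]` lookups, and only the formula and the 32-byte barrier allowance are taken from the diff above.

```python
def smem_bytes_per_stage(threadblock_shape, bits_a, bits_b, stage_barrier_bytes=32):
    """Bytes of shared memory for one pipeline stage of a GEMM tile.

    bits_a / bits_b are element widths in bits (hypothetical stand-ins for
    DataTypeSize[operation.A.element] and DataTypeSize[operation.B.element]).
    """
    m, n, k = threadblock_shape
    a_bytes = bits_a * m * k // 8   # A tile is m x k elements
    b_bytes = bits_b * k * n // 8   # B tile is k x n elements
    return a_bytes + b_bytes + stage_barrier_bytes


def smem_bytes(threadblock_shape, stages, bits_a, bits_b):
    """Total bytes: per-stage usage multiplied by the stage count."""
    return stages * smem_bytes_per_stage(threadblock_shape, bits_a, bits_b)


# A 128x128x32 tile with 16-bit A and B operands and 3 stages.
per_stage = smem_bytes_per_stage((128, 128, 32), bits_a=16, bits_b=16)
total = smem_bytes((128, 128, 32), stages=3, bits_a=16, bits_b=16)
print(per_stage, total)
```

Note that the result is in bytes, whereas the removed `CalculateSmemUsage` returned kibibytes (`smem_usage >> 10`), so callers comparing against a capacity limit need to use matching units.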
--- a/tools/library/scripts/pycutlass/src/pycutlass/operation.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/operation.py
@ -32,6 +32,12 @@

 import ctypes
 from cuda import cuda
+from pycutlass.utils.device import device_cc
+
+from cuda import __version__ as __cuda_version__
+_version_splits = [int(x) for x in __cuda_version__.split('.')]
+supports_cluster_launch = device_cc() >= 90 and (_version_splits[0] > 11 or (_version_splits[0] == 11 and _version_splits[1] >= 8))
+

 ################################################################################
 #
@ -90,21 +96,58 @@ class ExecutableOperation:
    def initialize(self, host_workspace, device_workspace, launch_config, arguments, stream=cuda.CUstream(0)):
        raise NotImplementedError()

+
    #
-    def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
+    def run_with_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
+        if hasattr(self.operation, 'tile_description') and hasattr(self.operation.tile_description, 'cluster_shape'):
+            attr = cuda.CUlaunchAttribute()
+            attr.value.clusterDim.x, attr.value.clusterDim.y, attr.value.clusterDim.z = self.operation.tile_description.cluster_shape
+            attr.id = cuda.CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION
+            attrs = [attr]

-        cArg = (ctypes.c_char * len(host_workspace)
-                ).from_buffer(host_workspace)
-        packed = (ctypes.c_void_p * 1)()
-        packed[0] = ctypes.addressof(cArg)
+            # Allow for non-portable cluster sizes
+            err, = cuda.cuFuncSetAttribute(
+                self.kernel, cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NON_PORTABLE_CLUSTER_SIZE_ALLOWED, 1)
+            if err != cuda.CUresult.CUDA_SUCCESS:
+                return err
+        else:
+            attrs = []

+        config = cuda.CUlaunchConfig()
+        config.gridDimX, config.gridDimY, config.gridDimZ = launch_config.grid
+        config.blockDimX, config.blockDimY, config.blockDimZ = launch_config.block
+        config.blockDimZ = launch_config.block[2]
+        config.sharedMemBytes = launch_config.shared_memory_capacity
+        config.hStream = stream
+        config.attrs = attrs
+        config.numAttrs = len(attrs)
+
+        err, = cuda.cuLaunchKernelEx(config, f=self.kernel, kernelParams=kernel_params, extra=0)
+        return err
+
+
+    #
+    def run_without_clusters(self, launch_config, kernel_params, stream=cuda.CUstream(0)):
        err, = cuda.cuLaunchKernel(
            self.kernel,
            launch_config.grid[0], launch_config.grid[1], launch_config.grid[2],
            launch_config.block[0], launch_config.block[1], launch_config.block[2],
            launch_config.shared_memory_capacity,
            stream,
-            packed,
+            kernel_params,
            0)

        return err
+
+
+    #
+    def run(self, host_workspace, device_workspace, launch_config, stream=cuda.CUstream(0)):
+        cArg = (ctypes.c_char * len(host_workspace)
+                ).from_buffer(host_workspace)
+        packed = (ctypes.c_void_p * 1)()
+        packed[0] = ctypes.addressof(cArg)
+
+        if supports_cluster_launch:
+            return self.run_with_clusters(launch_config, packed, stream)
+        else:
+            return self.run_without_clusters(launch_config, packed, stream)
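The `supports_cluster_launch` gate above combines a compute-capability check with a version check on the `cuda` Python package. A minimal standalone sketch of the same condition follows. The helper names are hypothetical, and the version string is passed in as an argument instead of being read from `cuda.__version__`; comparing `(major, minor)` tuples is assumed to be equivalent to the explicit major/minor test in the diff.

```python
def parse_major_minor(version):
    """Return (major, minor) from a dotted version string such as '11.8.0'."""
    parts = version.split('.')
    return int(parts[0]), int(parts[1])


def cluster_launch_supported(compute_capability, cuda_python_version):
    """True when the device is SM90 or newer and the package is at least 11.8."""
    # Tuple comparison is lexicographic, so (12, 0) >= (11, 8) and (11, 7) < (11, 8).
    return compute_capability >= 90 and parse_major_minor(cuda_python_version) >= (11, 8)


for cc, ver in [(90, "12.0.0"), (90, "11.8.0"), (90, "11.7.1"), (80, "12.0.0")]:
    print(cc, ver, cluster_launch_supported(cc, ver))
```

Parsing only the first two components avoids failing on a non-numeric third component, which the list comprehension in the diff would reject.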
--- a/tools/library/scripts/pycutlass/src/pycutlass/parser.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/parser.py
@ -543,7 +543,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
        self.elements_per_access = elements_per_access
        self.element_compute = element_compute
        self.element_output = element_output
-        # TODO: deprecate this
        self.elementwise_functor = elementwise_functor
        pass
    
@ -554,11 +553,8 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
        #
        tree = function.epilogue_tree
        self.tree = tree
-        # self.tree.show() # for debug
        function.pass_binary_2_unary(self.tree, self.tree.root)
-        # self.tree.show() # for debug
        function.pass_inject_reduction(self.tree, self.tree.root)
-        # self.tree.show() # for debug
        function.pass_inject_epilogue_op(self.tree,self.tree.root)

        visitor = self.tree.get_node(self.tree.root).data.epilogue_node
@ -575,7 +571,6 @@ using ${operation_name}_EpilogueVisitor = cutlass::epilogue::threadblock::Epilog
                    if input_key == "accum":
                        continue
                    if function.input_args[input_key][0] == "scalar": 
-                        # _kwargs[input_key] = kwargs[input_key]
                        continue
                    # tensor input
                    else:
--- a/tools/library/scripts/pycutlass/src/pycutlass/test/conv2d_testbed.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/test/conv2d_testbed.py
@ -265,15 +265,6 @@ class Conv2dLauncher:
        
        flops_total_ = flops_mainloop_ + flops_epilogue_
        
-        # TODO complex-value support
-        # switch (operation_desc.tile_description.math_instruction.math_operation) {
-        # case library::MathOperationID::kMultiplyAddComplex:
-        #     flops_total_ *=4;
-        #     break;
-
-        # default: break;
-        # }
-
        return flops_total_


@ -511,9 +502,8 @@ class Conv2dLauncher:
 # (conv_blacklist_sizes)
 ############################################################################################################

-def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False):  # TODO: conv_test_sizes and conv_blacklist_sizes
+def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleaved=False):
    passed = True
-
    #
    # Testbed object
    #
@ -529,8 +519,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
    # Vector of conv2d problem sizes to avoid duplicate runs
    conv_tested_sizes = []

-    # TODO: include resnet 50 sizes, user sepecified sizes, and rigorous sizes
-    
    # Flatten 2D problem_vectors into a 1D problem sizes
    problem_sizes = conv_problems.conv2d_default_sizes
    
@ -539,7 +527,6 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave
    # Sweep conv2d problem sizes (split-k-mode=kSerial, split-k-slices=1, alpha=1.0, beta=0.0)
    for conv_problem in problem_sizes:

-        # TODO: skip blacklist problem sizes
        if conv_problem in conv_tested_sizes:
            continue
            
@ -585,9 +572,8 @@ def test_all_conv2d(operation: Conv2dOperation, conv_test_sizes = [], interleave

        passed = testbed.run(conv_problem)

-        # if not passed: return False
-
-        # TODO: If CUTLASS_UNIT_TEST_PROBLEM_COUNT is set reduce the the number of tested problem counts
+        if not passed:
+            return False

    if interleaved:
        return True
--- a/tools/library/scripts/pycutlass/src/pycutlass/test/gemm_grouped_testbed.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/test/gemm_grouped_testbed.py
@ -184,7 +184,7 @@ class TestbedGrouped:
        arguments.sync()

        #
-        # Reference check - TODO: support caching results
+        # Reference check
        #
        alpha = self.compute_type(alpha).value()
        beta = self.compute_type(beta).value()
--- a/tools/library/scripts/pycutlass/src/pycutlass/test/gemm_testbed.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/test/gemm_testbed.py
@ -33,6 +33,7 @@
 from time import sleep
 import pycutlass
 from pycutlass import *
+import pycutlass.utils.datatypes as datatypes
 import cutlass
 from cuda import cudart
 from cuda import cuda
@ -52,16 +53,22 @@ def transpose(layout):
        return cutlass.ColumnMajorInterleaved32


-def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout):
+def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: cutlass.layout, batch_offset: int = 0):
    ptr = tensor.__array_interface__['data'][0]
    if operand == "a":
        tensor_coord = problem_size.mk()
+        batch_stride = problem_size.m() * problem_size.k()
    elif operand == "b":
        tensor_coord = problem_size.kn()
+        batch_stride = problem_size.k() * problem_size.n()
    elif operand in ["c", "d"]:
        tensor_coord = problem_size.mn()
+        batch_stride = problem_size.m() * problem_size.n()
    else:
-        raise ValueError("unknonw operand: " + operand)
+        raise ValueError("Unknown operand: " + operand)
+
+    elt_size = DataTypeSizeBytes[datatypes.to_cutlass(tensor.dtype)]
+    ptr += batch_offset * batch_stride * elt_size

    if layout == cutlass.RowMajor:
        layout = cutlass.RowMajor.packed(tensor_coord)
@ -96,8 +103,8 @@ def getTensorRef(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, opera
    return getattr(cutlass, ref_name)(ptr, layout)


-def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str):
-    tensor_ref = getTensorRef(tensor, problem_size, operand, layout)
+def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, operand: str, layout: str, batch_offset: int = 0):
+    tensor_ref = getTensorRef(tensor, problem_size, operand, layout, batch_offset)

    if operand == "a":
        tensor_coord = problem_size.mk()
@ -106,7 +113,7 @@ def getTensorView(tensor: np.ndarray, problem_size: cutlass.gemm.GemmCoord, oper
    elif operand in ["c", "d"]:
        tensor_coord = problem_size.mn()
    else:
-        raise ValueError("unknonw operand: " + operand)
+        raise ValueError("Unknown operand: " + operand)

    if layout == cutlass.RowMajor:
        layout_tag = "RowMajor"
@ -168,7 +175,12 @@ class GemmUniversalLauncher:
        # Compile the operator
        #

-        pycutlass.compiler.add_module([operation, self.reduction_operation])
+        op_list = [operation]
+        if operation.arch < 90:
+            # Split K via Python is currently only supported for pre-SM90 kernels
+            op_list.append(self.reduction_operation)
+
+        pycutlass.compiler.add_module(op_list)

        self.operation = operation

@ -206,8 +218,10 @@ class GemmUniversalLauncher:
    def print_problem_size(self, p, mode, batch_count):
        if mode == cutlass.gemm.Mode.Gemm:
            mode = "Gemm"
+        elif mode == cutlass.gemm.Mode.Batched:
+            mode = "GemmBatched"
        elif mode == cutlass.gemm.Mode.GemmSplitKParallel:
-            mode = "GemmSplitKParalel"
+            mode = "GemmSplitKParallel"
        problem_size = "problem: %d, %d, %d\n batch_count: %d\n mode: %s" % (
            p.m(), p.n(), p.k(), batch_count, mode)
        print(problem_size)
@ -251,8 +265,7 @@ class GemmUniversalLauncher:
            tensor_ref_B, reordered_tensor_ref_B, problem_size)
        return reordered_tensor_B

-    def host_reference(self, problem_size, tensor_A, tensor_B, tensor_C, alpha, beta):
-        # TODO
+    def host_reference(self, problem_size, batch_count, tensor_A, tensor_B, tensor_C, alpha, beta):
        tensor_D_ref = np.ones_like(tensor_C)
        alpha = self.numpy_type(self.compute_type)(alpha)
        beta = self.numpy_type(self.compute_type)(beta)
@ -262,42 +275,46 @@ class GemmUniversalLauncher:
        beta = self.compute_type(beta).value()
        init_acc = self.accumulator_type(init_acc).value()

-        if self.operation.switched:
-            tensor_ref_A = getTensorRef(
-                tensor_A, problem_size, "a", transpose(self.operation.B.layout))
-            tensor_ref_B = getTensorRef(
-                tensor_B, problem_size, "b", transpose(self.operation.A.layout))
-            tensor_ref_C = getTensorRef(
-                tensor_C, problem_size, "c", transpose(self.operation.C.layout))
-            tensor_ref_D_ref = getTensorRef(
-                tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout))
-        else:
-            tensor_ref_A = getTensorRef(
-                tensor_A, problem_size, "a", self.operation.A.layout)
-            tensor_ref_B = getTensorRef(
-                tensor_B, problem_size, "b", self.operation.B.layout)
-            tensor_ref_C = getTensorRef(
-                tensor_C, problem_size, "c", self.operation.C.layout)
-            tensor_ref_D_ref = getTensorRef(
-                tensor_D_ref, problem_size, "d", self.operation.C.layout)
+        for i in range(batch_count):
+            if self.operation.switched:
+                tensor_ref_A = getTensorRef(
+                    tensor_A, problem_size, "a", transpose(self.operation.B.layout), batch_offset=i)
+                tensor_ref_B = getTensorRef(
+                    tensor_B, problem_size, "b", transpose(self.operation.A.layout), batch_offset=i)
+                tensor_ref_C = getTensorRef(
+                    tensor_C, problem_size, "c", transpose(self.operation.C.layout), batch_offset=i)
+                tensor_ref_D_ref = getTensorRef(
+                    tensor_D_ref, problem_size, "d", transpose(self.operation.C.layout), batch_offset=i)
+            else:
+                tensor_ref_A = getTensorRef(
+                    tensor_A, problem_size, "a", self.operation.A.layout, batch_offset=i)
+                tensor_ref_B = getTensorRef(
+                    tensor_B, problem_size, "b", self.operation.B.layout, batch_offset=i)
+                tensor_ref_C = getTensorRef(
+                    tensor_C, problem_size, "c", self.operation.C.layout, batch_offset=i)
+                tensor_ref_D_ref = getTensorRef(
+                    tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)

-        if self.math_operation in [MathOperation.multiply_add_saturate]:
-            cutlass.test.gemm.host.gemm_saturate(
-                problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
-        else:
-            cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
-                                        tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
+            if self.math_operation in [MathOperation.multiply_add_saturate]:
+                cutlass.test.gemm.host.gemm_saturate(
+                    problem_size, alpha, tensor_ref_A, tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)
+            else:
+                cutlass.test.gemm.host.gemm(problem_size, alpha, tensor_ref_A,
+                                            tensor_ref_B, beta, tensor_ref_C, tensor_ref_D_ref, init_acc)

        return tensor_D_ref

-    def equal(self, tensor_D, tensor_D_ref, problem_size):
+    def equal(self, tensor_D, tensor_D_ref, problem_size, batch_count):
+        for i in range(batch_count):
+            tensor_view_D = getTensorView(
+                tensor_D, problem_size, "d", self.operation.C.layout, batch_offset=i)
+            tensor_view_D_ref = getTensorView(
+                tensor_D_ref, problem_size, "d", self.operation.C.layout, batch_offset=i)

-        tensor_view_D = getTensorView(
-            tensor_D, problem_size, "d", self.operation.C.layout)
-        tensor_view_D_ref = getTensorView(
-            tensor_D_ref, problem_size, "d", self.operation.C.layout)
+            if not cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref):
+                return False

-        return cutlass.test.gemm.host.equals(tensor_view_D, tensor_view_D_ref)
+        return True

    def bytes(self, problem_size, batch_count=1, alpha=1.0, beta=0.0):
        m = problem_size.m()
@ -321,9 +338,8 @@ class GemmUniversalLauncher:
        n = problem_size.n()
        k = problem_size.k()

-        flops_ = (m * n * k + m * n) * 2 * batch_count
+        flops_ = (m * n * k) * 2 * batch_count

-        # TODO: complex
        return flops_

    def run_cutlass_profiler(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):
@ -368,21 +384,25 @@ class GemmUniversalLauncher:

        return runtime

-    def run(self, mode, problem_size, batch_count=1, alpha=1.0, beta=0.0):
-
+    def run(self, mode, problem_size, batch_count=1, split_k_slices=1, alpha=1.0, beta=0.0):
        assert get_allocated_size(
        ) == 0, "%d byte of pool memory is not released in previous run" % get_allocated_size()

        np.random.seed(self.seed)

+        # Assign an actual batch count in cases where we are not running in batched mode.
+        # This is to differentiate between the number of split K slices and the batch count,
+        # which are overloaded within the single `batch_count` variable.
+        true_batch_count = batch_count if mode == cutlass.gemm.Mode.Batched else 1
+
        tensor_A = self.uniform_init(
-            size=(problem_size.m() * problem_size.k(),), dtype=self.dtype_A)
+            size=(problem_size.m() * problem_size.k() * true_batch_count,), dtype=self.dtype_A)
        tensor_B = self.uniform_init(
-            size=(problem_size.n() * problem_size.k(),), dtype=self.dtype_B)
+            size=(problem_size.n() * problem_size.k() * true_batch_count,), dtype=self.dtype_B)
        tensor_C = self.uniform_init(
-            size=(problem_size.m() * problem_size.n(),), dtype=self.dtype_C)
+            size=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_C)
        tensor_D = np.zeros(
-            shape=(problem_size.m() * problem_size.n(),), dtype=self.dtype_D)
+            shape=(problem_size.m() * problem_size.n() * true_batch_count,), dtype=self.dtype_D)

        #
        # Launch kernel
@ -392,14 +412,14 @@ class GemmUniversalLauncher:
            operation=self.operation, problem_size=problem_size,
            A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
            output_op=self.operation.epilogue_type(alpha, beta),
-            gemm_mode=mode, split_k_slices=batch_count
+            gemm_mode=mode, split_k_slices=split_k_slices, batch=batch_count
        )

        if mode == cutlass.gemm.Mode.GemmSplitKParallel:
            reduction_arguments = ReductionArguments(
                self.reduction_operation, problem_size=[
                    problem_size.m(), problem_size.n()],
-                partitions=batch_count,
+                partitions=split_k_slices,
                workspace=arguments.ptr_D,
                destination=tensor_D,
                source=tensor_C,
@ -419,8 +439,8 @@ class GemmUniversalLauncher:
            else:
                arguments.sync()
            tensor_D_ref = self.host_reference(
-                problem_size, tensor_A, tensor_B, tensor_C, alpha, beta)
-            passed = self.equal(tensor_D, tensor_D_ref, problem_size)
+                problem_size, true_batch_count, tensor_A, tensor_B, tensor_C, alpha, beta)
+            passed = self.equal(tensor_D, tensor_D_ref, problem_size, true_batch_count)

            try:
                assert passed
@ -494,7 +514,7 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
        if operation.A.layout in [cutlass.ColumnMajorInterleaved32, cutlass.RowMajorInterleaved32]:
            interleavedk = 32
        else:
-            raise ValueError("unknonw layout")
+            raise ValueError("Unknown layout")

    if testcase == "interleaved":
        modes = [cutlass.gemm.Mode.Gemm, ]
@ -515,14 +535,22 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):
        problem_beta = [0.0]
        batch_counts = [1, ]
    else:  # universal
-        modes = [cutlass.gemm.Mode.Gemm, cutlass.gemm.Mode.GemmSplitKParallel]
+        modes = [cutlass.gemm.Mode.Gemm]
+        batch_counts = [1, 2, 3, 5, 7]
+        if operation.arch < 90:
+            # Split K kernels via Python are currently only supported pre-SM90
+            modes.append(cutlass.gemm.Mode.GemmSplitKParallel)
+
        problem_size_m = [alignment_m, 512 - 3 * alignment_m]
        problem_size_n = [alignment_n, 512 - 2 * alignment_n]
+        if operation.tile_description.stages is None:
+            stages_for_k_calc = 7
+        else:
+            stages_for_k_calc = operation.tile_description.stages
        problem_size_k = [
            alignment_k,
-            threadblock_k * operation.tile_description.stages - alignment_k,
-            threadblock_k * operation.tile_description.stages * 3 - alignment_k]
-        batch_counts = [1, 2, 3, 5, 7]
+            threadblock_k * stages_for_k_calc - alignment_k,
+            threadblock_k * stages_for_k_calc * 3 - alignment_k]
        problem_alpha = [1.0]
        problem_beta = [2.0]

@ -543,8 +571,17 @@ def test_all_gemm(operation: 'GemmOperationUniversal', testcase="universal"):

                                problem_size = cutlass.gemm.GemmCoord(m, n, k)

+                                if operation.arch < 90:
+                                    split_k_slices = batch_count
+                                else:
+                                    split_k_slices = 1
+
+                                overridden_mode = mode
+                                if mode == cutlass.gemm.Mode.Gemm and batch_count > 1:
+                                    overridden_mode = cutlass.gemm.Mode.Batched
+
                                passed = testbed.run(
-                                    mode, problem_size, batch_count, alpha, beta)
+                                    overridden_mode, problem_size, batch_count, split_k_slices, alpha, beta)

                                err, = cudart.cudaDeviceSynchronize()
                                if err != cuda.CUresult.CUDA_SUCCESS:
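The batched reference path in `getTensorRef` advances the base pointer by `batch_offset * batch_stride * elt_size`, where the stride depends on the operand. A minimal sketch of that offset computation is shown below. The function name and its plain-integer parameters are hypothetical; in the testbed the extents come from `problem_size` and the element size from `DataTypeSizeBytes`.

```python
def batch_byte_offset(m, n, k, operand, batch_offset, elt_size_bytes):
    """Byte offset of batch `batch_offset` for a densely packed operand."""
    # Elements per batch for each operand, matching the strides in getTensorRef.
    strides = {"a": m * k, "b": k * n, "c": m * n, "d": m * n}
    if operand not in strides:
        raise ValueError("Unknown operand: " + operand)
    return batch_offset * strides[operand] * elt_size_bytes


# Third batch (index 2) of a 128x32 half-precision A operand.
offset_a = batch_byte_offset(128, 64, 32, "a", batch_offset=2, elt_size_bytes=2)
# Fourth batch (index 3) of a 128x64 single-precision D operand.
offset_d = batch_byte_offset(128, 64, 32, "d", batch_offset=3, elt_size_bytes=4)
print(offset_a, offset_d)
```

This assumes the batches are stored contiguously with no padding, which matches how `run` allocates each tensor as a flat array scaled by `true_batch_count`.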
--- a/tools/library/scripts/pycutlass/src/pycutlass/test/utils.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/test/utils.py
@ -0,0 +1,109 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+import cutlass
+from pycutlass import library, SubstituteTemplate
+
+
+class Layout:
+    """
+    Utility class to map transpose and non-transpose terminology to row- and column-major terminology
+    """
+    T = cutlass.RowMajor
+    N = cutlass.ColumnMajor
+
+
+class LayoutCombination:
+    """
+    Utility class defining all combinations of row- and column-major layouts for operands to a GEMM
+    """
+    NNN = (Layout.N, Layout.N, Layout.N)
+    NNT = (Layout.N, Layout.N, Layout.T)
+    NTN = (Layout.N, Layout.T, Layout.N)
+    NTT = (Layout.N, Layout.T, Layout.T)
+    TNN = (Layout.T, Layout.N, Layout.N)
+    TNT = (Layout.T, Layout.N, Layout.T)
+    TTN = (Layout.T, Layout.T, Layout.N)
+    TTT = (Layout.T, Layout.T, Layout.T)
+
+
+def get_name(layouts, alignments, element_output,
+             element_accumulator, element_epilogue, cluster_shape,
+             threadblock_shape, stages, element_a, element_b, arch, opclass, suffix=""):
+    """
+    Generates a procedural name for a test case.
+
+    :param layouts: indexable container of layouts of A, B, and C operands
+    :param alignments: indexable container of alignments of A, B, and C operands
+    :param element_output: data type of the output element
+    :param element_accumulator: data type used in accumulation
+    :param element_epilogue: data type used in computing the epilogue
+    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
+    :param threadblock_shape: indexable container of dimensions of threadblock tiles
+    :param stages: number of pipeline stages to use in the kernel
+    :type stages: int
+    :param element_a: data type of operand A
+    :param element_b: data type of operand B
+    :param arch: compute capability of kernel being generated
+    :type arch: int
+    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
+    :type opclass: cutlass.OpClass
+    :param suffix: additional string to append to the end of the name
+    :type suffix: str
+
+    :return: procedural name of the test case
+    :rtype: str
+    """
+    name_format = 'test_SM${arch}_Device_Gemm_${eA}${lA}_${eB}${lB}_${eC}${lC}_${opclass}_${acc}_${tbM}x${tbN}x${tbK}_${cM}x${cN}x${cK}_${stages}_align${aA}-${aB}-${aC}${suffix}'
+    return SubstituteTemplate(name_format,
+        {
+            'arch': str(arch),
+            'eA': library.DataTypeNames[element_a],
+            'eB': library.DataTypeNames[element_b],
+            'eC': library.DataTypeNames[element_output],
+            'lA': library.ShortLayoutTypeNames[layouts[0]],
+            'lB': library.ShortLayoutTypeNames[layouts[1]],
+            'lC': library.ShortLayoutTypeNames[layouts[2]],
+            'opclass': library.OpcodeClassNames[opclass],
+            'acc': library.DataTypeNames[element_accumulator],
+            'cM': str(cluster_shape[0]),
+            'cN': str(cluster_shape[1]),
+            'cK': str(cluster_shape[2]),
+            'tbM': str(threadblock_shape[0]),
+            'tbN': str(threadblock_shape[1]),
+            'tbK': str(threadblock_shape[2]),
+            'stages': str(stages) if stages is not None else 'auto',
+            'aA' : str(alignments[0]),
+            'aB' : str(alignments[1]),
+            'aC' : str(alignments[2]),
+            'suffix': '' if suffix is None else suffix
+        }
+    )
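To show what `get_name` produces, here is a minimal sketch that fills the same name format using the standard library's `string.Template` in place of PyCUTLASS's `SubstituteTemplate`. The short strings such as `f16`, `t`, `n`, and `tensorop` are placeholder values assumed for illustration; in the real function they come from lookup tables in `pycutlass.library`.

```python
from string import Template

# Same format string as in get_name; string.Template understands ${key} placeholders.
NAME_FORMAT = ('test_SM${arch}_Device_Gemm_${eA}${lA}_${eB}${lB}_${eC}${lC}_${opclass}_${acc}_'
               '${tbM}x${tbN}x${tbK}_${cM}x${cN}x${cK}_${stages}_align${aA}-${aB}-${aC}${suffix}')

stages = None  # None means the stage count is chosen automatically
suffix = None

name = Template(NAME_FORMAT).substitute({
    'arch': '90',
    'eA': 'f16', 'lA': 't',       # placeholder type/layout names (assumed)
    'eB': 'f16', 'lB': 'n',
    'eC': 'f32', 'lC': 'n',
    'opclass': 'tensorop',        # placeholder opcode class name (assumed)
    'acc': 'f32',
    'tbM': '128', 'tbN': '128', 'tbK': '32',
    'cM': '2', 'cN': '1', 'cK': '1',
    'stages': str(stages) if stages is not None else 'auto',
    'aA': '8', 'aB': '8', 'aC': '4',
    'suffix': '' if suffix is None else suffix,
})
print(name)
```

The threadblock shape precedes the cluster shape in this test name, which is the reverse of the order used by `TileDescription.procedural_name`.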
--- a/tools/library/scripts/pycutlass/src/pycutlass/utils/datatypes.py
+++ b/tools/library/scripts/pycutlass/src/pycutlass/utils/datatypes.py
@ -0,0 +1,121 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+"""
+Utility functions for converting between frontend datatypes and CUTLASS datatypes
+"""
+
+from typing import Union, Tuple
+
+import cutlass
+
+import pycutlass.library as library
+
+
+try:
+    import numpy as np
+    numpy_available = True
+except ImportError:
+    numpy_available = False
+
+def numpy_to_cutlass(inp):
+    if numpy_available:
+        if inp == np.float16:
+            return cutlass.float16
+        elif inp == np.float32:
+            return cutlass.float32
+        elif inp == np.float64:
+            return cutlass.float64
+        elif inp == np.int8:
+            return cutlass.int8
+        elif inp == np.int32:
+            return cutlass.int32
+    return None
+
+try:
+    import cupy as cp
+    cupy_available = True
+    cupy_to_cutlass_dict = {
+        cp.float16: cutlass.float16,
+        cp.float32: cutlass.float32,
+        cp.float64: cutlass.float64
+    }
+except ImportError:
+    cupy_available = False
+
+def cupy_to_cutlass(inp):
+    if cupy_available:
+        if inp == cp.float16:
+            return cutlass.float16
+        elif inp == cp.float32:
+            return cutlass.float32
+        elif inp == cp.float64:
+            return cutlass.float64
+    return None
+
+try:
+    import torch
+    torch_available = True
+    torch_to_cutlass_dict = {
+        torch.half:    cutlass.float16,
+        torch.float16: cutlass.float16,
+        torch.float:   cutlass.float32,
+        torch.float32: cutlass.float32,
+        torch.double:  cutlass.float64,
+        torch.float64: cutlass.float64
+    }
+except ImportError:
+    torch_available = False
+
+def torch_to_cutlass(inp):
+    if torch_available:
+        return torch_to_cutlass_dict.get(inp, None)
+
+try:
+    import bfloat16
+    bfloat16_available = True
+except ImportError:
+    bfloat16_available = False
+
+def bfloat16_to_cutlass(inp):
+    if bfloat16_available:
+        if inp == bfloat16.bfloat16:
+            return cutlass.bfloat16
+
+
+def to_cutlass(inp):
+    for cvt_fn in [bfloat16_to_cutlass, cupy_to_cutlass, numpy_to_cutlass, torch_to_cutlass]:
+        out = cvt_fn(inp)
+        if out is not None:
+            return out
+
+    raise Exception('No available conversion from type {} to a CUTLASS type.'.format(inp))
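Note on `datatypes.py`: `to_cutlass` tries each frontend converter in turn and returns the first result that is not `None`; a converter returns `None` when its frontend package is not installed or the type is not recognized. A minimal standalone sketch of that pattern follows. It is not the PyCUTLASS code: plain strings stand in for the `cutlass` datatype objects, and `builtin_to_name` and `to_name` are hypothetical names introduced only for illustration.

```python
# Sketch of the "optional frontend + first non-None converter wins" pattern.
# Assumption: strings such as "float32" stand in for cutlass datatype objects.

try:
    import numpy as np
    numpy_available = True
except ImportError:
    numpy_available = False


def numpy_to_name(inp):
    """Return a type name for a NumPy type, or None if NumPy is missing or the type is unknown."""
    if numpy_available:
        # Compare with == (as the diff does) so that both np.float32 and
        # np.dtype("float32") are accepted.
        for np_type, name in (
            (np.float16, "float16"),
            (np.float32, "float32"),
            (np.float64, "float64"),
        ):
            if inp == np_type:
                return name
    return None


def builtin_to_name(inp):
    """Hypothetical second frontend (not in PyCUTLASS): Python's builtin float."""
    return {float: "float64"}.get(inp)


def to_name(inp):
    """Try each converter in order and return the first non-None result."""
    for cvt_fn in (numpy_to_name, builtin_to_name):
        out = cvt_fn(inp)
        if out is not None:
            return out
    raise TypeError("No available conversion from type {}.".format(inp))


if __name__ == "__main__":
    print(to_name(float))  # float64
```

Because every converter signals "not mine" with `None` instead of raising, a missing optional package only removes one branch of the dispatch; the error is raised once, at the end, when no frontend recognizes the type.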
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
 from pycutlass.conv2d_operation import *
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_dgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
 import pycutlass
 from pycutlass.conv2d_operation import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_dgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass.test import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass.test import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
 import pycutlass
 from pycutlass.conv2d_operation import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_wgrad_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_wgrad_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu
 import pycutlass
 from pycutlass.conv2d_operation import *
--- a/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/conv/conv2d_wgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 # test/unit/conv/device/conv2d_wgrad_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu
 import pycutlass
 from pycutlass import *
--- a/tools/library/scripts/pycutlass/test/conv/run_all_tests.py
+++ b/tools/library/scripts/pycutlass/test/conv/run_all_tests.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 import unittest
 from pycutlass.memory_manager import *
--- a/tools/library/scripts/pycutlass/test/example/run_all_example.sh
+++ b/tools/library/scripts/pycutlass/test/example/run_all_example.sh
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 pushd $CUTLASS_PATH/examples/40_cutlass_py/customizable

 python gemm.py -i 8 8 4 -ta float64 -tb float64 -tc float64 -tacc float64 -m multiply_add -op TensorOp -b 32 32 16 -s 4 -w 2 2 1 -cc 80 -la ColumnMajor -aa 1 -lb RowMajor -ab 1 -lc RowMajor -ac 1 -te float64 -ep LinearCombination -sw IdentitySwizzle1 -p 512 256 128 -alpha 1.0 -beta 0.5 -gm Gemm -k 1
--- a/tools/library/scripts/pycutlass/test/frontend/run_test.sh
+++ b/tools/library/scripts/pycutlass/test/frontend/run_test.sh
@ -1 +1,33 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 CUPY_CACHE_DIR=./ python test_frontend.py
--- a/tools/library/scripts/pycutlass/test/frontend/test_frontend.py
+++ b/tools/library/scripts/pycutlass/test/frontend/test_frontend.py
@ -29,13 +29,15 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 #
 #################################################################################################
-## Test case for Pytorch
+
+"""
+Test cases for frontends
+"""
+
 import pycutlass
 import unittest
 from pycutlass import *
 from pycutlass.utils.device import device_cc
-import torch
-import cupy as cp


 class Test_Frontend(unittest.TestCase):
@ -49,9 +51,7 @@ class Test_Frontend(unittest.TestCase):
            cutlass.OpClass.Simt, MathOperation.multiply_add
        )

-        # Stages > 2 is supported only for compute capability 80 and beyond
-        stages = 4 if cc >= 80 else 2
-
+        stages = 2
        tile_description = TileDescription(
            [128, 128, 8], stages, [2, 4, 1],
            math_inst
@ -84,6 +84,11 @@ class Test_Frontend(unittest.TestCase):


    def test_torch_frontend(self):
+        try:
+            import torch
+        except ImportError:
+            self.fail("Unable to import torch")
+
        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)

        tensor_A = torch.ceil(torch.empty(size=(problem_size.m(), problem_size.k()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
@ -111,6 +116,11 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(torch.equal(tensor_D, tensor_D_ref))
    
    def test_cupy_frontend(self):
+        try:
+            import cupy as cp
+        except ImportError:
+            self.fail("Unable to import cupy")
+
        cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

        problem_size = cutlass.gemm.GemmCoord(512, 256, 128)
@ -139,7 +149,6 @@ class Test_Frontend(unittest.TestCase):
        self.assertTrue(cp.array_equal(tensor_D, tensor_D_ref))


-
 if __name__ == '__main__':
    pycutlass.get_memory_pool(2**32, 2**32)
    unittest.main()
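
The frontend tests in this file import `torch` and `cupy` inside each test method, so the module still loads when one of the packages is absent. The standalone sketch below illustrates that idea, under the assumption that one would prefer to report a skip rather than a failure. It is not the PyCUTLASS implementation: `optional_import`, `OptionalFrontendTest`, `run_suite`, and the module name `surely_missing_module_xyz` are hypothetical names introduced only for illustration.

```python
import importlib
import unittest


def optional_import(name):
    """Return the imported module, or None if it is not installed (hypothetical helper)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


class OptionalFrontendTest(unittest.TestCase):
    def test_missing_frontend(self):
        # This module name is made up, so the import is expected to fail.
        mod = optional_import("surely_missing_module_xyz")
        if mod is None:
            self.skipTest("surely_missing_module_xyz is not installed")
        self.fail("unreachable in this sketch")

    def test_available_frontend(self):
        # `math` stands in for a frontend package that is installed.
        mod = optional_import("math")
        if mod is None:
            self.skipTest("math is not installed")
        self.assertEqual(mod.sqrt(16), 4.0)


def run_suite():
    """Run the sketch's tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalFrontendTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether a missing frontend should count as a skip or a failure is a policy choice. The tests in this commit fail, which makes a missing package visible in environments that are expected to provide it.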
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_bf16_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_bf16_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.test import *
@ -92,5 +124,5 @@ class GemmBF16TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "multistage"))

 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_bf16_sm90.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_bf16_sm90.py
@ -0,0 +1,138 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+from functools import partial
+import pycutlass
+from pycutlass import *
+from pycutlass import library
+from pycutlass.test import *
+import unittest
+
+from pycutlass.test.utils import LayoutCombination, get_name
+from pycutlass.test.gemm_testbed import test_all_gemm
+from pycutlass.utils.device import device_cc
+
+
+name_fn = partial(get_name, element_a=cutlass.bfloat16, element_b=cutlass.bfloat16, arch=90)
+
+def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
+             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
+    """
+    Create a test-running function with the given specification and set it as a method of `cls`.
+
+    :param cls: class to which the generated method will be added
+    :type cls: type
+    :param layouts: indexable container of layouts of A, B, and C operands
+    :param alignments: indexable container of alignments of A, B, and C operands
+    :param element_output: data type of the output element
+    :param element_accumulator: data type used in accumulation
+    :param element_epilogue: data type used in computing the epilogue
+    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
+    :param threadblock_shape: indexable container of dimensions of threadblock tiles
+    :param stages: number of pipeline stages to use in the kernel
+    :type stages: int or None
+    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
+    :type opclass: cutlass.OpClass
+    :param persistent: whether this is a persistent warp-specialized kernel
+    :type persistent: bool
+    """
+
+    def run(self):
+        """
+        Dynamically-generated function that constructs a GEMM operation and verifies it against
+        multiple test cases.
+        """
+        element_A = cutlass.bfloat16
+        element_B = cutlass.bfloat16
+        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
+        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
+        math_inst = MathInstruction(
+            instruction_shape=inst_shape,
+            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
+            opcode_class=opclass, math_operation=MathOperation.multiply_add
+        )
+
+        tile_description = TileDescription(
+            threadblock_shape=threadblock_shape,
+            cluster_shape=cluster_shape,
+            stages=stages, warp_count=warp_count,
+            math_instruction=math_inst,
+            persistent=persistent
+        )
+
+        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
+        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
+        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
+
+        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
+
+        swizzling_functor = cutlass.IdentitySwizzle1
+
+        operation = GemmOperationUniversal(
+            arch=90, tile_description=tile_description, A=A, B=B, C=C,
+            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
+
+        self.assertTrue(test_all_gemm(operation, "universal"))
+
+    if persistent:
+        suffix = "_persistent"
+    else:
+        suffix = ""
+
+    name = name_fn(layouts, alignments, element_output, element_accumulator,
+                  element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
+    setattr(cls, name, run)
+
+    return run
+
+
+@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
+class GemmBF16Sm90(unittest.TestCase):
+    """
+    Wrapper class to which tests are added dynamically at module scope via add_test
+    """
+    pass
+
+
+add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
+add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
+
+add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
+add_test_tensorop(GemmBF16Sm90, LayoutCombination.NNN, [4, 4, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 5)
+add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmBF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.bfloat16, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 32], None, persistent=True)
+add_test_simt(GemmBF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.bfloat16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)
+
+
+if __name__ == '__main__':
+    pycutlass.get_memory_pool(2**30, 2**30)
+    unittest.main()
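
The file above builds its test cases by generating a `run` function per configuration, attaching it to an empty `unittest.TestCase` subclass with `setattr`, and fixing the `opclass` argument through `functools.partial`. The sketch below reproduces only that pattern with plain arithmetic in place of GEMM kernels. Everything in it (`ArithmeticTests`, the `op_name` parameter, the naming scheme) is an illustrative assumption and does not use any PyCUTLASS API. In particular, it assumes the generated name must begin with `test`, which is what the default `unittest` loader looks for; what `get_name` actually returns is not shown in this diff.

```python
from functools import partial
import unittest


def add_test(cls, lhs, rhs, expected, op_name):
    """Attach a generated test method to `cls` and return it."""

    def run(self):
        # Stand-in for constructing and verifying a kernel.
        ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
        self.assertEqual(ops[op_name](lhs, rhs), expected)

    # The default unittest loader only collects methods whose names start with "test".
    name = f"test_{op_name}_{lhs}_{rhs}"
    setattr(cls, name, run)
    return run


class ArithmeticTests(unittest.TestCase):
    """Empty wrapper; methods are attached below at module scope."""


# Fix the trailing keyword argument, as add_test_tensorop and add_test_simt do.
add_test_add = partial(add_test, op_name="add")
add_test_mul = partial(add_test, op_name="mul")

add_test_add(ArithmeticTests, 2, 3, 5)
add_test_mul(ArithmeticTests, 2, 3, 6)


def run_suite():
    """Run the generated tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArithmeticTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the `add_test_*` calls run at import time, the generated methods exist before `unittest.main()` or any external runner inspects the class.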
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_f16_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_f16_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.test import *
@ -443,5 +475,5 @@ class GemmF16Sm80(unittest.TestCase):
    

 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_f16_sm90.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_f16_sm90.py
@ -0,0 +1,182 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+from functools import partial
+import pycutlass
+from pycutlass import *
+from pycutlass import library
+from pycutlass.test import *
+import unittest
+
+from pycutlass.test.utils import LayoutCombination, get_name
+from pycutlass.test.gemm_testbed import test_all_gemm
+from pycutlass.utils.device import device_cc
+
+
+# Partial application of get_name for naming tests
+name_fn = partial(get_name, element_a=cutlass.float16, element_b=cutlass.float16, arch=90)
+
+
+def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
+             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
+    """
+    Create a test-running function with the given specification and set it as a method of `cls`.
+
+    :param cls: class to which the generated method will be added
+    :type cls: type
+    :param layouts: indexable container of layouts of A, B, and C operands
+    :param alignments: indexable container of alignments of A, B, and C operands
+    :param element_output: data type of the output element
+    :param element_accumulator: data type used in accumulation
+    :param element_epilogue: data type used in computing the epilogue
+    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
+    :param threadblock_shape: indexable container of dimensions of threadblock tiles
+    :param stages: number of pipeline stages to use in the kernel
+    :type stages: int or None
+    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
+    :type opclass: cutlass.OpClass
+    :param persistent: whether this is a persistent warp-specialized kernel
+    :type persistent: bool
+    """
+
+    def run(self):
+        """
+        Dynamically-generated function that constructs a GEMM operation and verifies it against
+        multiple test cases.
+        """
+
+        element_A = cutlass.float16
+        element_B = cutlass.float16
+        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
+        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
+        math_inst = MathInstruction(
+            instruction_shape=inst_shape,
+            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
+            opcode_class=opclass, math_operation=MathOperation.multiply_add
+        )
+
+        tile_description = TileDescription(
+            threadblock_shape=threadblock_shape,
+            cluster_shape=cluster_shape,
+            stages=stages, warp_count=warp_count,
+            math_instruction=math_inst,
+            persistent=persistent
+        )
+
+        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
+        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
+        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
+
+        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
+
+        swizzling_functor = cutlass.IdentitySwizzle1
+
+        operation = GemmOperationUniversal(
+            arch=90, tile_description=tile_description, A=A, B=B, C=C,
+            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
+
+        self.assertTrue(test_all_gemm(operation, "universal"))
+
+    if persistent:
+        suffix = "_persistent"
+    else:
+        suffix = ""
+
+    name = name_fn(layouts, alignments, element_output, element_accumulator,
+                  element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
+    setattr(cls, name, run)
+
+    return run
+
+
+@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
+class GemmF16Sm90(unittest.TestCase):
+    """
+    Wrapper class to which tests are added dynamically at module scope via add_test
+    """
+    pass
+
+
+add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
+add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
+
+# Tests with 1x1x1 clusters
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], 3)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [4, 4, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [8, 8, 8], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 64], 5)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNT, [2, 2, 2], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 32], None)
+
+# Tests with different cluster shapes
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.NNN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 2, 1], [64, 128, 64], None)
+
+# Tests for persistent warp-specialized threadblocks
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 1, 1], [128, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 2, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 2, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [1, 4, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [2, 4, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 1, 1], [64, 128, 64], None, persistent=True)
+add_test_tensorop(GemmF16Sm90, LayoutCombination.TTN, [8, 8, 8], cutlass.float32, cutlass.float32, cutlass.float32, [4, 4, 1], [64, 128, 64], None, persistent=True)
+
+# Tests using SIMT
+add_test_simt(GemmF16Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 128, 8], 2)
+add_test_simt(GemmF16Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 128, 8], 2)
+add_test_simt(GemmF16Sm90, LayoutCombination.NTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [128, 64, 8], 2)
+add_test_simt(GemmF16Sm90, LayoutCombination.TTN, [1, 1, 1], cutlass.float16, cutlass.float32, cutlass.float32, [1, 1, 1], [64, 64, 8], 2)
+add_test_simt(GemmF16Sm90, LayoutCombination.NNT, [1, 1, 1], cutlass.float16, cutlass.float16, cutlass.float16, [1, 1, 1], [128, 128, 8], 2)
+
+
+if __name__ == '__main__':
+    pycutlass.get_memory_pool(2**30, 2**30)
+    unittest.main()
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_f32_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_f32_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.memory_manager import get_allocated_size
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_f64_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_f64_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.test import *
@ -98,5 +130,5 @@ class GemmF64TensorOpSm80(unittest.TestCase):
        self.assertTrue(test_all_gemm(operation, "universal"))

 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
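
The hunks in this commit raise the test memory pool arguments from `2**24` or `2**26` to `2**30`. As a rough sketch of the scale involved, and assuming both arguments are byte counts for the initial and maximum pool size (the diff itself does not state the unit), the helper below converts each value to binary units. `format_bytes` is a hypothetical helper written for this note and is not part of PyCUTLASS.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count using binary (power-of-1024) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


# Pool sizes that appear in the hunks of this commit.
for exponent in (24, 26, 30):
    print(f"2**{exponent} = {format_bytes(2**exponent)}")
```

Under that assumption, the change is from a 16 MiB or 64 MiB pool to a 1 GiB pool.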
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_f64_sm90.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_f64_sm90.py
@ -0,0 +1,124 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+from functools import partial
+import pycutlass
+from pycutlass import *
+from pycutlass import library
+from pycutlass.test import *
+import unittest
+
+from pycutlass.test.utils import LayoutCombination, get_name
+from pycutlass.test.gemm_testbed import test_all_gemm
+from pycutlass.utils.device import device_cc
+
+
+name_fn = partial(get_name, element_a=cutlass.float64, element_b=cutlass.float64, arch=90)
+
+def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
+             cluster_shape, threadblock_shape, stages, opclass):
+    """
+    Create a test-running function with the given specification and set it as a method of `cls`.
+
+    :param cls: class to which the generated method will be added
+    :type cls: type
+    :param layouts: indexable container of layouts of A, B, and C operands
+    :param alignments: indexable container of alignments of A, B, and C operands
+    :param element_output: data type of the output element
+    :param element_accumulator: data type used in accumulation
+    :param element_epilogue: data type used in computing the epilogue
+    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
+    :param threadblock_shape: indexable container of dimensions of threadblock tiles
+    :param stages: number of pipeline stages to use in the kernel
+    :type stages: int
+    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
+    :type opclass: cutlass.OpClass
+    """
+
+    def run(self):
+        """
+        Dynamically-generated function that constructs a GEMM operation and verifies it against
+        multiple test cases.
+        """
+        element_A = cutlass.float64
+        element_B = cutlass.float64
+        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
+        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
+        math_inst = MathInstruction(
+            instruction_shape=inst_shape,
+            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
+            opcode_class=opclass, math_operation=MathOperation.multiply_add
+        )
+
+        tile_description = TileDescription(
+            threadblock_shape=threadblock_shape,
+            cluster_shape=cluster_shape,
+            stages=stages, warp_count=warp_count,
+            math_instruction=math_inst
+        )
+
+        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
+        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
+        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
+
+        epilogue_functor = LinearCombination(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
+
+        swizzling_functor = cutlass.IdentitySwizzle1
+
+        operation = GemmOperationUniversal(
+            arch=90, tile_description=tile_description, A=A, B=B, C=C,
+            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
+
+        self.assertTrue(test_all_gemm(operation, "universal"))
+
+    name = name_fn(layouts, alignments, element_output, element_accumulator,
+                  element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass)
+    setattr(cls, name, run)
+
+    return run
+
+
+@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
+class GemmF64Sm90(unittest.TestCase):
+    """
+    Wrapper class to which tests are added dynamically at module import time
+    """
+    pass
+
+
+add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
+add_test_simt(GemmF64Sm90, LayoutCombination.NNN, [1, 1, 1], cutlass.float64, cutlass.float64, cutlass.float64, [1, 1, 1], [64, 64, 32], 2)
+
+
+if __name__ == '__main__':
+    pycutlass.get_memory_pool(2**30, 2**30)
+    unittest.main()
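
The file above builds its test methods at import time: `add_test` creates a closure for one configuration, attaches it to the `TestCase` subclass with `setattr`, and `functools.partial` pins the opcode class. A minimal, GPU-free sketch of that pattern follows. The names `ToyGemmTests` and `run_toy_tests`, the string opclass, and the `test_...` naming scheme are assumptions made for illustration; the real name comes from `get_name`, whose output format is not shown in this diff.

```python
import unittest
from functools import partial


def add_test(cls, shape, stages, opclass):
    """Build a test method for one configuration and attach it to `cls`."""

    def run(self):
        # Stand-in for constructing and verifying a GEMM operation.
        m, n, k = shape
        self.assertGreater(m * n * k, 0)
        self.assertGreaterEqual(stages, 1)

    # unittest's default loader only collects methods whose names start with "test".
    name = f"test_{opclass}_{'x'.join(map(str, shape))}_stages{stages}"
    setattr(cls, name, run)
    return run


class ToyGemmTests(unittest.TestCase):
    """Empty container; methods are attached by the module-level calls below."""


# Pin the opcode class once, as the diff does with add_test_simt.
add_test_simt = partial(add_test, opclass="simt")
add_test_simt(ToyGemmTests, [64, 64, 32], 2)
add_test_simt(ToyGemmTests, [128, 64, 8], 2)


def run_toy_tests():
    """Load and run every dynamically attached test; return the result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(ToyGemmTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the methods are attached when the module is imported, a discovery-based runner sees them without executing the `__main__` block.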
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_grouped_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_grouped_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.test import *
@ -199,5 +231,5 @@ class GemmGroupedSm80(unittest.TestCase):


 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**26, 2**26)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_s8_sm80.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_s8_sm80.py
@ -1,3 +1,35 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 from pycutlass import *
 from pycutlass.epilogue import LinearCombinationClamp
@ -225,5 +257,5 @@ class GemmS8TensorOpF32Sm80(unittest.TestCase):


 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**24, 2**24)
+    pycutlass.get_memory_pool(2**30, 2**30)
    unittest.main()
--- a/tools/library/scripts/pycutlass/test/gemm/gemm_s8_sm90.py
+++ b/tools/library/scripts/pycutlass/test/gemm/gemm_s8_sm90.py
@ -0,0 +1,154 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
+from functools import partial
+import pycutlass
+from pycutlass import *
+from pycutlass import library
+from pycutlass.test import *
+import unittest
+
+from pycutlass.test.utils import LayoutCombination, get_name
+from pycutlass.test.gemm_testbed import test_all_gemm
+from pycutlass.utils.device import device_cc
+
+
+name_fn = partial(get_name, element_a=cutlass.int8, element_b=cutlass.int8, arch=90)
+
+def add_test(cls, layouts, alignments, element_output, element_accumulator, element_epilogue,
+             cluster_shape, threadblock_shape, stages, opclass, persistent=False):
+    """
+    Create a test-running function with the given specification and set it as a method of `cls`.
+
+    :param cls: class to which the generated method will be added
+    :type cls: type
+    :param layouts: indexable container of layouts of A, B, and C operands
+    :param alignments: indexable container of alignments of A, B, and C operands
+    :param element_output: data type of the output element
+    :param element_accumulator: data type used in accumulation
+    :param element_epilogue: data type used in computing the epilogue
+    :param cluster_shape: indexable container of dimensions of threadblock cluster to be launched
+    :param threadblock_shape: indexable container of dimensions of threadblock tiles
+    :param stages: number of pipeline stages to use in the kernel
+    :type stages: int
+    :param opclass: class of operation being performed (e.g., SIMT, Tensor Core)
+    :type opclass: cutlass.OpClass
+    :param persistent: whether this is a persistent warp-specialized kernel
+    :type persistent: bool
+    """
+
+    def run(self):
+        """
+        Dynamically-generated function that constructs a GEMM operation and verifies it against
+        multiple test cases.
+        """
+        element_A = cutlass.int8
+        element_B = cutlass.int8
+        inst_shape = [1, 1, 1] if opclass == cutlass.OpClass.Simt else None
+        warp_count = [2, 2, 1] if opclass == cutlass.OpClass.Simt else None
+        math_inst = MathInstruction(
+            instruction_shape=inst_shape,
+            element_a=element_A, element_b=element_B, element_accumulator=element_accumulator,
+            opcode_class=opclass, math_operation=MathOperation.multiply_add
+        )
+
+        tile_description = TileDescription(
+            threadblock_shape=threadblock_shape,
+            cluster_shape=cluster_shape,
+            stages=stages, warp_count=warp_count,
+            math_instruction=math_inst,
+            persistent=persistent
+        )
+
+        A = TensorDescription(element=element_A, layout=layouts[0], alignment=alignments[0])
+        B = TensorDescription(element=element_B, layout=layouts[1], alignment=alignments[1])
+        C = TensorDescription(element=element_output, layout=layouts[2], alignment=alignments[2])
+
+        if opclass == cutlass.OpClass.Simt:
+            epilogue_functor_cls = LinearCombinationClamp
+        else:
+            epilogue_functor_cls = LinearCombination
+        epilogue_functor = epilogue_functor_cls(C.element, C.alignment, math_inst.element_accumulator, element_epilogue)
+
+        swizzling_functor = cutlass.IdentitySwizzle1
+
+        operation = GemmOperationUniversal(
+            arch=90, tile_description=tile_description, A=A, B=B, C=C,
+            epilogue_functor=epilogue_functor, swizzling_functor=swizzling_functor)
+
+        self.assertTrue(test_all_gemm(operation, "universal"))
+
+    if persistent:
+        suffix = "_persistent"
+    else:
+        suffix = ""
+
+    name = name_fn(layouts, alignments, element_output, element_accumulator,
+                  element_epilogue, cluster_shape, threadblock_shape, stages, opclass=opclass, suffix=suffix)
+    setattr(cls, name, run)
+
+    return run
+
+
+@unittest.skipIf(device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
+class GemmS8Sm90(unittest.TestCase):
+    """
+    Wrapper class to which tests are added dynamically at module import time
+    """
+    pass
+
+
+add_test_tensorop = partial(add_test, opclass=cutlass.OpClass.TensorOp)
+add_test_simt = partial(add_test, opclass=cutlass.OpClass.Simt)
+
+# Tests with 1x1x1 clusters
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNN, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], 3)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 8],  cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 128, 128], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 64, 32], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [4, 4, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [128, 128, 128], None)
+
+# Tests with different cluster shapes
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 2, 1], [128, 128, 128], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [1, 4, 1], [128, 128, 128], None)
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [4, 4, 1], [128, 128, 128], None)
+
+# Tests with persistent warp-specialized threadblocks
+add_test_tensorop(GemmS8Sm90, LayoutCombination.TNT, [16, 16, 16], cutlass.int8, cutlass.int32, cutlass.int32, [2, 1, 1], [128, 128, 128], None, persistent=True)
+
+# Tests for SIMT
+add_test_simt(GemmS8Sm90, LayoutCombination.TNN, [1, 1, 1], cutlass.int8, cutlass.int32, cutlass.int32, [1, 1, 1], [64, 32, 8], 2)
+
+if __name__ == '__main__':
+    pycutlass.get_memory_pool(2**30, 2**30)
+    unittest.main()
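
Two details of the file above are easy to miss: the class-level `unittest.skipIf` also covers methods attached after the class is defined, and the `persistent` flag only changes the generated name through a suffix. The sketch below shows both without a GPU. `fake_device_cc` is a hypothetical stand-in for `device_cc` that always reports 80, and the `test_cluster_...` naming is an assumption made for illustration.

```python
import unittest


def fake_device_cc() -> int:
    """Hypothetical stand-in for a device query; pretends the GPU is SM80."""
    return 80


@unittest.skipIf(fake_device_cc() < 90, "Device compute capability is insufficient for SM90 tests.")
class ToySm90Tests(unittest.TestCase):
    """Empty container; the skip decision is stored on the class itself."""


def add_test(cls, cluster_shape, persistent=False):
    """Attach one test to `cls`, tagging persistent variants in the name."""

    def run(self):
        self.assertEqual(len(cluster_shape), 3)

    suffix = "_persistent" if persistent else ""
    name = "test_cluster_" + "x".join(map(str, cluster_shape)) + suffix
    setattr(cls, name, run)


add_test(ToySm90Tests, [1, 1, 1])
add_test(ToySm90Tests, [2, 1, 1], persistent=True)


def run_toy_tests():
    """Run the attached tests; on the pretend SM80 device they are all skipped."""
    suite = unittest.TestLoader().loadTestsFromTestCase(ToySm90Tests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Distinct suffixes matter here because two configurations that differ only in `persistent` would otherwise produce the same method name, and the second `setattr` would silently replace the first test.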
--- a/tools/library/scripts/pycutlass/test/gemm/run_all_tests.py
+++ b/tools/library/scripts/pycutlass/test/gemm/run_all_tests.py
@ -1,8 +1,40 @@
+#################################################################################################
+#
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+#################################################################################################
+
 import pycutlass
 import unittest

 if __name__ == '__main__':
-    pycutlass.get_memory_pool(2**26, 2**26)
+    pycutlass.get_memory_pool(2**30, 2**30)
    loader = unittest.TestLoader()
    tests = loader.discover('./', 'gemm_*.py')
    testRunner = unittest.runner.TextTestRunner()