Release v4.0.0 (#2294)
475
examples/python/deprecated/00_basic_gemm.ipynb
Normal file
@@ -0,0 +1,475 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "1ef96b3f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Basic example of using the CUTLASS Python interface\n",
|
||||
"This notebook walks through a basic example of using the CUTLASS Python interface to declare, compile, and run GEMMs.\n",
|
||||
"\n",
|
||||
"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/00_basic_gemm.ipynb)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "df94d7e6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites for running on Colab\n",
|
||||
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "71c7a069",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cf16785d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c819bb68",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#pip install nvidia-cutlass"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "962324fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## General setup\n",
|
||||
"We first import various packages needed for the example and construct the input and output tensors that will be used in our example."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0e324219",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import random\n",
|
||||
"\n",
|
||||
"import cutlass\n",
|
||||
"\n",
|
||||
"# This controls whether the C++ GEMM declaration will be printed at each step. \n",
|
||||
"# Set to `False` to omit this information.\n",
|
||||
"print_module = True\n",
|
||||
"\n",
|
||||
"m = 128\n",
|
||||
"n = m\n",
|
||||
"k = m\n",
|
||||
"\n",
|
||||
"dtype = np.float16\n",
|
||||
"type_A = np.float16\n",
|
||||
"type_B = np.float16\n",
|
||||
"type_C = np.float16\n",
|
||||
"type_D = np.float16\n",
|
||||
"\n",
|
||||
"np.random.seed(1234)\n",
|
||||
"random.seed(1234)\n",
|
||||
"scope_min = -4\n",
|
||||
"scope_max = 4\n",
|
||||
"tensor_A = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, k)).astype(type_A))\n",
|
||||
"tensor_B = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(k, n)).astype(type_B))\n",
|
||||
"tensor_C = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, n)).astype(type_C))\n",
|
||||
"\n",
|
||||
"alpha = np.float16(1.)\n",
|
||||
"beta = np.float16(0.)\n",
|
||||
"\n",
|
||||
"tensor_D = np.zeros(tensor_C.shape).astype(type_D)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "f2c7bf48",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Declaring and running a GEMM\n",
|
||||
"To get started, one only needs to provide the tensors declared above to the `cutlass.op.Gemm` call.\n",
|
||||
"This sets up a default GEMM operation for the given device on which you are running.\n",
|
||||
"\n",
|
||||
"Assuming that we are running on SM80, this default to using a GEMM that leverages FP16 Tensor Core operations.\n",
|
||||
"\n",
|
||||
"Calling `plan.run()` will generate the CUTLASS C++ kernel in question, compile it, and run it on the tensors we previously passed in. By setting `print_module` to `true`, the C++ code that is emitted is printed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0dfd8975",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# We specify `element_accumulator` here so as to match the kernel run by NumPy below. However,\n",
|
||||
"# specifying `element_accumulator` is not required if it is the same as `element`\n",
|
||||
"plan = cutlass.Gemm(element=dtype, layout=cutlass.LayoutType.RowMajor, element_accumulator=np.float32)\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "4a5856de",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There are many other ways to construct a plan from `cutlass.op.Gemm` (e.g., by specifiying they types and layouts of each operand, by providing representative tensors as inputs). For more details on these, see the documentation in the `cutlass.op.Gemm` constructor."
|
||||
]
|
||||
},
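{
"attachments": {},
"cell_type": "markdown",
"id": "b7a1c0d0",
"metadata": {},
"source": [
"For illustration, the added sketch below shows two such alternative constructions. This cell is not part of the original example, and the keyword names are assumptions to be checked against the `cutlass.op.Gemm` constructor documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a1c0d1",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: alternative constructions (keyword names are assumptions; see the docs)\n",
"# 1) Specify the type and layout of each operand explicitly\n",
"plan_explicit = cutlass.op.Gemm(\n",
" element_A=np.float16, element_B=np.float16, element_C=np.float16, element_D=np.float16,\n",
" layout_A=cutlass.LayoutType.RowMajor, layout_B=cutlass.LayoutType.RowMajor,\n",
" layout_C=cutlass.LayoutType.RowMajor, element_accumulator=np.float32)\n",
"\n",
"# 2) Provide representative tensors, from which types and layouts are inferred\n",
"plan_inferred = cutlass.op.Gemm(A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D, element_accumulator=np.float32)"
]
},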
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "945478ef",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We then compare the output to running the GEMM using NumPy."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6b669de6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tensor_D_numpy = (alpha * (tensor_A @ tensor_B)) + (beta * tensor_C)\n",
|
||||
"np.testing.assert_array_equal(tensor_D, tensor_D_numpy)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "ee5cbbbe",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that one could use the same kernel just declared for tensors provided by other frameworks beyond NumPy, such as PyTorch or CuPy."
|
||||
]
|
||||
},
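{
"attachments": {},
"cell_type": "markdown",
"id": "b7a1c0d2",
"metadata": {},
"source": [
"For example, the added sketch below reruns the same plan on PyTorch tensors. This cell is not part of the original example and assumes PyTorch with CUDA support is installed; skip it otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a1c0d3",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: reuse the plan declared above with PyTorch tensors\n",
"# (assumes PyTorch with CUDA support is available)\n",
"import torch\n",
"\n",
"A_pt = torch.from_numpy(tensor_A).to(\"cuda\")\n",
"B_pt = torch.from_numpy(tensor_B).to(\"cuda\")\n",
"C_pt = torch.from_numpy(tensor_C).to(\"cuda\")\n",
"D_pt = torch.zeros_like(C_pt)\n",
"\n",
"plan.run(A_pt, B_pt, C_pt, D_pt, print_module=False)\n",
"np.testing.assert_array_equal(D_pt.cpu().numpy(), tensor_D)"
]
},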
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "b6c86493",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Changing operation modes\n",
|
||||
"By default, the CUTLASS Python interface will try to use Tensor Core operations whenever possible. If the configuration provided to `cutlass.op.Gemm` is not supported on Tensor Cores, the interface will fall back to using a SIMT kernel.\n",
|
||||
"\n",
|
||||
"The operation mode currently in use can be returned via the `plan.opclass` property. In this case Tensor Core operations."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "529fda93",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(plan.opclass)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "6d27c575",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Suppose that we don't want to use Tensor Cores for this GEMM. One can change to using CUTLASS's SIMT GEMMs by setting the plan's `opclass` field.\n",
|
||||
"\n",
|
||||
"As is shown in the printed output, the emitted kernel uses template parameters that fit CUTLASS's SIMT GEMMs.\n",
|
||||
"\n",
|
||||
"Also notice that, this time around, we provided tensor parameters to `plan.run()`. One is free to provide different parameters to `plan.run()` than were passed in at the initial call to `cutlass.op.Gemm`, provided that the passed-in tensors have the same data type and layout as those passed in on intialization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6a44d35b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tensor_D_simt = np.zeros(tensor_C.shape).astype(type_D)\n",
|
||||
"plan.opclass = cutlass.OpcodeClass.Simt\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D_simt, alpha, beta, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "639dcb59",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If we compare the output of the Tensor Core and SIMT GEMMs we just ran we see that they are equal."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9b480853",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"np.testing.assert_array_equal(tensor_D, tensor_D_simt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "0cce1eae",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Running cached kernels\n",
|
||||
"You may have noticed that the `plan.run()` calls for the previous two kernels took some time to execute. This is because the kernel being emitted had not yet been compiled.\n",
|
||||
"\n",
|
||||
"CUTLASS caches compiled binaries so that recompilation isn't necessary every time a kernel is run. For example, if we change modes back to using Tensor Cores and call `plan.run()` again (with a different set of tensor parameters), you'll find the call to return much faster."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f8051e5e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"m = 2400\n",
|
||||
"n = 3232\n",
|
||||
"k = 4096\n",
|
||||
"\n",
|
||||
"tensor_A = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, k)).astype(type_A))\n",
|
||||
"tensor_B = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(k, n)).astype(type_B))\n",
|
||||
"tensor_C = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, n)).astype(type_C))\n",
|
||||
"tensor_D = np.zeros(tensor_C.shape).astype(type_D)\n",
|
||||
"\n",
|
||||
"alpha = np.float16(1.)\n",
|
||||
"beta = np.float16(2.)\n",
|
||||
"\n",
|
||||
"plan.opclass = cutlass.OpcodeClass.TensorOp\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, alpha, beta, print_module=print_module)"
|
||||
]
|
||||
},
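{
"attachments": {},
"cell_type": "markdown",
"id": "b7a1c0d4",
"metadata": {},
"source": [
"As an added illustration (not part of the original notebook), we can time a repeated `plan.run()` call to confirm that the cached kernel launches without recompilation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a1c0d5",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: a second run with the same configuration should hit the kernel cache\n",
"import time\n",
"\n",
"start = time.time()\n",
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, alpha, beta, print_module=False)\n",
"print('Cached run returned in {:.3f} s'.format(time.time() - start))"
]
},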
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "52a4e318",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Running non-default GEMMs\n",
|
||||
"The previous examples showed how it is simple to get started running a default GEMM kernel in CUTLASS. But, what do you do if you want a bit more control over the parameters to the GEMM?\n",
|
||||
"\n",
|
||||
"Under the hood, CUTLASS enumerates the different GEMM configuration parameters possible for this kernel from the CUTLASS profiler. The code below shows how one can access the tile descriptions for the kernels (e.g., cluster, threadblock, and warp shape)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1c593be1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tiles = plan.tile_descriptions()\n",
|
||||
"print('{} tile descriptions returned'.format(len(tiles)))\n",
|
||||
"num_print = 10\n",
|
||||
"print('First {} tile descriptions are:'.format(num_print))\n",
|
||||
"for td in tiles[:num_print]:\n",
|
||||
" print(td)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "dc3ad875",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we'll pick one of these configurations at random and compile and run it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a8dc5287",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tiles = [td for td in tiles if td.threadblock_shape[0] >= 128]\n",
|
||||
"idx = random.randint(0, len(tiles)-1)\n",
|
||||
"td = tiles[idx]\n",
|
||||
"print('Tile description {} is: {}'.format(idx, td))\n",
|
||||
"plan.compile(td)\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, alpha, beta, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "c5a8b534",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"One can also change the swizzling function used by the kernel. For example, one can modify the kernel to use the stream K feature of CUTLASS via:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e5e88d17",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Stream K is exposed through the threadblock swizzle method for pre-SM90 kernels,\n",
|
||||
"# and via the tile_scheduler attribute of the TileDescription for post-SM90 kernels\n",
|
||||
"if plan.cc < 90:\n",
|
||||
" plan.swizzling_functor = cutlass.swizzle.ThreadblockSwizzleStreamK\n",
|
||||
" plan.run(tensor_A, tensor_B, tensor_C, tensor_D, alpha, beta, print_module=print_module)\n",
|
||||
"else:\n",
|
||||
" # Stream-K is currently only supported for warp-specialized cooperative kernels\n",
|
||||
" td.kernel_schedule = cutlass.KernelScheduleType.TmaWarpSpecializedCooperative\n",
|
||||
" td.epilogue_schedule = cutlass.EpilogueScheduleType.TmaWarpSpecializedCooperative\n",
|
||||
" td.tile_scheduler = cutlass.TileSchedulerType.StreamK\n",
|
||||
"\n",
|
||||
" plan.compile(td)\n",
|
||||
" plan.run(tensor_A, tensor_B, tensor_C, tensor_D, alpha, beta, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "5a8ba2ba",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Handling errors\n",
|
||||
"The CUTLASS Python interface attempts to catch runtime and compilation errors in Python so as to provide more understandable error messages.\n",
|
||||
"\n",
|
||||
"Here's an example in which we try to use too many stages for a given GEMM kernel. Normally, this would result in a runtime error due to the GPU having insufficient shared memory to launch the kernel with 8 stages. The CUTLASS Python interface is able to detect this issue before compiling the kernel, and reports it back to the user. Uncomment and run the code below to see this error."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "fe7d0e42",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# td = tiles[0]\n",
|
||||
"# td.stages = 8\n",
|
||||
"# plan.compile(td)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0fff34a4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Specializations for other data types\n",
|
||||
"\n",
|
||||
"Various CUTLASS kernels specialized for specific data types can also be run via the Python interface.\n",
|
||||
"\n",
|
||||
"For example, the code below shows how to declare and run a GEMM using the 3xTF32 feature (see corresponding C++ example [here](https://github.com/NVIDIA/cutlass/blob/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu))."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "338ad890",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from cutlass.backend.utils.device import device_cc\n",
|
||||
"\n",
|
||||
"# 3xTF32 requires SM80 or higher\n",
|
||||
"if device_cc() >= 80:\n",
|
||||
" plan = cutlass.op.Gemm(element=np.float32, layout=cutlass.LayoutType.RowMajor)\n",
|
||||
" plan.math_operation = cutlass.MathOperation.multiply_add_fast_f32\n",
|
||||
"\n",
|
||||
" # Create input/output tensors in FP32\n",
|
||||
" A, B = [np.ones((128, 128)).astype(np.float32) for _ in range(2)]\n",
|
||||
" C, D = [np.zeros((128, 128)).astype(np.float32) for _ in range(2)]\n",
|
||||
"\n",
|
||||
" # Run the GEMM\n",
|
||||
" plan.run(A, B, C, D, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "65531df1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Additionally, one can run CUTLASS's FP8 GEMMs if using a frontend library capable of allocating and initializing FP8 tensors (e.g., PyTorch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "776f1d8d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"try:\n",
|
||||
" import torch\n",
|
||||
"except ImportError:\n",
|
||||
" print(\"PyTorch is not available. Skipping FP8 example\")\n",
|
||||
" import sys; sys.exit(0)\n",
|
||||
"\n",
|
||||
"if not hasattr(torch, \"float8_e4m3fn\"):\n",
|
||||
" print(\"Version of PyTorch does not have the float8_e4m3fn data type. Skipping FP8 example\")\n",
|
||||
" import sys; sys.exit(0)\n",
|
||||
"\n",
|
||||
"# FP8 is supported through the CUTLASS Python interface on SM90 and higher\n",
|
||||
"if device_cc() >= 90:\n",
|
||||
" plan = cutlass.op.Gemm(element=torch.float8_e4m3fn, element_C=torch.float32, element_accumulator=torch.float32,\n",
|
||||
" layout_A=cutlass.LayoutType.RowMajor, layout_B=cutlass.LayoutType.ColumnMajor,\n",
|
||||
" layout_C=cutlass.LayoutType.ColumnMajor)\n",
|
||||
"\n",
|
||||
" # Create input/output tensors in FP8\n",
|
||||
" A, B = [torch.ones((128, 128)).to(torch.float8_e4m3fn).to(\"cuda\") for _ in range(2)]\n",
|
||||
" C, D = [torch.zeros((128, 128)).to(torch.float8_e4m3fn).to(\"cuda\") for _ in range(2)]\n",
|
||||
"\n",
|
||||
" # Run the GEMM\n",
|
||||
" plan.run(A, B, C, D, print_module=print_module)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "0466d96796c9cd8f7a1cad264ff326ececc950ba2420e0256d5105fc1a3c6e70"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
253
examples/python/deprecated/01_epilogue.ipynb
Normal file
@@ -0,0 +1,253 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "5d24a692",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Example of using elementwise activation functions in the CUTLASS Python interface\n",
|
||||
"This notebook walks through a basic example of using the CUTLASS Python interface to declare, compile, and run GEMMs with different epilogues.\n",
|
||||
"\n",
|
||||
"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/01_epilogue.ipynb)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "28c916da",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites for running on Colab\n",
|
||||
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0fcea8ea",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7ec60b57",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1db9e51c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#pip install nvidia-cutlass"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "962324fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## General setup\n",
|
||||
"We first import various packages needed for the example and construct the input and output tensors that will be used in our example."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "63a70a3c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"import cutlass\n",
|
||||
"\n",
|
||||
"# This controls whether ther C++ GEMM declaration will be printed at each step. Set to `false` to\n",
|
||||
"# omit this information.\n",
|
||||
"print_module = True\n",
|
||||
"\n",
|
||||
"m = 256\n",
|
||||
"n = m\n",
|
||||
"k = m\n",
|
||||
"\n",
|
||||
"type_A = np.float16\n",
|
||||
"type_B = np.float16\n",
|
||||
"type_C = np.float16\n",
|
||||
"type_D = np.float16\n",
|
||||
"\n",
|
||||
"np.random.seed(1234)\n",
|
||||
"scope_min = -4\n",
|
||||
"scope_max = 4\n",
|
||||
"tensor_A = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, k)).astype(type_A))\n",
|
||||
"tensor_B = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(k, n)).astype(type_B))\n",
|
||||
"tensor_C = np.ceil(np.random.uniform(low=scope_min, high=scope_max, size=(m, n)).astype(type_C))\n",
|
||||
"\n",
|
||||
"alpha = np.float16(1.)\n",
|
||||
"beta = np.float16(0.)\n",
|
||||
"\n",
|
||||
"tensor_D = np.zeros(tensor_C.shape).astype(type_D)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1eb0d95b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Run a GEMM with an identity activation function\n",
|
||||
"To begin, we simply run a default GEMM with an identity activation function. This performs the well-known operation `D = alpha * (A @ B) + beta * C`. This is the default activation function used, and does not need to be specified."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8d257833",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "54961694",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Run a GEMM with a ReLU element-wise activation function\n",
|
||||
"CUTLASS makes it easy to support other element-wise activation functions. This results in performing an element-wise after the generic linear combination performed in a GEMM. If we call such an activation function `act`, the resulting formulation is:\n",
|
||||
"```\n",
|
||||
"D = alpha * (A @ B) + beta * C\n",
|
||||
"D = act(D)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Here, we will add a ReLU activation function. Given an input `x`, ReLU returns `max(x, 0)`.\n",
|
||||
"\n",
|
||||
"This is easy to do in CUTLASS. One only needs to set the plan's `activation` field."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5fe49443",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tensor_D_relu = np.zeros(tensor_C.shape).astype(type_D)\n",
|
||||
"plan.activation = \"relu\"\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D_relu, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "455d0a37",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now verify that the result of the GEMM that used a ReLU activation function:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e32e7798",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"relu_ref = (tensor_D >= 0).astype(type_D) * tensor_D\n",
|
||||
"np.testing.assert_array_equal(relu_ref, tensor_D_relu)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cf959171",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Other element-wise activation functions\n",
|
||||
"CUTLASS supports a variety of widely-used element-wise activation functions. We can obtain a list of these functions via the `get_activations()` method."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9e17d730",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"activations = plan.activations()\n",
|
||||
"for activation in activations:\n",
|
||||
" print(activation)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0e4599fa",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can then run each of them:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9c3598c9",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for activation in activations:\n",
|
||||
" print('=============================================================================================')\n",
|
||||
" print(f'Compiling and running activation {activation}')\n",
|
||||
" print('=============================================================================================')\n",
|
||||
" plan.activation = activation\n",
|
||||
" plan.run(tensor_A, tensor_B, tensor_C, tensor_D, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "18828622",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To add an activation with parameter such as `leaky_relu`, a tuple should be provided containing the activation function name and the (or a list of) parameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "53108eae",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"negative_slope = 0.5\n",
|
||||
"plan.activation = (\"leaky_relu\", negative_slope)\n",
|
||||
"plan.run(tensor_A, tensor_B, tensor_C, tensor_D, print_module=print_module)"
|
||||
]
|
||||
}
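,
{
"cell_type": "markdown",
"id": "b7a1c0d6",
"metadata": {},
"source": [
"As an added sanity check (not part of the original notebook), we can recompute the leaky ReLU result in NumPy. This assumes the epilogue computes `act(alpha * (A @ B) + beta * C)` with `act(x) = x` for `x >= 0` and `negative_slope * x` otherwise, and that accumulation agrees exactly for these small integer-valued tensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a1c0d7",
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: NumPy reference for the leaky ReLU epilogue above\n",
"gemm_ref = (alpha * (tensor_A @ tensor_B)) + (beta * tensor_C)\n",
"leaky_ref = np.where(gemm_ref >= 0, gemm_ref, negative_slope * gemm_ref).astype(type_D)\n",
"np.testing.assert_array_equal(leaky_ref, tensor_D)"
]
}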
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.10"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
300
examples/python/deprecated/02_pytorch_extension_grouped_gemm.ipynb
Normal file
@@ -0,0 +1,300 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "6acbea5d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exporting a CUTLASS grouped GEMM kernel to a PyTorch CUDA extension\n",
|
||||
"This notebook walks through a basic example of using the CUTLASS Python interface to declare\n",
|
||||
"a grouped GEMM kernel and export it as a PyTorch CUDA extension. Note that GEMM and Conv2d can also be exported as PyTorch CUDA extensions. \n",
|
||||
"\n",
|
||||
"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/02_pytorch_extension_grouped_gemm.ipynb)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2d70560e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites for running on Colab\n",
|
||||
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cc7c7458",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2107bb0d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If running on Colab, you will need to install the CUTLASS Python interface and PyTorch. To do so, uncomment the following line and run the cell:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a9852cb8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#pip install nvidia-cutlass torch --extra-index-url https://download.pytorch.org/whl/cu121"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "962324fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Background on grouped GEMM\n",
|
||||
"Grouped GEMM enables one to execute a set of GEMMs (each with potentially different sizes and strides)\n",
|
||||
"in a single CUDA kernel. It can be thought of as a generalized version of a pointer-array GEMM,\n",
|
||||
"without the requirement that the sizes and strides of each GEMM be the same.\n",
|
||||
"\n",
|
||||
"For example, if one has `p` GEMMs with sizes:\n",
|
||||
"```text\n",
|
||||
"M_1 x N_1 x K_1\n",
|
||||
"M_2 x N_2 x K_2\n",
|
||||
"...\n",
|
||||
"M_p x N_p x K_p\n",
|
||||
"```\n",
|
||||
"CUTLASS's grouped GEMM will execute these in a single CUDA kernel.\n",
|
||||
"\n",
|
||||
"Grouped GEMM is particularly beneficial for saturating the GPU with many small problems that would\n",
|
||||
"insufficiently utilize the device in isolation.\n",
|
||||
"\n",
|
||||
"## Declaring a grouped GEMM via the CUTLASS Python interface\n",
|
||||
"A grouped GEMM operation is declared similarly to a GEMM operation in the CUTLASS Python interface: one\n",
|
||||
"simply calls `cutlass.op.GroupedGemm`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "fdcf21d8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import cutlass\n",
|
||||
"import torch\n",
|
||||
"\n",
|
||||
"dtype = torch.float16\n",
|
||||
"plan = cutlass.op.GroupedGemm(element=dtype, layout=cutlass.LayoutType.RowMajor)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "514f40a4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can then compile and run this operation on a group of GEMMs. We'll first set up some utility functions to initialize GEMMs."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c2a7371e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import random\n",
|
||||
"random.seed(2023)\n",
|
||||
"\n",
|
||||
"# Utility function to initialize A, B, C, and D matrices corresponding to dimensions M, N, and K\n",
|
||||
"def initialize(dtype, M, N, K):\n",
|
||||
" sizes = [(M, K), (K, N), (M, N), (M, N)]\n",
|
||||
" return [torch.randint(-3, 3, size, device='cuda').to(dtype) for size in sizes]\n",
|
||||
"\n",
|
||||
"# Utility function to generate `problems` GEMMs of random sizes\n",
|
||||
"def generate_problems(problems):\n",
|
||||
" valid_sizes = [128, 256, 512, 1024]\n",
|
||||
" As, Bs, Cs, Ds = [], [], [], []\n",
|
||||
" for _ in range(problems):\n",
|
||||
" M, N, K = [random.choice(valid_sizes) for _ in range(3)]\n",
|
||||
" A, B, C, D = initialize(dtype, M, N, K)\n",
|
||||
" As.append(A)\n",
|
||||
" Bs.append(B)\n",
|
||||
" Cs.append(C)\n",
|
||||
" Ds.append(D)\n",
|
||||
" return As, Bs, Cs, Ds"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "590a3bc5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We'll next run a group of 20 GEMMs via the CUTLASS Python interface and via PyTorch."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "776c9233",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"As, Bs, Cs, Ds, = generate_problems(20)\n",
|
||||
"\n",
|
||||
"plan.run(As, Bs, Cs, Ds, print_module=True)\n",
|
||||
"Ds_torch = [a @ b for a, b in zip(As, Bs)]\n",
|
||||
"\n",
|
||||
"for d, d_torch in zip(Ds, Ds_torch):\n",
|
||||
" assert torch.allclose(d, d_torch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "766e4f03",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exporting the CUTLASS kernel to a PyTorch CUDA extension\n",
|
||||
"The procedure above allows one to quickly experiment with using a CUTLASS kernels However, one might prefer to use the CUTLASS kernel via a [PyTorch CUDA extension](https://pytorch.org/tutorials/advanced/cpp_extension.html). This will avoids adding any runtime overheads associated with the Python portions of the CUTLASS Python interface.\n",
|
||||
"\n",
|
||||
"The CUTLASS Python interface provides simple solutions for creating PyTorch CUDA extensions for a CUTLASS kernel. These extensions can either be written out for a later \"ahead-of-time\" compilation, or be just-in-time compiled and returned to the user.\n",
|
||||
"\n",
|
||||
"To create a JIT-compiled module from the CUTLASS kernel we defined above, simply call the following:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "3a98dee6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"op = plan.construct()\n",
|
||||
"grouped_gemm = cutlass.emit.pytorch(op, name='grouped_gemm', cc=plan.cc, sourcedir='out', jit=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c8ca3991",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The `cutlass.emit.pytorch` function emits:\n",
|
||||
"* `out/grouped_gemm_kernel.cu`: This file contains the declaration of the CUTLASS kernel and a method to call it from PyTorch tensors\n",
|
||||
"* `out/grouped_gemm.cpp`: This file contains a C++ wrapper around the aforementioned CUTLASS kernel\n",
|
||||
"* `setup.py`: This file contains the `setuptools` script for building and installing the generated extension\n",
|
||||
"\n",
|
||||
"The extension can be build from within the `module_output` directory by running:\n",
|
||||
"```bash\n",
|
||||
"TORCH_CUDA_ARCH_LIST=\"8.0\" python setup.py install\n",
|
||||
"```\n",
|
||||
"Where `TORCH_ARCH_LIST` is set to the compute capability of the device on which the kernel will be run.\n",
|
||||
"\n",
|
||||
"See the PyTorch [\"Custom C++ and CUDA Extensions\"](https://pytorch.org/tutorials/advanced/cpp_extension.html) tutorial for more details on this.\n",
|
||||
"\n",
|
||||
"The PyTorch CUDA extension could be built for this module by running:\n",
|
||||
"```bash\n",
|
||||
"cd out\n",
|
||||
"TORCH_CUDA_ARCH_LIST=\"8.0\" python setup.py\n",
|
||||
"```\n",
|
||||
"(assuming that one is building for SM80)\n",
|
||||
"\n",
|
||||
"One could then use the kernel in a later PyTorch module by running:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"import torch\n",
|
||||
"import grouped_gemm\n",
|
||||
"\n",
|
||||
"grouped_gemm.run(As, Bs)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"In this case, however, we set `jit=True`, which specifies that we would like to compile and load the PyTorch CUDA extension on the fly.\n",
|
||||
"Under the hood, this leverages the [torch.utils.cpp_extension.load](https://pytorch.org/tutorials/advanced/cpp_extension.html) method\n",
|
||||
"and returns back the loaded extension.\n",
|
||||
"\n",
|
||||
"We can then use the extension and compare its results to running the GEMMs via vanilla PyTorch GEMMs:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cecb26a4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"Ds = grouped_gemm.run(As, Bs)\n",
|
||||
"Ds_torch = [a @ b for a, b in zip(As, Bs)]\n",
|
||||
"for d, d_torch in zip(Ds, Ds_torch):\n",
|
||||
" assert torch.allclose(d, d_torch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "50db80e4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we can profile our grouped GEMM extension:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b76805d3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"num_warmup = 20\n",
|
||||
"num_profile = 100\n",
|
||||
"\n",
|
||||
"# Warmup iterations\n",
|
||||
"for _ in range(num_warmup):\n",
|
||||
" Ds = grouped_gemm.run(As, Bs)\n",
|
||||
" Ds_torch = [a @ b for a, b in zip(As, Bs)]\n",
|
||||
" torch.cuda.synchronize()\n",
|
||||
"\n",
|
||||
"# Timing iterations\n",
|
||||
"import time\n",
|
||||
"grouped = 0\n",
|
||||
"nongrouped = 0\n",
|
||||
"for _ in range(num_profile):\n",
|
||||
" start = time.time()\n",
|
||||
" Ds = grouped_gemm.run(As, Bs)\n",
|
||||
" torch.cuda.synchronize()\n",
|
||||
" grouped += time.time() - start\n",
|
||||
"\n",
|
||||
" start = time.time()\n",
|
||||
" Ds_torch = [a @ b for a, b in zip(As, Bs)]\n",
|
||||
" torch.cuda.synchronize()\n",
|
||||
" nongrouped += time.time() - start\n",
|
||||
"\n",
|
||||
"print('Grouped: {:.3f} us'.format(grouped * 1e6/num_profile))\n",
|
||||
"print('Non-Grouped: {:.3f} us'.format(nongrouped * 1e6/num_profile))\n",
|
||||
"print('Speedup: {:.3f}'.format(nongrouped / grouped))"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.10"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
465
examples/python/deprecated/03_basic_conv2d.ipynb
Normal file
@@ -0,0 +1,465 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Basic example of using the CUTLASS Python interface for Conv2d\n",
|
||||
"\n",
|
||||
"This notebook walks through a basic example of using the CUTLASS Python interface to declare, compile, and run Conv2d. \n",
|
||||
"\n",
|
||||
"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/03_basic_conv2d.ipynb)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites for running on Colab\n",
|
||||
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!#pip install nvidia-cutlass"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## General setup\n",
|
||||
"We first import various packages needed for the example and construct the input and output tensors that will be used in our example."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import torch\n",
|
||||
"import random\n",
|
||||
"\n",
|
||||
"import cutlass\n",
|
||||
"\n",
|
||||
"# This controls whether the C++ GEMM declaration will be printed at each step. \n",
|
||||
"# Set to `false` to omit this information.\n",
|
||||
"print_module = True\n",
|
||||
"\n",
|
||||
"# Input tensor: [N, H, W, C] under the channel-last layout\n",
|
||||
"N, H, W, C = [32, 28, 28, 64]\n",
|
||||
"\n",
|
||||
"# Weight tensor: [K, R, S, C] under the channel-last layout\n",
|
||||
"K, R, S = [128, 3, 3]\n",
|
||||
"\n",
|
||||
"# Stride, and padding\n",
|
||||
"stride = (2, 2)\n",
|
||||
"padding = (1, 1)\n",
|
||||
"dilation = (1, 1)\n",
|
||||
"\n",
|
||||
"# Compute the output size [N, P, Q, K]\n",
|
||||
"N, P, Q, K = cutlass.Conv2d.output_size((N, H, W, C), (K, R, S, C), padding, stride, dilation)\n",
|
||||
"\n",
|
||||
"dtype = torch.float16\n",
|
||||
"type_A = torch.float16\n",
|
||||
"type_B = torch.float16\n",
|
||||
"type_C = torch.float16\n",
|
||||
"type_D = torch.float16\n",
|
||||
"\n",
|
||||
"torch.manual_seed(1234)\n",
|
||||
"\n",
|
||||
"input = torch.ceil(\n",
|
||||
" torch.empty(size=(N, C, H, W), dtype=type_A, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)\n",
|
||||
"weight = torch.ceil(\n",
|
||||
" torch.empty(size=(K, C, R, S), dtype=type_B, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)\n",
|
||||
"tensor_C = torch.ceil(\n",
|
||||
" torch.empty(size=(N, K, P, Q), dtype=type_B, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)\n",
|
||||
"output = torch.zeros_like(tensor_C)\n",
|
||||
"\n",
|
||||
"alpha = 1.0\n",
|
||||
"beta = 0.0"
|
||||
]
|
||||
},
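{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an added reference (not part of the original notebook), the output spatial sizes follow the usual convolution arithmetic, which `cutlass.Conv2d.output_size` computes for us above:\n",
"```\n",
"P = (H + 2 * pad_h - dil_h * (R - 1) - 1) // stride_h + 1\n",
"Q = (W + 2 * pad_w - dil_w * (S - 1) - 1) // stride_w + 1\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added check: recompute the output size with the standard formula\n",
"P_ref = (H + 2 * padding[0] - dilation[0] * (R - 1) - 1) // stride[0] + 1\n",
"Q_ref = (W + 2 * padding[1] - dilation[1] * (S - 1) - 1) // stride[1] + 1\n",
"assert (P, Q) == (P_ref, Q_ref)"
]
},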
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Declaring and running a Conv2d Fprop\n",
|
||||
"\n",
|
||||
"We first show you how to run a Conv2d in the forward propagation. To get started, one only needs to provide the tensors declared above to the `cutlass.op.Conv2dFprop` call. This sets up a default Conv2d fprop operation for the given device on which you are running. \n",
|
||||
"\n",
|
||||
"Assuming that we are runing on SM80, the default is a Conv2d that leverages FP16 Tensor Core operations.\n",
|
||||
"\n",
|
||||
"Calling `plan.run()` will generate the CUTLASS C++ kernel in question, compile it, and run it on the tensors we previously passed in. By setting `print_module` to `true`, the C++ code that is emitted is printed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Specifying `element_accumulator` is not required if it is the same as `element`\n",
|
||||
"plan = cutlass.Conv2dFprop(element=dtype, element_accumulator=torch.float32)\n",
|
||||
"plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta, print_module=print_module)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"There are many other ways to construct a plan from `cutlass.op.Conv2dFprop` (e.g., by specifying the types of each operand, by providing representative tensors as input). For more details on these, see the documentation in the `cutlass.op.Conv2dFprop` constructor.\n",
|
||||
"\n",
|
||||
"We then compare the output to running the Conv2d using PyTorch. PyTorch use NCHW layout by default, so permutations are required."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"output_torch = alpha * torch.ops.aten.conv2d(\n",
|
||||
" input, weight, stride=stride, padding=padding, dilation=dilation\n",
|
||||
") + beta * tensor_C\n",
|
||||
"\n",
|
||||
"assert torch.equal(output_torch, output)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that one could use the same kernel just declared for tensors provided by other frameworks beyond PyTorch, such as NumPy."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Declaring and running Conv2d Dgrad and Wgrad\n",
|
||||
"\n",
|
||||
"The Python interface also supports declaring and running backward kernels of Conv2d. To begin with, we construct the tensors for the gradient of input, output, and weight."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"grad_output = torch.ceil(\n",
|
||||
" torch.empty(size=(N, K, P, Q), dtype=type_A, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)\n",
|
||||
"grad_input = torch.zeros_like(input)\n",
|
||||
"grad_weight = torch.zeros_like(weight)\n",
|
||||
"\n",
|
||||
"tensor_C_dgrad = torch.ceil(\n",
|
||||
" torch.empty(size=(N, C, H, W), dtype=type_A, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)\n",
|
||||
"tensor_C_wgrad = torch.ceil(\n",
|
||||
" torch.empty(size=(K, C, R, S), dtype=type_B, device=\"cuda\").uniform_(-4.5, 3.5)\n",
|
||||
").to(memory_format=torch.channels_last)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The script below gives a simple example of computing a data gradient via the CUTLASS Python interface and via PyTorch."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan_dgrad = cutlass.Conv2dDgrad(element=dtype, element_accumulator=torch.float32)\n",
|
||||
"plan_dgrad.run(grad_output, weight, tensor_C_dgrad, grad_input, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"\n",
|
||||
"grad_input_torch = alpha * torch.nn.grad.conv2d_input(\n",
|
||||
" (N, C, H, W),\n",
|
||||
" weight, grad_output,\n",
|
||||
" stride=stride, padding=padding\n",
|
||||
") + beta * tensor_C_dgrad\n",
|
||||
"\n",
|
||||
"assert torch.equal(grad_input_torch, grad_input)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The script below gives a simple example of computing a weight gradient via the CUTLASS Python interface and via PyTorch."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan_wgrad = cutlass.Conv2dWgrad(element=dtype, element_accumulator=torch.float32)\n",
|
||||
"plan_wgrad.run(grad_output, input, tensor_C_wgrad, grad_weight, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"\n",
|
||||
"grad_weight_torch = alpha * torch.nn.grad.conv2d_weight(\n",
|
||||
" input, (K, C, R, S), grad_output,\n",
|
||||
" stride=stride, padding=padding\n",
|
||||
") + beta * tensor_C_wgrad\n",
|
||||
"\n",
|
||||
"assert torch.equal(grad_weight_torch, grad_weight)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Running non-default Conv2ds\n",
|
||||
"\n",
|
||||
"The previous examples showed how it is simple to get starting running a default Conv2d kernel in CUTLASS. But, what do you do if you want a bit more control over the parameters to the Conv2d? CUTLASS Python interface exposes mutable parameters that can be set after the `plan` initialization. We summarize these in the table below.\n",
|
||||
"\n",
|
||||
"|Parameter|Description|\n",
|
||||
"| -- | -- |\n",
|
||||
"|`tile_description`|The threadblock tile size, warp count, software pipeline stages, and instruction shape|\n",
|
||||
"|`iterator_algorithm`|The iterator algorithm used to access the source operands|\n",
|
||||
"|`swizzling_stride`|The stride of the threadblock swizzling functor|\n",
|
||||
"|`split-K`|Partitions the reduction dimension to different threadblocks|"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Tile Description\n",
|
||||
"\n",
|
||||
"The `tile_description` defines the tiling size of each threadblock, the warp count along each dimension of the tile, the software pipeline stages, and the instruction size. Under the hood, CUTLASS enumerates the different Conv2d configuration parameters for this kernel from the CUTLASS profiler. The code below shows how one can access the tile descriptions for the kernel (e.g., threadblock and warp shape)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan.opclass = \"tensor_op\"\n",
|
||||
"tiles = plan.tile_descriptions()\n",
|
||||
"print(f'{len(tiles)} tile descriptions returned')\n",
|
||||
"num_print = 10\n",
|
||||
"print(f'First {num_print} tile descriptions are:')\n",
|
||||
"for td in tiles[:num_print]:\n",
|
||||
" print(td)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we'll pick one of these configurations at random and compile and run it."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"random.seed(42)\n",
|
||||
"idx = random.randint(0, len(tiles)-1)\n",
|
||||
"td = tiles[idx]\n",
|
||||
"print(f'Tile description {idx} is: {td}')\n",
|
||||
"plan.tile_description = td\n",
|
||||
"plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"assert torch.equal(output_torch, output)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Besides tile descriptions enumerated by CUTLASS, the users can also explicitly set the `threadblockshape`, `warp_shape`, `stages`, `instruction_shape`, and `cluster_shape`. If the configuration is invalid, an exception will be raised at `plan.run()` and the detailed compilation error will be stored in `./cutlass_python_compilation_error.txt` for debugging."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"if plan.cc == 70:\n",
|
||||
" plan.tile_description = {\n",
|
||||
" \"threadblock_shape\": [64, 256, 32],\n",
|
||||
" \"warp_count\": [1, 4, 1],\n",
|
||||
" \"stages\": 2,\n",
|
||||
" \"instruction_shape\": [8, 8, 4], # optional,\n",
|
||||
" \"cluster_shape\": [1, 1, 1] # optional, only [1, 1, 1] is supported currently\n",
|
||||
" }\n",
|
||||
"elif plan.cc == 75:\n",
|
||||
" plan.tile_description = {\n",
|
||||
" \"threadblock_shape\": [128, 64, 32],\n",
|
||||
" \"warp_count\": [2, 1, 1],\n",
|
||||
" \"stages\": 2,\n",
|
||||
" \"instruction_shape\": [16, 8, 8], # optional,\n",
|
||||
" \"cluster_shape\": [1, 1, 1] # optional, only [1, 1, 1] is supported currently\n",
|
||||
" }\n",
|
||||
"elif plan.cc == 80:\n",
|
||||
" plan.tile_description = {\n",
|
||||
" \"threadblock_shape\": [128, 128, 64],\n",
|
||||
" \"warp_count\": [2, 2, 1],\n",
|
||||
" \"stages\": 4,\n",
|
||||
" \"instruction_shape\": [16, 8, 16], # optional,\n",
|
||||
" \"cluster_shape\": [1, 1, 1] # optional, only [1, 1, 1] is supported currently\n",
|
||||
" }\n",
|
||||
"elif plan.cc == 86:\n",
|
||||
" plan.tile_description = {\n",
|
||||
" \"threadblock_shape\": [128, 64, 64],\n",
|
||||
" \"warp_count\": [2, 2, 1],\n",
|
||||
" \"stages\": 3,\n",
|
||||
" \"instruction_shape\": [16, 8, 16],\n",
|
||||
" \"cluster_shape\": [1, 1, 1]\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
"plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"assert torch.equal(output_torch, output)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Iterator Algorithm\n",
|
||||
"\n",
|
||||
"The iterator algorithm describes how sources are loaded from memory. There are some iterator algorithms optimized for specific alignments and input/output channels that have better performance. The table below illustrates the available iterator algorithms.\n",
|
||||
"\n",
|
||||
"|Conv Kind | Iterator Algorithm | Description |\n",
|
||||
"| -- | -- | -- |\n",
|
||||
"|Fprop | \"analytic\" | Functionally correct in all cases but lower performance |\n",
|
||||
"| | \"optimized\" | Optimized for and requires `R <= 32`, `S<= 32`, and `C % alignment_input == 0`|\n",
|
||||
"| | \"few_channels\" | optimized for small `C` and requires `C % alignment_input == 0`|\n",
|
||||
"| | \"fixed_channels\" | optimized for small `C` and requires `C == alignment_input` |\n",
|
||||
"|Dgrad | \"analytic\" | Functionally correct in all cases but lower performance |\n",
|
||||
"| | \"optimized\" | Optimzed for and require `R <= 32`, `S<= 32`, `K % alignment_grad_output == 0`, and `C % alignment_weight == 0`|\n",
|
||||
"|Wgrad | \"analytic\" | Functionally correct in all cases but lower performance |\n",
|
||||
"| | \"optimized\" | Optimized for and require `K % alignment_grad_output == 0`, and `C % alignment_input == 0`|\n",
|
||||
"\n",
|
||||
"By default, the Python interface will automatically propose a suitable iterator algorithm based on the input tensors in `plan.run()`. However, the user can also specify the desired iterator algorithm as follows"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan.iterator_algorithm = \"analytic\"\n",
|
||||
"plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"assert torch.equal(output_torch, output)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If the iterator algorithm is invalid for the problem size in `plan.run()`, an exception will be raised."
|
||||
]
|
||||
},
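{
"cell_type": "markdown",
"metadata": {},
"source": [
"The added sketch below (not part of the original notebook) provokes and catches such an error. It assumes that \"fixed_channels\" is invalid here because it requires `C == alignment_input`, which does not hold for `C = 64`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: catch the exception raised for an unsuitable iterator algorithm\n",
"# (assumption based on the table above: \"fixed_channels\" needs C == alignment_input)\n",
"try:\n",
" plan.iterator_algorithm = \"fixed_channels\"\n",
" plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta)\n",
"except Exception as e:\n",
" print('Caught expected exception: {}'.format(e))\n",
"plan.iterator_algorithm = \"optimized\"  # restore a valid algorithm for later cells"
]
},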
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Swizzling Stride\n",
|
||||
"The swizzling changes how the tile are mapped to threadblocks to improve the L2 Locality. Given a swizzling stride `N`, the threadblock `(tb_x, tb_y)` computes tile `(tb_x / N, tb_y * N + (tb_x % N))`. Currently, stride values of `1`, `2`, `4`, and `8` are supported for `fprop`, `wgrad`, and `1`, and `4` for `dgrad`. The swizzling stride can be set with:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"plan.swizzling_stride = 4\n",
|
||||
"plan.run(input, weight, tensor_C, output, stride, padding, dilation, alpha, beta, print_module=print_module)\n",
|
||||
"assert torch.equal(output_torch, output)"
|
||||
]
|
||||
},
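{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the mapping concrete, the added cell below (not part of the original notebook) evaluates the swizzle formula from above in plain Python:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added illustration: the tile computed by each threadblock for swizzling stride N\n",
"N_swizzle = 4\n",
"tb_y = 0\n",
"for tb_x in range(8):\n",
" print('threadblock ({}, {}) -> tile ({}, {})'.format(\n",
" tb_x, tb_y, tb_x // N_swizzle, tb_y * N_swizzle + tb_x % N_swizzle))"
]
},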
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Split-K\n",
|
||||
"Split-K is usually applied when the Conv2d has small spatial dimensions and large reduction dimension to ensure good utilization. It further partitions the reduction dimension to different threadblocks. The CUTLASS Python interface supports two types of split-K strategies: `Parallel`, and `Serial`. \n",
|
||||
"* `Parallel`: the partial results from different threadblocks are stored in a temporary buffer in the global memory. When the Conv2d is done, a separate reduction kernel is created and launched to reduce the partial results.\n",
|
||||
"* `Serial`: A semaphore is used to coordinate the order of different threadblocks adding their partial results to a given output tile. A separate kernel does not need to be launched for prforming the reduction.\n",
|
||||
"\n",
|
||||
"While all `fprop`, `dgrad`, and `wgrad` support split-K, here we use `wgrad` as an example. "
|
||||
]
|
||||
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Parallel Split-K with 5 slices\n",
"grad_weight_parallel = torch.zeros_like(grad_weight)\n",
"plan_wgrad.run(\n",
"    grad_output, input, tensor_C_wgrad, grad_weight_parallel, \n",
"    stride, padding, dilation, alpha, beta, print_module=print_module, split_k=(\"parallel\", 5))\n",
"assert torch.equal(grad_weight_torch, grad_weight_parallel)\n",
"\n",
"# Serial Split-K with 3 slices\n",
"grad_weight_serial = torch.zeros_like(grad_weight)\n",
"plan_wgrad.run(\n",
"    grad_output, input, tensor_C_wgrad, grad_weight_serial, \n",
"    stride, padding, dilation, alpha, beta, print_module=print_module, split_k=(\"serial\", 3))\n",
"assert torch.equal(grad_weight_torch, grad_weight_serial)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
258
examples/python/deprecated/04_epilogue_visitor.ipynb
Normal file
258
examples/python/deprecated/04_epilogue_visitor.ipynb
Normal file
@ -0,0 +1,258 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "5d24a692",
"metadata": {},
"source": [
"# Example of using epilogue visitor in the CUTLASS Python interface\n",
"This notebook walks through a basic example of using the CUTLASS Python interface to declare, compile, and run GEMMs with different epilogues through the CUTLASS Epilogue Visitor.\n",
"\n",
"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/04_epilogue_visitor.ipynb)\n"
]
},
{
"cell_type": "markdown",
"id": "3a800e79",
"metadata": {},
"source": [
"## Prerequisites for running on Colab\n",
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cfff2c8",
"metadata": {},
"outputs": [],
"source": [
"!#nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "06706f00",
"metadata": {},
"source": [
"If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "491a7314",
"metadata": {},
"outputs": [],
"source": [
"!#pip install nvidia-cutlass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "962324fd",
"metadata": {},
"source": [
"## General setup\n",
"We first import various packages needed for the example and construct the input and output tensors that will be used in our example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63a70a3c",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import cutlass\n",
"from cutlass.epilogue import relu\n",
"from cutlass import Tensor as FakeTensor\n",
"from cutlass.utils.profiler import CUDAEventProfiler\n",
"\n",
"# This controls whether the C++ GEMM declaration will be printed at each step. Set to `False` to\n",
"# omit this information.\n",
"print_module = True\n",
"\n",
"# The Epilogue Visitor feature currently only works for SM80 and SM90\n",
"from cutlass.backend.utils.device import device_cc\n",
"if device_cc() not in [80, 90]:\n",
"    import sys\n",
"    sys.exit()\n",
"\n",
"m = 16384\n",
"n = m\n",
"k = 512\n",
"\n",
"type_A = torch.float16\n",
"type_B = torch.float16\n",
"type_C = torch.float16\n",
"type_D = torch.float16\n",
"\n",
"torch.manual_seed(2023)\n",
"scope_min = -4\n",
"scope_max = 4\n",
"tensor_A = torch.ceil(torch.empty(size=(m, k), dtype=type_A, device=\"cuda\").uniform_(scope_min, scope_max))\n",
"tensor_B = torch.ceil(torch.empty(size=(k, n), dtype=type_B, device=\"cuda\").uniform_(scope_min, scope_max))\n",
"tensor_C = torch.ceil(torch.empty(size=(m, n), dtype=type_C, device=\"cuda\").uniform_(scope_min, scope_max))\n",
"tensor_D = torch.zeros_like(tensor_C)\n",
"\n",
"plan = cutlass.op.Gemm(element=torch.float16, layout=cutlass.LayoutType.RowMajor, element_accumulator=torch.float32)"
]
},
{
"cell_type": "markdown",
"id": "1eb0d95b",
"metadata": {},
"source": [
"## Define the epilogue visitor functor\n",
"The epilogue functor can be defined as a simple Python function plus a set of example tensors for its inputs and outputs. The example below illustrates a complex epilogue with a directed-acyclic-graph structure (`F` is used twice). The epilogue takes source tensors of different ranks: `alpha` and `beta` are scalars, `bias` is a column vector to broadcast, and `C` and `aux` are matrices. It contains various math operations, from basic arithmetic to built-in callables like `relu`, and it accommodates multiple outputs `D` and `F`. Note that there are some restrictions on syntax.\n",
"* Each named variable must be assigned exactly once and defined before it is used.\n",
"* Reserved names: `accum`, `C`, and `D` are reserved for the accumulator, `tensor_C`, and `tensor_D`.\n",
"* Each return value must be a named variable.\n",
"\n",
"The example tensors are given as a dictionary with tensor names as keys and reference tensors as values. The reference tensors can be `float`, `torch.Tensor`, `numpy.ndarray`, or our `FakeTensor`; they provide the shape and data type information for the inputs and outputs of the epilogue.\n",
"\n",
"The epilogue can then be generated simply through `cutlass.epilogue.trace(<epilogue function>, <example_tensors>)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d257833",
"metadata": {},
"outputs": [],
"source": [
"# Define epilogue visitor\n",
"def example_epilogue(accum, alpha, C, beta, aux, bias):\n",
"    F = alpha * accum + (beta * C + aux)\n",
"    E = relu(F + 1) + bias\n",
"    D = E + F\n",
"    return D, F\n",
"\n",
"# Construct inputs and outputs\n",
"alpha = 0.5\n",
"beta = 0.5\n",
"aux = torch.ceil(torch.empty(size=(m, n), dtype=type_C, device=\"cuda\").uniform_(scope_min, scope_max))\n",
"bias = torch.ceil(torch.empty(size=(m, 1), dtype=type_C, device=\"cuda\").uniform_(scope_min, scope_max))\n",
"tensor_F = torch.zeros_like(tensor_D)\n",
"examples_tensors = {\n",
"    \"accum\": FakeTensor(element=torch.float32, shape=(m, n), layout_tag=cutlass.LayoutType.RowMajor),\n",
"    \"alpha\": alpha,\n",
"    \"C\": tensor_C,\n",
"    \"beta\": beta,\n",
"    \"aux\": aux,\n",
"    \"bias\": bias,\n",
"    \"D\": tensor_D,\n",
"    \"F\": tensor_F\n",
"}\n",
"\n",
"# Trace the epilogue visitor\n",
"epilogue_visitor = cutlass.epilogue.trace(example_epilogue, examples_tensors)"
]
},
{
"cell_type": "markdown",
"id": "54961694",
"metadata": {},
"source": [
"## Run a GEMM with the epilogue visitor functor\n",
"The `epilogue_visitor` can be used by setting the plan's `epilogue_visitor` field. The arguments for the epilogue visitor are provided as a `dict` through the `visitor_args` keyword argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fe49443",
"metadata": {},
"outputs": [],
"source": [
"visitor_args = {\n",
"    \"alpha\": alpha, \"C\": tensor_C, \"beta\": beta, \n",
"    \"aux\": aux, \"bias\": bias, \"D\": tensor_D, \"F\": tensor_F\n",
"}\n",
"\n",
"plan.epilogue_visitor = epilogue_visitor\n",
"plan.run(\n",
"    tensor_A, tensor_B, tensor_C, tensor_D, \n",
"    visitor_args=visitor_args, print_module=print_module)"
]
},
{
"cell_type": "markdown",
"id": "455d0a37",
"metadata": {},
"source": [
"The epilogue function `example_epilogue` can also serve as a reference implementation, so we can verify the results simply with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e32e7798",
"metadata": {},
"outputs": [],
"source": [
"class TorchReference(torch.nn.Module):\n",
"    def forward(self, A, B, alpha, C, beta, aux, bias):\n",
"        accum = torch.matmul(A, B)\n",
"        return example_epilogue(accum, alpha, C, beta, aux, bias)\n",
"\n",
"torch_reference = TorchReference()\n",
"tensor_D_ref, tensor_F_ref = torch_reference(tensor_A, tensor_B, alpha, tensor_C, beta, aux, bias)\n",
"\n",
"assert torch.equal(tensor_D, tensor_D_ref)\n",
"assert torch.equal(tensor_F, tensor_F_ref)"
]
},
{
"cell_type": "markdown",
"id": "b69e441f",
"metadata": {},
"source": [
"The performance of the CUTLASS fused kernel can be profiled with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8db92150",
"metadata": {},
"outputs": [],
"source": [
"warmup_iterations = 10\n",
"profile_iterations = 50\n",
"# Profile CUTLASS fused kernel\n",
"duration = CUDAEventProfiler(\n",
"    plan, warmup_iterations, profile_iterations,\n",
"    tensor_A, tensor_B, tensor_C, tensor_D, \n",
"    visitor_args=visitor_args)()\n",
"\n",
"print(f\"CUTLASS duration: {duration:.2f} ms\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
54
examples/python/deprecated/README.md
Normal file
54
examples/python/deprecated/README.md
Normal file
@ -0,0 +1,54 @@
# Examples of using the CUTLASS Python interface

* [00_basic_gemm](/examples/python/00_basic_gemm.ipynb)

  Shows how to declare, configure, compile, and run a CUTLASS GEMM using the Python interface

* [01_epilogue](/examples/python/01_epilogue.ipynb)

  Shows how to fuse elementwise activation functions to GEMMs via the Python interface

* [02_pytorch_extension_grouped_gemm](/examples/python/02_pytorch_extension_grouped_gemm.ipynb)

  Shows how to declare, compile, and run a grouped GEMM operation via the Python interface,
  along with how the emitted kernel can be easily exported to a PyTorch CUDA extension.

* [03_basic_conv2d](/examples/python/03_basic_conv2d.ipynb)

  Shows how to declare, configure, compile, and run a CUTLASS Conv2d using the Python interface

* [04_epilogue_visitor](/examples/python/04_epilogue_visitor.ipynb)

  Shows how to fuse elementwise activation functions to GEMMs via the Python Epilogue Visitor interface

# Copyright

Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```