cutlass/examples/python/CuTeDSL/notebooks/tensor.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cutlass\n",
    "import cutlass.cute as cute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tensor\n",
    "\n",
    "A tensor in CuTe is created through the composition of two key components:\n",
    "\n",
    "1. An **Engine** (E) - A random-access, pointer-like object that supports:\n",
    "   - Offset operation: `e + d → e` (offset engine by elements of a layout's codomain)\n",
    "   - Dereference operation: `*e → v` (dereference engine to produce value)\n",
    "\n",
    "2. A **Layout** (L) - Defines the mapping from coordinates to offsets\n",
    "\n",
    "A tensor is formally defined as the composition of an engine E with a layout L, expressed as `T = E ∘ L`. When evaluating a tensor at coordinate c, it:\n",
    "\n",
    "1. Maps the coordinate c to the codomain using the layout\n",
    "2. Offsets the engine accordingly\n",
    "3. Dereferences the result to obtain the tensor's value\n",
    "\n",
    "This can be expressed mathematically as:\n",
    "\n",
    "```\n",
    "T(c) = (E ∘ L)(c) = *(E + L(c))\n",
    "```\n",
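    "\n",
    "As a rough illustration, the evaluation rule can be modeled in plain Python without CuTe. In this sketch `ListEngine`, `layout_offset`, and `evaluate` are hypothetical names, the engine is assumed to be a Python list plus a base offset, and the layout is assumed to be flat with integer strides:\n",
    "\n",
    "```python\n",
    "# Pure-Python model of T(c) = *(E + L(c)); not the CuTe DSL API.\n",
    "def layout_offset(coord, stride):\n",
    "    # L(c): inner product of the coordinate with the strides\n",
    "    return sum(c * s for c, s in zip(coord, stride))\n",
    "\n",
    "class ListEngine:\n",
    "    # Hypothetical engine: backing list plus a base offset\n",
    "    def __init__(self, data, base=0):\n",
    "        self.data, self.base = data, base\n",
    "\n",
    "    def __add__(self, d):\n",
    "        # Offset operation: e + d -> e\n",
    "        return ListEngine(self.data, self.base + d)\n",
    "\n",
    "    def deref(self):\n",
    "        # Dereference operation: *e -> v\n",
    "        return self.data[self.base]\n",
    "\n",
    "def evaluate(engine, stride, coord):\n",
    "    # T(c) = *(E + L(c))\n",
    "    return (engine + layout_offset(coord, stride)).deref()\n",
    "\n",
    "data = list(range(40))      # storage for an (8,5) tensor\n",
    "engine = ListEngine(data)\n",
    "stride = (5, 1)             # layout (8,5):(5,1)\n",
    "value = evaluate(engine, stride, (2, 4))  # offset 2*5 + 4*1 = 14\n",
    "```\n",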
    "\n",
    "## Example Usage\n",
    "\n",
    "Here's a simple example that creates a tensor from a pointer and the layout `(8,5):(5,1)`, then fills it with ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cute.jit\n",
    "def create_tensor_from_ptr(ptr: cute.Pointer):\n",
    "    layout = cute.make_layout((8, 5), stride=(5, 1))\n",
    "    tensor = cute.make_tensor(ptr, layout)\n",
    "    tensor.fill(1)\n",
    "    cute.print_tensor(tensor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This creates a tensor where:\n",
    "- The engine is a pointer\n",
    "- The layout has shape `(8, 5)` and stride `(5, 1)`\n",
    "- The resulting tensor can be evaluated using coordinates defined by the layout\n",
    "\n",
    "We can test this by allocating a buffer with torch and passing a pointer to the torch tensor's data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(raw_ptr(0x000000000736b0c0: f32, generic, align<4>) o (8,5):(5,1), data=\n",
      "       [[ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ],\n",
      "        [ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ],\n",
      "        [ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ],\n",
      "        ...\n",
      "        [ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ],\n",
      "        [ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ],\n",
      "        [ 1.000000,  1.000000,  1.000000,  1.000000,  1.000000, ]])\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "from cutlass.torch import dtype as torch_dtype\n",
    "import cutlass.cute.runtime as cute_rt\n",
    "\n",
    "a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))\n",
    "ptr_a = cute_rt.make_ptr(cutlass.Float32, a.data_ptr())\n",
    "\n",
    "create_tensor_from_ptr(ptr_a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DLPack support\n",
    "\n",
    "CuTe DSL supports the DLPack protocol natively. This makes it easy to integrate with frameworks \n",
    "that support DLPack, e.g. torch, numpy, jax, tensorflow, etc.\n",
    "\n",
    "For more information, please refer to the DLPack project: https://github.com/dmlc/dlpack\n",
    "\n",
    "`from_dlpack` converts any tensor or ndarray object that supports `__dlpack__` and `__dlpack_device__`.\n"
   ]
  },
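  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the protocol itself concrete, the sketch below uses only NumPy (no CuTe) and assumes a NumPy release that provides `np.from_dlpack`. It checks that an array exposes both protocol methods and that an import through DLPack shares memory with the source rather than copying it:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "a = np.arange(40, dtype=np.float32).reshape(8, 5)\n",
    "\n",
    "# Both protocol methods are present on NumPy arrays\n",
    "has_protocol = hasattr(a, '__dlpack__') and hasattr(a, '__dlpack_device__')\n",
    "\n",
    "# Import through the protocol; the result views the same buffer\n",
    "b = np.from_dlpack(a)\n",
    "\n",
    "# A write through the source is visible through the imported array\n",
    "a[2, 4] = 101.0\n",
    "shared = bool(np.shares_memory(a, b)) and float(b[2, 4]) == 101.0\n",
    "```"
   ]
  },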
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cutlass.cute.runtime import from_dlpack\n",
    "\n",
    "@cute.jit\n",
    "def print_tensor_dlpack(src: cute.Tensor):\n",
    "    print(src)\n",
    "    cute.print_tensor(src)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor<ptr<f32, generic> o (8,5):(5,1)>\n",
      "tensor(raw_ptr(0x0000000007559340: f32, generic, align<4>) o (8,5):(5,1), data=\n",
      "       [[-1.151769,  1.019397, -0.371175, -0.717776,  0.502176, ],\n",
      "        [ 0.114282,  0.900084,  0.320770,  1.564574, -0.632329, ],\n",
      "        [-0.570140,  0.178112, -0.423079,  1.936198,  0.003355, ],\n",
      "        ...\n",
      "        [-2.425393, -0.275528,  1.267157, -0.811101, -0.985456, ],\n",
      "        [ 0.777889, -2.114074,  0.357184, -0.321312, -0.938138, ],\n",
      "        [ 1.959564,  1.797602,  0.116901,  0.306198, -1.837295, ]])\n"
     ]
    }
   ],
   "source": [
    "a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))\n",
    "\n",
    "print_tensor_dlpack(from_dlpack(a))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor<ptr<f32, generic> o (8,8):(8,1)>\n",
      "tensor(raw_ptr(0x0000000007979da0: f32, generic, align<4>) o (8,8):(8,1), data=\n",
      "       [[ 0.122739, -0.605744, -1.442022, ..., -0.356501, -0.993329, -0.091110, ],\n",
      "        [ 0.278448,  0.318482, -0.276867, ...,  1.542181, -1.701539, -0.309454, ],\n",
      "        [ 0.563565, -0.753936,  0.131214, ...,  0.437912, -0.482277, -0.051540, ],\n",
      "        ...\n",
      "        [-1.974096, -0.177881,  0.426807, ..., -1.579115, -0.304974,  0.451164, ],\n",
      "        [ 0.149851, -0.704689, -0.295063, ..., -0.653001,  0.008871,  0.903916, ],\n",
      "        [ 1.188619,  1.519662,  1.270734, ...,  0.404082,  0.173200,  0.093476, ]])\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "a = np.random.randn(8, 8).astype(np.float32)\n",
    "\n",
    "print_tensor_dlpack(from_dlpack(a))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tensor Evaluation Methods\n",
    "\n",
    "Tensors support two primary methods of evaluation:\n",
    "\n",
    "### 1. Full Evaluation\n",
    "When a tensor is evaluated with a complete coordinate c, it computes the offset, applies it to the engine, \n",
    "and dereferences it to return the stored value. This is the straightforward case where you want to access \n",
    "a specific element of the tensor.\n",
    "\n",
    "### 2. Partial Evaluation (Slicing)\n",
    "When evaluating with an incomplete coordinate c = c' ⊕ c* (where c* represents the unspecified portion), \n",
    "the result is a new tensor which is a slice of the original tensor with its engine offset to account for \n",
    "the coordinates that were provided. This operation can be expressed as:\n",
    "\n",
    "```\n",
    "T(c) = (E ∘ L)(c) = (E + L(c')) ∘ L(c*) = T'(c*)\n",
    "```\n",
    "\n",
    "Slicing effectively reduces the dimensionality of the tensor, creating a sub-tensor that can be \n",
    "further evaluated or manipulated."
   ]
  },
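  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Partial evaluation can be sketched in plain Python in the same spirit. This is a model, not the CuTe DSL API: `slice_tensor` is a hypothetical helper, `None` is this sketch's marker for the unspecified portion c*, and the layout is assumed to be flat with integer strides. The specified modes are folded into the base offset, and the unspecified modes survive as the layout of the sub-tensor:\n",
    "\n",
    "```python\n",
    "def slice_tensor(base, shape, stride, coord):\n",
    "    # Fold the specified modes (c') into the base offset: E + L(c')\n",
    "    new_base = base + sum(c * s for c, s in zip(coord, stride) if c is not None)\n",
    "    # Keep the unspecified modes (c*) as the sub-tensor's layout\n",
    "    new_shape = tuple(n for c, n in zip(coord, shape) if c is None)\n",
    "    new_stride = tuple(s for c, s in zip(coord, stride) if c is None)\n",
    "    return new_base, new_shape, new_stride\n",
    "\n",
    "data = list(range(40))            # storage for an (8,5) tensor\n",
    "shape, stride = (8, 5), (5, 1)    # layout (8,5):(5,1)\n",
    "\n",
    "# Row 2: fix the first mode, leave the second unspecified\n",
    "rbase, rshape, rstride = slice_tensor(0, shape, stride, (2, None))\n",
    "row = [data[rbase + j * rstride[0]] for j in range(rshape[0])]\n",
    "\n",
    "# Column 3: fix the second mode, leave the first unspecified\n",
    "cbase, cshape, cstride = slice_tensor(0, shape, stride, (None, 3))\n",
    "col = [data[cbase + i * cstride[0]] for i in range(cshape[0])]\n",
    "```\n",
    "\n",
    "In this model row 2 becomes a rank-1 tensor with base offset 10 and layout `5:1`, and column 3 becomes a rank-1 tensor with base offset 3 and layout `8:5`."
   ]
  },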
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a[2] = 10.000000 (equivalent to a[(2,0)])\n",
      "a[9] = 6.000000 (equivalent to a[(1,1)])\n",
      "a[2,0] = 10.000000\n",
      "a[2,4] = 14.000000\n",
      "a[(2,4)] = 14.000000\n",
      "a[2,3] = 100.000000\n",
      "a[(2,4)] = 101.000000\n",
      "tensor([[  0.,   1.,   2.,   3.,   4.],\n",
      "        [  5.,   6.,   7.,   8.,   9.],\n",
      "        [ 10.,  11.,  12., 100., 101.],\n",
      "        [ 15.,  16.,  17.,  18.,  19.],\n",
      "        [ 20.,  21.,  22.,  23.,  24.],\n",
      "        [ 25.,  26.,  27.,  28.,  29.],\n",
      "        [ 30.,  31.,  32.,  33.,  34.],\n",
      "        [ 35.,  36.,  37.,  38.,  39.]])\n"
     ]
    }
   ],
   "source": [
    "@cute.jit\n",
    "def tensor_access_item(a: cute.Tensor):\n",
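    "    # access data using a linear index; the index is mapped to an n-d coordinate\n",
    "    # colexicographically over the shape (leftmost mode varies fastest)\n",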
    "    cute.printf(\"a[2] = {} (equivalent to a[{}])\", a[2],\n",
    "                cute.make_identity_tensor(a.layout.shape)[2])\n",
    "    cute.printf(\"a[9] = {} (equivalent to a[{}])\", a[9],\n",
    "                cute.make_identity_tensor(a.layout.shape)[9])\n",
    "\n",
    "    # access data using n-d coordinates; a[2, 4] and a[(2, 4)] are equivalent\n",
    "    cute.printf(\"a[2,0] = {}\", a[2, 0])\n",
    "    cute.printf(\"a[2,4] = {}\", a[2, 4])\n",
    "    cute.printf(\"a[(2,4)] = {}\", a[(2, 4)])\n",
    "\n",
    "    # assign values to tensor@(2,3) and tensor@(2,4), then read them back\n",
    "    a[2,3] = 100.0\n",
    "    a[2,4] = 101.0\n",
    "    cute.printf(\"a[2,3] = {}\", a[2,3])\n",
    "    cute.printf(\"a[(2,4)] = {}\", a[(2,4)])\n",
    "\n",
    "@cute.kernel\n",
    "def print_tensor_gpu(ptr: cute.Pointer):\n",
    "    layout = cute.make_layout((8, 5), stride=(5, 1))\n",
    "    tensor = cute.make_tensor(ptr, layout)\n",
    "\n",
    "    tidx, _, _ = cute.arch.thread_idx()\n",
    "\n",
    "    if tidx == 0:\n",
    "        cute.print_tensor(tensor)\n",
    "\n",
    "\n",
    "# Create a tensor with sequential data using torch\n",
    "data = torch.arange(0, 8*5, dtype=torch.float32).reshape(8, 5)\n",
    "tensor_access_item(from_dlpack(data))\n",
    "\n",
    "print(data)"
   ]
  },
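  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The linear-index results above (`a[2]` is the element at `(2,0)`, `a[9]` the element at `(1,1)`) follow from mapping the index to a coordinate colexicographically over the shape, with the leftmost mode varying fastest, and then applying the strides. A minimal plain-Python sketch of that arithmetic, where `idx2crd` and `crd2offset` are hypothetical helpers written for this illustration:\n",
    "\n",
    "```python\n",
    "def idx2crd(idx, shape):\n",
    "    # Colexicographic order: the leftmost mode varies fastest\n",
    "    coord = []\n",
    "    for extent in shape:\n",
    "        coord.append(idx % extent)\n",
    "        idx //= extent\n",
    "    return tuple(coord)\n",
    "\n",
    "def crd2offset(coord, stride):\n",
    "    # Inner product of the coordinate with the strides\n",
    "    return sum(c * s for c, s in zip(coord, stride))\n",
    "\n",
    "shape, stride = (8, 5), (5, 1)    # layout (8,5):(5,1)\n",
    "results = {}\n",
    "for idx in (2, 9):\n",
    "    coord = idx2crd(idx, shape)\n",
    "    results[idx] = (coord, crd2offset(coord, stride))\n",
    "    print(idx, coord, crd2offset(coord, stride))\n",
    "# prints: 2 (2, 0) 10\n",
    "# prints: 9 (1, 1) 6\n",
    "```\n",
    "\n",
    "Because the data was filled with `torch.arange`, the value stored at each offset equals the offset, which is why the cell above prints 10 and 6."
   ]
  },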
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tensor as memory view\n",
    "\n",
    "In CUDA programming, different memory spaces have different characteristics in terms of access speed, scope, and lifetime:\n",
    "\n",
    "- **generic**: Default memory space that can refer to any other memory space.\n",
    "- **global memory (gmem)**: Accessible by all threads across all blocks, but has higher latency.\n",
    "- **shared memory (smem)**: Accessible by all threads within a block, with much lower latency than global memory.\n",
    "- **register memory (rmem)**: Thread-private memory with the lowest latency, but limited capacity.\n",
    "- **tensor memory (tmem)**: Specialized memory introduced in the NVIDIA Blackwell architecture for tensor operations.\n",
    "\n",
    "When creating tensors in CuTe, you can specify the memory space to optimize performance based on your access patterns.\n",
    "\n",
    "For more information on CUDA memory spaces, see the [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Coordinate Tensor\n",
    "\n",
    "A coordinate tensor is a special type of tensor that maps coordinates to coordinates rather than to values: \n",
    "where a regular tensor maps each coordinate to a value of some type (such as a number), \n",
    "a coordinate tensor maps each coordinate to another coordinate.\n",
    "\n",
    "For example, given a shape (4,4), a coordinate tensor using row-major layout would appear as:\n",
    "\n",
    "\\begin{bmatrix} \n",
    "(0,0) & (0,1) & (0,2) & (0,3) \\\\\n",
    "(1,0) & (1,1) & (1,2) & (1,3) \\\\\n",
    "(2,0) & (2,1) & (2,2) & (2,3) \\\\\n",
    "(3,0) & (3,1) & (3,2) & (3,3)\n",
    "\\end{bmatrix}\n",
    "\n",
    "The same shape with a column-major layout would appear as:\n",
    "\n",
    "\\begin{bmatrix}\n",
    "(0,0) & (1,0) & (2,0) & (3,0) \\\\\n",
    "(0,1) & (1,1) & (2,1) & (3,1) \\\\\n",
    "(0,2) & (1,2) & (2,2) & (3,2) \\\\\n",
    "(0,3) & (1,3) & (2,3) & (3,3)\n",
    "\\end{bmatrix}\n",
    "\n",
    "The key points about coordinate tensors are:\n",
    "- Each element in the tensor is itself a coordinate tuple (i,j) rather than a scalar value\n",
    "- The coordinates map to themselves - so position (1,2) contains the coordinate (1,2)\n",
    "- The layout (row-major vs column-major) determines how these coordinate tuples are arranged in memory\n",
    "\n",
    "For example, coordinate tensors can be created using the `make_identity_tensor` utility:\n",
    "\n",
    "```python\n",
    "coord_tensor = cute.make_identity_tensor(layout.shape)\n",
    "```\n",
    "\n",
    "This creates a tensor that maps each coordinate to itself, providing a reference point for understanding how other layouts transform these coordinates."
   ]
  },
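  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An identity tensor prints with strides such as `1@0` and `1@1` (see the layout `(8,4):(1@0,1@1)` in this notebook's output). Assuming these denote unit basis vectors scaled by 1, evaluating the layout at `(i, j)` gives `i*e0 + j*e1 = (i, j)`. The plain-Python sketch below models that reading; `basis` and `identity_layout` are hypothetical helpers, not CuTe DSL functions:\n",
    "\n",
    "```python\n",
    "def basis(i, rank, scale=1):\n",
    "    # Stand-in for a basis stride like 1@0: scale times the unit vector e_i\n",
    "    return tuple(scale if k == i else 0 for k in range(rank))\n",
    "\n",
    "def identity_layout(coord):\n",
    "    # Sum of coord[i] * e_i over all modes\n",
    "    rank = len(coord)\n",
    "    out = (0,) * rank\n",
    "    for i, c in enumerate(coord):\n",
    "        e = basis(i, rank)\n",
    "        out = tuple(o + c * x for o, x in zip(out, e))\n",
    "    return out\n",
    "\n",
    "shape = (8, 4)\n",
    "# Every coordinate in the shape maps to itself\n",
    "all_identity = all(\n",
    "    identity_layout((i, j)) == (i, j)\n",
    "    for i in range(shape[0])\n",
    "    for j in range(shape[1])\n",
    ")\n",
    "```"
   ]
  },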
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor<(0,0) o (8,4):(1@0,1@1)>\n"
     ]
    }
   ],
   "source": [
    "@cute.jit\n",
    "def print_tensor_coord(a: cute.Tensor):\n",
    "    coord_tensor = cute.make_identity_tensor(a.layout.shape)\n",
    "    print(coord_tensor)\n",
    "\n",
    "a = torch.randn(8,4, dtype=torch_dtype(cutlass.Float32))\n",
    "print_tensor_coord(from_dlpack(a))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}