{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cutlass\n",
"import cutlass.cute as cute"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tensor\n",
|
|
"\n",
|
|
"A tensor in CuTe is created through the composition of two key components:\n",
|
|
"\n",
|
|
"1. An **Engine** (E) - A random-access, pointer-like object that supports:\n",
|
|
" - Offset operation: `e + d → e` (offset engine by elements of a layout's codomain)\n",
|
|
" - Dereference operation: `*e → v` (dereference engine to produce value)\n",
|
|
"\n",
|
|
"2. A **Layout** (L) - Defines the mapping from coordinates to offsets\n",
|
|
"\n",
|
|
"A tensor is formally defined as the composition of an engine E with a layout L, expressed as `T = E ∘ L`. When evaluating a tensor at coordinate c, it:\n",
|
|
"\n",
|
|
"1. Maps the coordinate c to the codomain using the layout\n",
|
|
"2. Offsets the engine accordingly\n",
|
|
"3. Dereferences the result to obtain the tensor's value\n",
|
|
"\n",
|
|
"This can be expressed mathematically as:\n",
|
|
"\n",
|
|
"```\n",
|
|
"T(c) = (E ∘ L)(c) = *(E + L(c))\n",
|
|
"```\n",
|
|
"\n",
|
|
"## Example Usage\n",
|
|
"\n",
|
|
"Here's a simple example of creating a tensor using pointer and layout `(8,5):(5,1)` and fill with ones:"
|
|
]
|
|
},
|
|
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"@cute.jit\n",
"def create_tensor_from_ptr(ptr: cute.Pointer):\n",
"    layout = cute.make_layout((8, 5), stride=(5, 1))\n",
"    tensor = cute.make_tensor(ptr, layout)\n",
"    tensor.fill(1)\n",
"    cute.print_tensor(tensor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This creates a tensor where:\n",
"- The engine is a pointer\n",
"- The layout has shape `(8, 5)` and stride `(5, 1)`\n",
"- The resulting tensor can be evaluated using coordinates defined by the layout\n",
"\n",
"We can test this by allocating a buffer with torch and calling the function with a pointer to the torch tensor's data:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor(raw_ptr(0x000000000736b0c0: f32, generic, align<4>) o (8,5):(5,1), data=\n",
" [[ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ],\n",
" [ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ],\n",
" [ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ],\n",
" ...\n",
" [ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ],\n",
" [ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ],\n",
" [ 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, ]])\n"
]
}
],
"source": [
"import torch\n",
"\n",
"from cutlass.torch import dtype as torch_dtype\n",
"import cutlass.cute.runtime as cute_rt\n",
"\n",
"a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))\n",
"ptr_a = cute_rt.make_ptr(cutlass.Float32, a.data_ptr())\n",
"\n",
"create_tensor_from_ptr(ptr_a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DLPack support\n",
"\n",
"CuTe DSL supports the DLPack protocol natively, which makes it easy to integrate with frameworks \n",
"that support DLPack, e.g. torch, numpy, jax, tensorflow, etc.\n",
"\n",
"For more information, please refer to the DLPack project: https://github.com/dmlc/dlpack\n",
"\n",
"Calling `from_dlpack` converts any tensor or ndarray object that supports `__dlpack__` and `__dlpack_device__`.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from cutlass.cute.runtime import from_dlpack\n",
"\n",
"@cute.jit\n",
"def print_tensor_dlpack(src: cute.Tensor):\n",
"    print(src)\n",
"    cute.print_tensor(src)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor<ptr<f32, generic> o (8,5):(5,1)>\n",
"tensor(raw_ptr(0x0000000007559340: f32, generic, align<4>) o (8,5):(5,1), data=\n",
" [[-1.151769, 1.019397, -0.371175, -0.717776, 0.502176, ],\n",
" [ 0.114282, 0.900084, 0.320770, 1.564574, -0.632329, ],\n",
" [-0.570140, 0.178112, -0.423079, 1.936198, 0.003355, ],\n",
" ...\n",
" [-2.425393, -0.275528, 1.267157, -0.811101, -0.985456, ],\n",
" [ 0.777889, -2.114074, 0.357184, -0.321312, -0.938138, ],\n",
" [ 1.959564, 1.797602, 0.116901, 0.306198, -1.837295, ]])\n"
]
}
],
"source": [
"a = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32))\n",
"\n",
"print_tensor_dlpack(from_dlpack(a))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor<ptr<f32, generic> o (8,8):(8,1)>\n",
"tensor(raw_ptr(0x0000000007979da0: f32, generic, align<4>) o (8,8):(8,1), data=\n",
" [[ 0.122739, -0.605744, -1.442022, ..., -0.356501, -0.993329, -0.091110, ],\n",
" [ 0.278448, 0.318482, -0.276867, ..., 1.542181, -1.701539, -0.309454, ],\n",
" [ 0.563565, -0.753936, 0.131214, ..., 0.437912, -0.482277, -0.051540, ],\n",
" ...\n",
" [-1.974096, -0.177881, 0.426807, ..., -1.579115, -0.304974, 0.451164, ],\n",
" [ 0.149851, -0.704689, -0.295063, ..., -0.653001, 0.008871, 0.903916, ],\n",
" [ 1.188619, 1.519662, 1.270734, ..., 0.404082, 0.173200, 0.093476, ]])\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"a = np.random.randn(8, 8).astype(np.float32)\n",
"\n",
"print_tensor_dlpack(from_dlpack(a))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tensor Evaluation Methods\n",
"\n",
"Tensors support two primary methods of evaluation:\n",
"\n",
"### 1. Full Evaluation\n",
"When the tensor is evaluated with a complete coordinate c, it computes the offset, applies it to the engine, \n",
"and dereferences the result to return the stored value. This is the straightforward case where you want to access \n",
"a specific element of the tensor.\n",
"\n",
"### 2. Partial Evaluation (Slicing)\n",
"When evaluating with an incomplete coordinate c = c' ⊕ c* (where c* represents the unspecified portion), \n",
"the result is a new tensor: a slice of the original tensor whose engine has been offset to account for \n",
"the coordinates that were provided. This operation can be expressed as:\n",
"\n",
"```\n",
"T(c) = (E ∘ L)(c) = (E + L(c')) ∘ L(c*) = T'(c*)\n",
"```\n",
"\n",
"Slicing effectively reduces the dimensionality of the tensor, creating a sub-tensor that can be \n",
"further evaluated or manipulated. A slicing sketch follows the item-access example below."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a[2] = 10.000000 (equivalent to a[(2,0)])\n",
"a[9] = 6.000000 (equivalent to a[(1,1)])\n",
"a[2,0] = 10.000000\n",
"a[2,4] = 14.000000\n",
"a[(2,4)] = 14.000000\n",
"a[2,3] = 100.000000\n",
"a[(2,4)] = 101.000000\n",
"tensor([[ 0., 1., 2., 3., 4.],\n",
" [ 5., 6., 7., 8., 9.],\n",
" [ 10., 11., 12., 100., 101.],\n",
" [ 15., 16., 17., 18., 19.],\n",
" [ 20., 21., 22., 23., 24.],\n",
" [ 25., 26., 27., 28., 29.],\n",
" [ 30., 31., 32., 33., 34.],\n",
" [ 35., 36., 37., 38., 39.]])\n"
]
}
],
"source": [
"@cute.jit\n",
"def tensor_access_item(a: cute.Tensor):\n",
"    # access data using a linear index\n",
"    cute.printf(\"a[2] = {} (equivalent to a[{}])\", a[2],\n",
"                cute.make_identity_tensor(a.layout.shape)[2])\n",
"    cute.printf(\"a[9] = {} (equivalent to a[{}])\", a[9],\n",
"                cute.make_identity_tensor(a.layout.shape)[9])\n",
"\n",
"    # access data using n-d coordinates; a[2, 4] and a[(2, 4)] are equivalent\n",
"    cute.printf(\"a[2,0] = {}\", a[2, 0])\n",
"    cute.printf(\"a[2,4] = {}\", a[2, 4])\n",
"    cute.printf(\"a[(2,4)] = {}\", a[2, 4])\n",
"\n",
"    # assign values to the tensor at (2,3) and (2,4)\n",
"    a[2,3] = 100.0\n",
"    a[2,4] = 101.0\n",
"    cute.printf(\"a[2,3] = {}\", a[2,3])\n",
"    cute.printf(\"a[(2,4)] = {}\", a[(2,4)])\n",
"\n",
"\n",
"# Create a tensor with sequential data using torch\n",
"data = torch.arange(0, 8*5, dtype=torch.float32).reshape(8, 5)\n",
"tensor_access_item(from_dlpack(data))\n",
"\n",
"print(data)"
]
},
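{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only exercises full evaluation. As a minimal sketch of partial evaluation, the snippet below slices the same tensor by leaving one mode unspecified with `None` (the Python counterpart of CuTe's `_` wildcard). The slicing syntax is assumed here rather than taken from this notebook, so check the CuTe DSL documentation if it differs in your version:\n",
"\n",
"```python\n",
"@cute.jit\n",
"def tensor_slice_example(a: cute.Tensor):\n",
"    # assumed slicing syntax: None keeps the first mode free and fixes the second mode to 2,\n",
"    # producing a rank-1 sub-tensor whose engine is offset by L((0, 2))\n",
"    column = a[None, 2]\n",
"    cute.print_tensor(column)\n",
"\n",
"tensor_slice_example(from_dlpack(data))\n",
"```"
]
},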
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tensor as memory view\n",
"\n",
"In CUDA programming, different memory spaces have different characteristics in terms of access speed, scope, and lifetime:\n",
"\n",
"- **generic**: Default memory space that can refer to any other memory space.\n",
"- **global memory (gmem)**: Accessible by all threads across all blocks, but has higher latency.\n",
"- **shared memory (smem)**: Accessible by all threads within a block, with much lower latency than global memory.\n",
"- **register memory (rmem)**: Thread-private memory with the lowest latency, but limited capacity.\n",
"- **tensor memory (tmem)**: Specialized memory introduced in the NVIDIA Blackwell architecture for tensor operations.\n",
"\n",
"When creating tensors in CuTe, you can specify the memory space to optimize performance based on your access patterns.\n",
"\n",
"For more information on CUDA memory spaces, see the [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy).\n"
]
},
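{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tensors created so far report the `generic` address space because they were built from host buffers or raw pointers. As a quick, hedged sketch (it assumes a CUDA-capable device is available and that `from_dlpack` tags device buffers with the `gmem` address space, which may vary by version), the snippet below converts a device-resident torch tensor and prints only its static type, since device memory should not be dereferenced from host code:\n",
"\n",
"```python\n",
"@cute.jit\n",
"def print_tensor_type(src: cute.Tensor):\n",
"    # print only the compile-time tensor type, which includes the address space\n",
"    print(src)\n",
"\n",
"# assumption: a CUDA device is available in this environment\n",
"a_gpu = torch.randn(8, 5, dtype=torch_dtype(cutlass.Float32), device=\"cuda\")\n",
"print_tensor_type(from_dlpack(a_gpu))  # expected to report gmem rather than generic\n",
"```"
]
},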
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Coordinate Tensor\n",
"\n",
"A coordinate tensor is a special type of tensor that maps coordinates to coordinates rather than to values. \n",
"The key distinction is that while regular tensors map coordinates to some value type (like numbers), \n",
"coordinate tensors map coordinates to other coordinates.\n",
"\n",
"For example, given a shape (4,4), a coordinate tensor using row-major layout would appear as:\n",
"\n",
"\\begin{bmatrix} \n",
"(0,0) & (0,1) & (0,2) & (0,3) \\\\\n",
"(1,0) & (1,1) & (1,2) & (1,3) \\\\\n",
"(2,0) & (2,1) & (2,2) & (2,3) \\\\\n",
"(3,0) & (3,1) & (3,2) & (3,3)\n",
"\\end{bmatrix}\n",
"\n",
"The same shape with a column-major layout would appear as:\n",
"\n",
"\\begin{bmatrix}\n",
"(0,0) & (1,0) & (2,0) & (3,0) \\\\\n",
"(0,1) & (1,1) & (2,1) & (3,1) \\\\\n",
"(0,2) & (1,2) & (2,2) & (3,2) \\\\\n",
"(0,3) & (1,3) & (2,3) & (3,3)\n",
"\\end{bmatrix}\n",
"\n",
"The key points about coordinate tensors are:\n",
"- Each element in the tensor is itself a coordinate tuple (i,j) rather than a scalar value\n",
"- The coordinates map to themselves - so position (1,2) contains the coordinate (1,2)\n",
"- The layout (row-major vs column-major) determines how these coordinate tuples are arranged in memory\n",
"\n",
"For example, coordinate tensors can be created using the `make_identity_tensor` utility:\n",
"\n",
"```python\n",
"coord_tensor = cute.make_identity_tensor(layout.shape)\n",
"```\n",
"\n",
"This creates a tensor that maps each coordinate to itself, providing a reference point for understanding how other layouts transform these coordinates."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor<(0,0) o (8,4):(1@0,1@1)>\n"
]
}
],
"source": [
"@cute.jit\n",
"def print_tensor_coord(a: cute.Tensor):\n",
"    coord_tensor = cute.make_identity_tensor(a.layout.shape)\n",
"    print(coord_tensor)\n",
"\n",
"a = torch.randn(8,4, dtype=torch_dtype(cutlass.Float32))\n",
"print_tensor_coord(from_dlpack(a))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}