{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your First Program with CuTe DSL\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Welcome! In this tutorial, we'll write a simple \"Hello World\" program that runs on your GPU using CuTe DSL. Along the way, you'll pick up the basics of GPU programming with the framework.\n",
    "\n",
    "### What You'll Learn\n",
    "\n",
    "- How to write code that runs on both the CPU (host) and the GPU (device)\n",
    "- How to launch a GPU kernel (a function that runs on the GPU)\n",
    "- Basic CUDA concepts such as threads and thread blocks\n",
    "\n",
    "### Step 1: Import Required Libraries\n",
    "\n",
    "First, let's import the libraries we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cutlass\n",
    "import cutlass.cute as cute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Write Our GPU Kernel\n",
    "\n",
    "A GPU kernel is a function that runs on the GPU. Here's a simple kernel that prints \"Hello world\".\n",
    "\n",
    "Key concepts:\n",
    "\n",
    "- `@cute.kernel`: This decorator tells CUTLASS that the function should run on the GPU\n",
    "- `cute.arch.thread_idx()`: Returns the (x, y, z) index of the current GPU thread within its block (like a worker's ID number)\n",
    "- Only one thread (thread 0) prints the message; otherwise every thread in the block would print its own copy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cute.kernel\n",
    "def kernel():\n",
    "    # Get the x component of the thread index (y and z components are unused)\n",
    "    tidx, _, _ = cute.arch.thread_idx()\n",
    "    # Only the first thread (thread 0) prints the message\n",
    "    if tidx == 0:\n",
    "        cute.printf(\"Hello world\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Write Our Host Function\n",
    "\n",
    "Now we need a function that sets up the GPU and launches our kernel.\n",
    "\n",
    "Key concepts:\n",
    "\n",
    "- `@cute.jit`: This decorator marks a function that runs on the CPU but can launch GPU code\n",
    "- We need to initialize CUDA before using the GPU (we do this in Step 4)\n",
    "- `.launch()` sets the launch configuration: the number of thread blocks (`grid`), the number of threads per block (`block`), shared memory, and so on"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "@cute.jit\n",
    "def hello_world():\n",
    "    # Print hello world from host code\n",
    "    cute.printf(\"hello world\")\n",
    "\n",
    "    # Launch kernel\n",
    "    kernel().launch(\n",
    "        grid=(1, 1, 1),   # Single thread block\n",
    "        block=(32, 1, 1)  # One warp (32 threads) per thread block\n",
    "    )"
   ]
  },
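  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the launch configuration above concrete, here is a minimal plain-Python sketch. It is not CuTe DSL code and does not touch the GPU. It counts how many threads a `grid`/`block` pair describes and how many of them would pass the kernel's `tidx == 0` check. The helper names `total_threads` and `printing_threads` are hypothetical and exist only for this illustration; the sketch assumes the usual CUDA model, in which every block runs the same kernel with its own set of thread indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import prod\n",
    "\n",
    "def total_threads(grid, block):\n",
    "    # Threads launched = number of blocks * threads per block\n",
    "    return prod(grid) * prod(block)\n",
    "\n",
    "def printing_threads(grid, block):\n",
    "    # In each block, the x index is 0 for exactly one thread per (y, z) pair\n",
    "    _, block_y, block_z = block\n",
    "    return prod(grid) * block_y * block_z\n",
    "\n",
    "# Same values as in hello_world(): one block of 32 threads\n",
    "grid, block = (1, 1, 1), (32, 1, 1)\n",
    "print(total_threads(grid, block))     # 32\n",
    "print(printing_threads(grid, block))  # 1"
   ]
  },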
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Run Our Program\n",
    "\n",
    "There are two ways to run our program:\n",
    "\n",
    "1. Compile and run immediately\n",
    "2. Compile separately, which lets us compile the code once and run it multiple times\n",
    "\n",
    "Note that in the output below, the `Compiling...` line from Method 2 appears before the kernel's \"Hello world\" from Method 1. This reflects the asynchronous behavior between CPU and GPU prints: the host keeps running after it launches the kernel, so its output can appear before the GPU's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running hello_world()...\n",
      "hello world\n",
      "Compiling...\n",
      "Hello world\n",
      "Running compiled version...\n",
      "hello world\n"
     ]
    }
   ],
   "source": [
    "# Initialize the CUDA context (with error checking) before launching a kernel.\n",
    "# Context initialization is explicit so that users control when the context is created\n",
    "# and avoid potential issues with multiple contexts.\n",
    "cutlass.cuda.initialize_cuda_context()\n",
    "\n",
    "# Method 1: Just-In-Time (JIT) compilation - compiles and runs the code immediately\n",
    "print(\"Running hello_world()...\")\n",
    "hello_world()\n",
    "\n",
    "# Method 2: Compile first (useful if you want to run the same code multiple times)\n",
    "print(\"Compiling...\")\n",
    "hello_world_compiled = cute.compile(hello_world)\n",
    "\n",
    "# Run the pre-compiled version\n",
    "print(\"Running compiled version...\")\n",
    "hello_world_compiled()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}