.. _debugging:

Debugging
=========

This page provides an overview of debugging techniques and tools for CuTe DSL programs.

Getting Familiar with the Limitations
-------------------------------------

Before diving into comprehensive debugging capabilities, it's important to understand the limitations of CuTe DSL.
Understanding these limitations will help you avoid potential pitfalls from the start.

Please refer to :doc:`../limitations` for more details.

Source Code Correlation
-----------------------

CuTe DSL provides Python-to-PTX/SASS source correlation by generating line information when compiling the kernel, which enables profiling and debugging of the generated kernels.

You can enable this globally by setting the environment variable ``CUTE_DSL_LINEINFO=1``. Alternatively, you can use compilation options to enable it per kernel. Please refer to :doc:`./dsl_jit_compilation_options` for more details.

DSL Debugging
-------------

CuTe DSL provides built-in logging mechanisms to help you understand the code execution flow and some of the internal state.

Enabling Logging
~~~~~~~~~~~~~~~~

CuTe DSL provides environment variables to control where logs are written and how verbose they are:

.. code:: bash

   # Enable console logging (default: False)
   export CUTE_DSL_LOG_TO_CONSOLE=1

   # Log to file instead of console (default: False)
   export CUTE_DSL_LOG_TO_FILE=my_log.txt

   # Control log verbosity (0, 10, 20, 30, 40, 50, default: 10)
   export CUTE_DSL_LOG_LEVEL=20

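If you would rather scope these settings to a single run than export them in your shell, one option is to pass them to a child process from Python. The sketch below is illustrative only: the helper names ``logging_env`` and ``run_with_logging`` are hypothetical, and it assumes the DSL picks the variables up from the environment of the process that runs your script.

```python
import os
import subprocess
import sys


def logging_env(level=20, to_console=True, log_file=None, base=None):
    """Return a copy of the environment with the CuTe DSL logging variables set."""
    env = dict(os.environ if base is None else base)
    if to_console:
        env["CUTE_DSL_LOG_TO_CONSOLE"] = "1"
    if log_file is not None:
        env["CUTE_DSL_LOG_TO_FILE"] = str(log_file)
    env["CUTE_DSL_LOG_LEVEL"] = str(level)
    return env


def run_with_logging(argv, **kwargs):
    """Run a command, e.g. [sys.executable, "your_dsl_code.py"], with logging enabled."""
    return subprocess.run(argv, env=logging_env(**kwargs), capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only echoes the variable; replace it with your own script.
    child = [sys.executable, "-c", "import os; print(os.environ['CUTE_DSL_LOG_LEVEL'])"]
    print(run_with_logging(child, level=10).stdout.strip())
```

Because the variables live only in the child's environment, other work in the same shell is unaffected.
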
Log Categories and Levels
~~~~~~~~~~~~~~~~~~~~~~~~~

Similar to standard Python logging, different log levels provide varying degrees of detail:

+--------+-------------+
| Level  | Description |
+========+=============+
| 0      | Disabled    |
+--------+-------------+
| 10     | Debug       |
+--------+-------------+
| 20     | Info        |
+--------+-------------+
| 30     | Warning     |
+--------+-------------+
| 40     | Error       |
+--------+-------------+
| 50     | Critical    |
+--------+-------------+

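The numeric values in the table coincide with the constants of Python's standard ``logging`` module, which the short sketch below illustrates. The dictionary ``CUTE_DSL_LOG_LEVELS`` and the helper ``describe_level`` are hypothetical names introduced here for illustration and are not part of the DSL.

```python
import logging

# Values and descriptions copied from the table above.
CUTE_DSL_LOG_LEVELS = {
    0: "Disabled",
    10: "Debug",
    20: "Info",
    30: "Warning",
    40: "Error",
    50: "Critical",
}


def describe_level(value):
    """Return the table's description for a CUTE_DSL_LOG_LEVEL value."""
    try:
        return CUTE_DSL_LOG_LEVELS[int(value)]
    except (KeyError, ValueError):
        raise ValueError(f"unsupported log level: {value!r}") from None


if __name__ == "__main__":
    for number, label in CUTE_DSL_LOG_LEVELS.items():
        # The stdlib name for 0 is "NOTSET"; the table describes level 0 as "Disabled".
        print(number, label, logging.getLevelName(number))
```
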
Dump the generated IR
~~~~~~~~~~~~~~~~~~~~~

For users familiar with MLIR and compilers, CuTe DSL supports dumping the Intermediate Representation (IR).
This helps you verify whether the IR is generated as expected.

.. code:: bash

   # Dump Generated CuTe IR (default: False)
   export CUTE_DSL_PRINT_IR=1

   # Keep Generated CuTe IR in a file (default: False)
   export CUTE_DSL_KEEP_IR=1

Dump the generated PTX & CUBIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For users familiar with PTX and SASS, CuTe DSL supports dumping the generated PTX and CUBIN.

.. code:: bash

   # Dump generated PTX in a .ptx file (default: False)
   export CUTE_DSL_KEEP_PTX=1

   # Dump generated cubin in a .cubin file (default: False)
   export CUTE_DSL_KEEP_CUBIN=1

To obtain SASS from the cubin, use ``nvdisasm`` (usually installed with the CUDA Toolkit) to disassemble it.

.. code:: bash

   nvdisasm your_dsl_code.cubin > your_dsl_code.sass

Access the dumped contents programmatically
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For compiled kernels, the generated PTX/CUBIN/IR can also be accessed programmatically through the following attributes:

- ``__ptx__``: The generated PTX code of the compiled kernel.
- ``__cubin__``: The generated CUBIN data of the compiled kernel.
- ``__mlir__``: The generated IR code of the compiled kernel.

.. code:: python

   import cutlass.cute as cute

   # `foo` is your JIT function; `...` stands for its arguments.
   compiled_foo = cute.compile(foo, ...)
   print(f"PTX: {compiled_foo.__ptx__}")
   # The cubin is binary data, so write it in binary mode.
   with open("foo.cubin", "wb") as f:
       f.write(compiled_foo.__cubin__)

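If you want to keep all three artifacts side by side, a small helper along the following lines can write them out in one call. This is a sketch under stated assumptions: ``save_artifacts`` is a hypothetical helper, the ``.mlir`` suffix is an arbitrary choice, and the code treats an attribute as absent when it is missing or ``None``. The demonstration uses a stand-in object with made-up contents; in practice you would pass the object returned by ``cute.compile``.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_artifacts(compiled, stem, out_dir="."):
    """Write whichever of __ptx__, __mlir__ and __cubin__ are present to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for attr, suffix in (("__ptx__", ".ptx"), ("__mlir__", ".mlir"), ("__cubin__", ".cubin")):
        data = getattr(compiled, attr, None)
        if data is None:
            continue  # attribute missing or not populated
        path = out / f"{stem}{suffix}"
        if isinstance(data, (bytes, bytearray)):
            path.write_bytes(data)  # binary payloads such as the cubin
        else:
            path.write_text(str(data))  # textual payloads such as PTX or IR
        written.append(path)
    return written


if __name__ == "__main__":
    # Stand-in for a compiled kernel; the contents are placeholders, not real output.
    fake = SimpleNamespace(__ptx__="// placeholder PTX", __cubin__=b"\x7fELF", __mlir__=None)
    with tempfile.TemporaryDirectory() as tmp:
        print([p.name for p in save_artifacts(fake, "foo", tmp)])
```
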
Change the dump directory
~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all dumped files are saved in the current working directory. To save them elsewhere, set the environment variable ``CUTE_DSL_DUMP_DIR`` to the desired directory.

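As a rough illustration, the dump settings can also be prepared from Python before the DSL is used. The helpers ``configure_dump_dir`` and ``list_dumps`` are hypothetical, and the sketch assumes the variables must be in the environment before the DSL reads them, so call it at the top of your script. Whether the DSL creates a missing directory itself is not covered here, so the sketch creates it up front.

```python
import os
import tempfile
from pathlib import Path


def configure_dump_dir(dump_dir, keep=("PTX", "CUBIN", "IR")):
    """Create dump_dir and set CUTE_DSL_DUMP_DIR plus the chosen CUTE_DSL_KEEP_* flags."""
    path = Path(dump_dir).resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["CUTE_DSL_DUMP_DIR"] = str(path)
    for kind in keep:
        os.environ[f"CUTE_DSL_KEEP_{kind}"] = "1"
    return path


def list_dumps(dump_dir):
    """Return the sorted names of the files currently in dump_dir."""
    return sorted(p.name for p in Path(dump_dir).iterdir() if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = configure_dump_dir(Path(tmp) / "dumps", keep=("PTX", "CUBIN"))
        print(os.environ["CUTE_DSL_DUMP_DIR"] == str(target))
        # Empty until a kernel has actually been compiled with these settings.
        print(list_dumps(target))
```
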
Kernel Functional Debugging
----------------------------

Using Python's ``print`` and CuTe's ``cute.printf``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CuTe DSL programs can use both Python's native ``print()`` and CuTe's own ``cute.printf()`` to
print debug information during kernel generation and execution. They differ in a few key ways:

- Python's ``print()`` executes at compile time only (it has no effect on the generated kernel) and is
  typically used for printing static values (e.g. fully static layouts).
- ``cute.printf()`` executes at runtime on the GPU itself and changes the PTX being generated. It
  can be used to print tensor values at runtime for diagnostics, but incurs a performance
  overhead similar to that of ``printf()`` in CUDA C.

For detailed examples of using these functions for debugging, please refer to the associated
notebook referenced in :doc:`notebooks`.

Handling Unresponsive/Hung Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a kernel becomes unresponsive and ``SIGINT`` (``CTRL+C``) fails to terminate it,
you can follow these steps to forcefully terminate the process:

1. Use ``CTRL+Z`` to suspend the process running the unresponsive kernel.
2. Execute the following command to terminate the suspended process:

   .. code:: bash

      # Terminate the most recently suspended process
      kill -9 $(jobs -p | tail -1)

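When you already suspect that a script may hang, an alternative to the manual steps above is to launch it under a watchdog that enforces a time limit. The following is a minimal sketch using Python's ``subprocess`` module; ``run_with_timeout`` is a hypothetical helper, and the sketch makes no claim about how quickly a process blocked inside the GPU driver actually exits once it has been killed.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv; return its exit code, or None if it was killed after timeout_s seconds."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising.
        return None


if __name__ == "__main__":
    # Stand-in for a hung script: sleeps far longer than the time limit.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(hung, timeout_s=1.0))
```

In real use, replace the stand-in command with ``[sys.executable, "your_dsl_code.py"]`` and pick a limit comfortably above the expected runtime.
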
CuTe DSL can also be debugged using standard NVIDIA CUDA tools.

Using Compute-Sanitizer
~~~~~~~~~~~~~~~~~~~~~~~

For detecting memory errors and race conditions:

.. code:: bash

   compute-sanitizer --some_options python your_dsl_code.py

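The ``--some_options`` placeholder above stands for whichever options you need. As one concrete possibility, compute-sanitizer's ``--tool`` option selects the checker, for example ``memcheck`` for memory errors or ``racecheck`` for shared-memory races. The sketch below builds such a command line from Python; ``sanitizer_command`` and ``run_sanitizer`` are hypothetical helpers, and the script name is the placeholder used above.

```python
import shutil
import subprocess
import sys

# Checkers selectable through compute-sanitizer's --tool option.
TOOLS = ("memcheck", "racecheck", "initcheck", "synccheck")


def sanitizer_command(script, tool="memcheck", python=sys.executable):
    """Build the argument list for running a Python script under compute-sanitizer."""
    if tool not in TOOLS:
        raise ValueError(f"unknown compute-sanitizer tool: {tool!r}")
    return ["compute-sanitizer", "--tool", tool, python, str(script)]


def run_sanitizer(script, tool="memcheck"):
    """Run the script under compute-sanitizer; return None if the tool is not on PATH."""
    if shutil.which("compute-sanitizer") is None:
        return None
    return subprocess.run(sanitizer_command(script, tool)).returncode


if __name__ == "__main__":
    # Only prints the command; call run_sanitizer() to actually execute it.
    print(" ".join(sanitizer_command("your_dsl_code.py", tool="racecheck", python="python")))
```
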
Please refer to the `compute-sanitizer documentation <https://developer.nvidia.com/compute-sanitizer>`_ for more details.

Conclusion
----------

This page covered several key methods for debugging CuTe DSL programs. Effective debugging typically requires a combination of these approaches.
If you encounter issues with CuTe DSL, you can enable logging and report the bug by opening a GitHub issue and sharing the logs with the CUTLASS team.