
.. _dsl_code_generation:
.. |DC|  replace:: dynamic compilation
.. |DSL| replace:: CuTe DSL
.. |IR|  replace:: intermediate representation (IR)

End-to-End Code Generation
==========================


1. Techniques for Turning Python into |IR|
------------------------------------------

1.1 AST rewrite
^^^^^^^^^^^^^^^^
The function’s abstract-syntax tree is analysed **before** execution.
Python control-flow (``for``/``while``, ``if``/``else``) and built-ins are converted to structured |IR|
constructs.  Computation inside each region is left untouched at this stage.

*Advantages*

*  Sees the entire program, so every branch and loop is preserved.
*  Keeps loop structure intact for optimizations such as tiling, vectorisation
   or GPU thread mapping.

*Disadvantages*

*  Requires a well-defined Python subset that the rewriter understands.
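
The analysis step described above can be illustrated with Python's standard
``ast`` module. The following is a minimal sketch, not the |DSL|
implementation: the ``lower_block`` helper and the tuple-based structure it
returns are hypothetical names chosen for illustration. It walks a function
body **without executing it**, turns ``for`` and ``if``/``else`` into nested
structured entries, and keeps every other statement as opaque source text.

.. code-block:: python

   import ast
   import textwrap

   # Source of a small function; it is parsed, never executed.
   SOURCE = textwrap.dedent("""
       def kernel(x, n):
           acc = 0
           for i in range(n):
               if x > 0:
                   acc = acc + x
               else:
                   acc = acc - x
           return acc
   """)


   def lower_block(stmts):
       """Hypothetical helper: map AST statements to a nested, IR-like structure."""
       ops = []
       for stmt in stmts:
           if isinstance(stmt, ast.For):
               # Loop structure is preserved as a single structured entry.
               ops.append(("for", ast.unparse(stmt.target),
                           ast.unparse(stmt.iter), lower_block(stmt.body)))
           elif isinstance(stmt, ast.If):
               # Both branches are captured, whether or not they would run.
               ops.append(("if", ast.unparse(stmt.test),
                           lower_block(stmt.body), lower_block(stmt.orelse)))
           else:
               # Computation is left untouched at this stage.
               ops.append(("stmt", ast.unparse(stmt)))
       return ops


   func_def = ast.parse(SOURCE).body[0]
   structured = lower_block(func_def.body)

Because the structure is built from the syntax tree, the ``else`` branch
appears in ``structured`` even though no input value was ever supplied.
``ast.unparse`` requires Python 3.9 or newer.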


1.2 Tracing
^^^^^^^^^^^
The decorated function is executed once with *proxy* arguments; overloaded
operators record every tensor operation that actually runs and produce a flat
trace that is lowered to |IR|.

*Advantages*

*  Near-zero compile latency, ideal for straight-line arithmetic.
*  No need to parse Python source, so it supports many dynamic Python
   features.

*Disadvantages*

*  Untaken branches vanish, so the generated kernel may be wrong for other
   inputs.
*  Loops are flattened to the iteration count observed during tracing.
*  Data-dependent control-flow freezes to a single execution path.
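
The limitations listed above can be reproduced with a toy tracer. This is a
sketch under simplifying assumptions, not how |DSL| traces code: the ``Proxy``
class, the ``trace`` helper and the textual operation format are all
hypothetical. Each proxy carries an example value so that Python can resolve
``if`` conditions while the function runs, which is exactly what causes
branches to disappear.

.. code-block:: python

   class Proxy:
       """Hypothetical stand-in for a runtime value; records operations on it."""

       def __init__(self, name, ops, concrete):
           self.name = name
           self.ops = ops            # shared flat list of recorded operations
           self.concrete = concrete  # example value, used to resolve control flow

       def _record(self, op, other, value):
           rhs = other.name if isinstance(other, Proxy) else repr(other)
           result = Proxy(f"v{len(self.ops)}", self.ops, value)
           self.ops.append(f"{result.name} = {op} {self.name}, {rhs}")
           return result

       def __add__(self, other):
           return self._record("add", other, self.concrete + _value(other))

       def __mul__(self, other):
           return self._record("mul", other, self.concrete * _value(other))

       def __gt__(self, other):
           # Python needs a real bool here, so the branch is decided at trace time.
           return self.concrete > _value(other)


   def _value(x):
       return x.concrete if isinstance(x, Proxy) else x


   def kernel(x, n):
       acc = x
       for _ in range(n):   # n is a plain int: the loop is flattened into the trace
           if x > 0:        # only the side that runs is recorded
               acc = acc + x
           else:
               acc = acc * -1
       return acc


   def trace(fn, example_x, n):
       """Run fn once with a proxy argument and return the flat trace."""
       ops = []
       fn(Proxy("x", ops, example_x), n)
       return ops


   positive_trace = trace(kernel, example_x=2, n=3)
   negative_trace = trace(kernel, example_x=-2, n=3)

``positive_trace`` contains three ``add`` operations and no ``mul``;
``negative_trace`` contains three ``mul`` operations and no ``add``. Neither
trace contains a loop or a branch, so a kernel generated from one of them
would be wrong for inputs that take the other path.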


2. |DSL| Code-Generation Modes
------------------------------

CuTe’s Python front-end combines the techniques above into **two mutually
exclusive modes**, selectable with the ``preprocess`` flag of the
``@jit`` decorator:

1. Tracing mode ``@jit(preprocess=False)`` – tracing only.
   This results in the fastest compilation path and is recommended only for
   kernels that are guaranteed to be straight-line arithmetic. It suffers from
   all tracing limitations listed in the previous section.

2. Preprocessor mode (**default**) ``@jit(preprocess=True)`` – **AST rewrite + tracing**.
   The AST pass captures every loop and branch, eliminating the correctness and
   optimisation problems of pure tracing; tracing then fills in the arithmetic.
   This hybrid “preprocessor” pipeline is unique to |DSL| and was designed
   specifically to overcome the disadvantages identified above.

.. figure:: dsl_modes.png
   :width: 400
   :align: center

   *Left*: tracing mode records only the path that executed.
   *Right*: preprocessor mode emits structured |IR| for every branch and loop
   before tracing the arithmetic.


Why Tracing-Only Is Insufficient for Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Branch loss** – The untaken side of an ``if``/``else`` is never lowered.
* **Loop unrolling** – Loops are flattened to the iteration count observed,
  destroying structure needed for parallel mapping and tiling.
* **Data-dependent paths** – Control-flow that depends on tensor values freezes
  to a single execution path at trace time.

The preprocessor mode fixes all of these by lowering control-flow first and delegating
only the arithmetic to the tracer.
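
The division of labour between the two passes can be sketched in plain
Python. Everything below is a hypothetical illustration rather than the |DSL|
implementation: ``Builder``, ``Value``, ``structured_if`` and the tuple-based
IR are invented for this example, and ``rewritten_kernel`` is hand-written to
show the kind of code an AST pass might produce from
``if x > 0: x + 1 else: x - 1``. Values yielded by the branches are omitted
for brevity.

.. code-block:: python

   class Value:
       """Hypothetical symbolic value; arithmetic is recorded, never evaluated."""

       def __init__(self, name, builder):
           self.name = name
           self.builder = builder

       def __add__(self, other):
           return self.builder.emit("add", self, other)

       def __sub__(self, other):
           return self.builder.emit("sub", self, other)

       def __gt__(self, other):
           # Returns a symbolic value, so no Python branch is taken here.
           return self.builder.emit("cmp_gt", self, other)


   class Builder:
       """Hypothetical IR builder with nested regions."""

       def __init__(self):
           self.root = []
           self.region = self.root  # list that currently receives operations
           self.counter = 0

       def emit(self, op, lhs, rhs):
           rhs_name = rhs.name if isinstance(rhs, Value) else repr(rhs)
           result = Value(f"v{self.counter}", self)
           self.counter += 1
           self.region.append((result.name, op, lhs.name, rhs_name))
           return result

       def trace_region(self, fn):
           """Trace fn into a fresh region and return the recorded operations."""
           saved, self.region = self.region, []
           try:
               fn()
               return self.region
           finally:
               self.region = saved

       def structured_if(self, cond, then_fn, else_fn):
           # Control flow is lowered structurally: both sides are always traced.
           then_ops = self.trace_region(then_fn)
           else_ops = self.trace_region(else_fn)
           self.region.append(("if", cond.name, then_ops, else_ops))


   def rewritten_kernel(b, x):
       # Stand-in for the output of an AST rewrite of an if/else statement.
       cond = x > 0
       b.structured_if(cond, lambda: x + 1, lambda: x - 1)


   builder = Builder()
   rewritten_kernel(builder, Value("x", builder))
   ir = builder.root

In this sketch ``ir`` holds one comparison followed by a single ``if`` entry
whose two regions contain the traced ``add`` and ``sub`` operations, so
neither branch is lost and no example input is needed to choose between them.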