Changelog for CuTe DSL API changes
4.1.0 (2025-07-16)
- for loop
  - Python built-in `range` now always generates IR and executes at runtime
  - `cutlass.range` is an advanced `range` with IR-level unrolling and pipelining control
  - Deprecated `cutlass.range_dynamic`; please replace it with `range` or `cutlass.range`
  - Experimental: added `pipelining` control for compiler-generated software pipeline code
- while/if
  - `while`/`if` now generate IR and execute at runtime by default, unless `cutlass.const_expr` is specified for the predicate
  - Deprecated `cutlass.dynamic_expr`; please remove it
- Rename mbarrier functions to reduce ambiguity
- Modify SyncObject API (`MbarrierArray`, `NamedBarrier`, `TmaStoreFence`) to match `std::barrier`
- Change pipeline `create` function to take only keyword arguments, and make `barrier_storage` optional.
- Introduce `cutlass.cute.arch.get_dyn_smem_size` API to get runtime dynamic shared memory size.
- Various API Support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a `make_blockscaled_trivial_tiled_mma` function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
- `cutlass.cute.compile` now supports compilation options. Refer to JIT compilation options for more details.
- `cutlass.cute.testing.assert_` now works for device JIT functions. Specify `--enable-device-assertions` as a compilation option to enable it.
- `cutlass.cute.make_tiled_copy` is now deprecated. Please use `cutlass.cute.make_tiled_copy_tv` instead.
- Shared memory capacity query
  - Introduce `cutlass.utils.get_smem_capacity_in_bytes` for querying the shared memory capacity.
  - `<arch>_utils.SMEM_CAPACITY["<arch_str>"]` is now deprecated.
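Several 4.1.0 entries deprecate names that may still appear in existing kernels. As a rough migration aid, the sketch below scans Python source text for those names and reports the replacement suggested in this changelog. It is not part of CUTLASS: `DEPRECATED` and `find_deprecated` are hypothetical names, the matching is a plain text search on whole identifiers rather than a real parse, and the sample snippet is only a string that is never executed.

```python
import re

# Deprecated identifier -> advice, taken from the 4.1.0 entries in this changelog.
DEPRECATED = {
    "range_dynamic": "replace with range or cutlass.range",
    "dynamic_expr": "remove it; while/if generate IR by default",
    "make_tiled_copy": "use cutlass.cute.make_tiled_copy_tv",
    "SMEM_CAPACITY": "use cutlass.utils.get_smem_capacity_in_bytes",
}


def find_deprecated(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, deprecated name, advice) for each match in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, advice in DEPRECATED.items():
            # \b treats '_' as a word character, so `make_tiled_copy`
            # does not match inside `make_tiled_copy_tv`.
            if re.search(rf"\b{re.escape(name)}\b", line):
                hits.append((lineno, name, advice))
    return hits


# Hypothetical pre-4.1.0 snippet, used only as text input for the scan.
sample = """\
for i in cutlass.range_dynamic(n):
    if cutlass.dynamic_expr(i < k):
        pass
copy = cute.make_tiled_copy_tv(...)
"""

for lineno, name, advice in find_deprecated(sample):
    print(f"line {lineno}: {name} is deprecated; {advice}")
```

Because the scan is textual, it can flag names inside comments or strings; treat the output as a list of places to review rather than a definitive report.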
4.0.0 (2025-06-03)
- Fixed API mismatch in class `cute.runtime.Pointer`: change `element_type` to `dtype` to match `typing.Pointer`
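Code that has to run against releases from both before and after this rename could read the data type through a small accessor. The sketch below assumes the rename is visible as an attribute on the pointer object, which this changelog does not spell out. `pointer_dtype` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real pointers so that the example runs without CUTLASS installed.

```python
from types import SimpleNamespace


def pointer_dtype(ptr):
    """Return the data type of a pointer-like object.

    Prefers the `dtype` attribute (the 4.0.0 name) and falls back to the
    older `element_type` attribute when `dtype` is missing.
    """
    try:
        return ptr.dtype
    except AttributeError:
        return ptr.element_type


# Stand-in objects (assumptions for illustration, not real cute.runtime.Pointer instances).
new_style = SimpleNamespace(dtype="Float16")
old_style = SimpleNamespace(element_type="Float16")

print(pointer_dtype(new_style), pointer_dtype(old_style))
```

Once support for pre-4.0.0 releases is dropped, reading `dtype` directly is simpler and the helper can be removed.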