Compare commits


21 Commits

Author SHA1 Message Date
e9a75581fe DeepGemm Support - Step 2 (#2142)
* add cpp example for DeepGemm.

* add grouped_gemm_contiguous.

* add groupedgemm_masked.

* add python example and tests.
2025-02-28 10:11:59 -05:00
ac210faef8 DeepGemm Support (#2137)
* add cpp example for DeepGemm.

* add grouped_gemm_contiguous.

* add groupedgemm_masked.
2025-02-26 07:01:12 -05:00
15f5468872 Migrate FlashMLA codes to example. (#2135) 2025-02-26 01:29:07 -05:00
af5519d938 Flash MLA Support - Step 2 (#2134)
* initial commit

* initial commit

* fix some error

* update

* bugfix

* bugfix

* change name

* Add input&output process

* minor

* update

* initial commit

* initial commit

* fix some error

* update

* bugfix

* bugfix

* change name

* minor

* update
2025-02-25 23:18:03 -05:00
415d587ebf Flash MLA support (#2130)
* initial commit

* initial commit

* fix some error

* update

* bugfix

* bugfix

* change name
2025-02-24 08:31:56 -05:00
eefa171318 [EVT] Fix Row/Col broadcast with array arguments (#2120)
* Use constexpr in if to prevent invalid comparison.

* Move constexpr check into else scope.
2025-02-21 17:47:30 -05:00
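The two bullets above describe a standard C++ technique: keep a comparison that is only valid for some template arguments inside `if constexpr`, so the invalid branch is never instantiated. The snippet below is a hedged, self-contained illustration of that technique only; the names (`PtrArrayStride`, `broadcast_offset`) are made up and it is not the actual EVT broadcast code.

```
#include <type_traits>

// Hypothetical stand-in for an array-of-pointers stride argument.
struct PtrArrayStride { static constexpr int value = 8; };

template <class Stride>
int broadcast_offset(Stride stride, int row, int col) {
  if constexpr (std::is_integral_v<Stride>) {
    // Only instantiated for scalar strides, so this comparison is always valid.
    return (stride == 0) ? col : row * stride + col;
  } else {
    // Array-style arguments never instantiate the scalar comparison above.
    return row * Stride::value + col;
  }
}

int main() {
  // With a plain `if`, the first call would force `PtrArrayStride == 0`
  // to be compiled, which is ill-formed.
  return broadcast_offset(PtrArrayStride{}, 1, 2) + broadcast_offset(4, 1, 2);
}
```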
afa1772203 truncate name for cutlass profiler (#2124)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-21 00:16:56 -05:00
9b3772dfa6 Hopper Grouped GEMM support for FP8 Accum (#2123)
* Add support for fp8accum, with profiler extension

* Update .gitignore

* contri

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-20 21:55:26 -05:00
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
e9627ce55b Always use cudaGetDriverEntryPoint with CUDA 12 (#2086)
`cudaGetDriverEntryPointByVersion` was added to the driver in 12.5, but the driver version is not known at compile time.
In particular, we can build with nvcc 12.8 against a 12.2 driver, for instance, and this was causing the following error:

```
undefined symbol: cudaGetDriverEntryPointByVersion,
```
2025-02-11 13:04:25 -05:00
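As a hedged sketch of the pattern this commit implies, the snippet below resolves a driver symbol through `cudaGetDriverEntryPoint` only, so the binary never references `cudaGetDriverEntryPointByVersion`. The four-argument signature with a `cudaDriverEntryPointQueryResult` out-parameter is assumed from the CUDA 12 runtime documentation, and the symbol name in the usage comment is a placeholder, not necessarily what CUTLASS resolves.

```
#include <cstdio>
#include <cuda_runtime_api.h>

// Resolve a CUDA driver API symbol without referencing
// cudaGetDriverEntryPointByVersion, so the binary loads on any 12.x driver.
void* resolve_driver_symbol(const char* symbol) {
  void* fn = nullptr;
  cudaDriverEntryPointQueryResult status{};
  cudaError_t err = cudaGetDriverEntryPoint(symbol, &fn, cudaEnableDefault, &status);
  if (err != cudaSuccess || status != cudaDriverEntryPointSuccess || fn == nullptr) {
    std::fprintf(stderr, "could not resolve %s: %s\n", symbol, cudaGetErrorString(err));
    return nullptr;
  }
  return fn;
}

// Usage (placeholder symbol name): resolve_driver_symbol("cuTensorMapEncodeTiled");
```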
ad6e1ec19c Add ParetoQ to PUBLICATIONS.md (#2089) 2025-02-10 16:47:02 -05:00
0642d46dd4 Update 0x_gemm_tutorial.md (#2090) 2025-02-10 16:46:43 -05:00
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
affd1b693d [EVT] Add support for Row/Col broadcast PtrArray (#2033)
* Add group support to EVT row/col broadcast.

* small modifications

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-02 12:10:07 -05:00
6f55278121 bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
bdd641790a Update README.md 2025-01-28 18:08:13 -05:00
cc19d4d22b fix a readme broken link (#2069) 2025-01-28 18:03:34 -05:00
47daa33c61 fix cuda 12.6 issues (#2066) 2025-01-28 17:28:29 -05:00
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
9eb01fa0b0 update 3.7 docs (#2051)
* update docs

* update docs

* update docs

* update docs

* update docs

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-23 15:13:50 -05:00
431 changed files with 124441 additions and 2647 deletions

.gitignore

@@ -1,3 +1,4 @@
# PyCache files
__pycache__/
cutlass_library.egg-info/
/build*

CHANGELOG.md

@@ -1,9 +1,75 @@
# NVIDIA CUTLASS Changelog
## [3.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.8.0) (2025-01-25)
* Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- [5th generation Blackwell Tensor Core instructions (TCGen05)](./include/cute/atom/mma_traits_sm100.hpp) via CuTe MMA atoms.
- Extensions to [Tensor Memory Accelerator](./include/cute/atom/copy_traits_sm100_tma.hpp) via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as [`tmem`](./include/cute/pointer.hpp) across CuTe as a first class data locale.
- Exposure of [`tmem->rmem`, `rmem->tmem` and `smem->tmem data movement instructions`](./include/cute/atom/copy_traits_sm100.hpp) as copy atoms in CuTe.
- [`make_tmem_copy()`](./include/cute/atom/copy_traits_sm100.hpp) utility method to ease creation of tiled copies for tmem copy atoms.
- Support for [new variants of LDSM on Blackwell](./include/cute/atom/copy_traits_sm100.hpp) via CuTe Copy atoms.
* Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision [FP4, FP6, and FP8](./include/cutlass/exmy_base.h) formats as well as their [block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8](./include/cutlass/float_subbyte.h)
- [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp).
- [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp).
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
* Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
+ Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
+ Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
+ Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
* [Non-block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_warpspecialized.hpp)
* [Non-block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_array_warpspecialized.hpp)
* [Block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp)
* [Block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp)
- Blackwell [collective mainloop for convolution kernels](./include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp) supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New [GEMM](./include/cutlass/gemm/dispatch_policy.hpp), [convolution](./include/cutlass/conv/dispatch_policy.hpp), and [epilogue](./include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell epilogue that supports loading accumulators from `tmem`](./include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp) and [full set of EVT fusions]().
* CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
- Support for dynamic datatypes via profiler command line argument parsing to set the dynamic datatype setting in TCGen05 MMA instruction descriptors.
- Support for mixed input GEMM kernels on Hopper in the profiler.
* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
* A new 3.x version of grouped GEMM added to the CUTLASS library, which generates kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
- [Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples](./examples/70_blackwell_gemm/), demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
- GEMM with [opt-in collective builder schedules showcasing available recipes](./examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu) for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
+ [NVFP4 inputs with BF16 output](./examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu)
+ [NVFP4 inputs with NVFP4 output](./examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu)
+ [Mixed MXFP8 and MXFP6 inputs with BF16 output](./examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm.cu)
- GEMM example demonstrating [Blackwell's new preferred cluster support via dynamic cluster shapes](./examples/73_blackwell_gemm_preferred_cluster/blackwell_gemm_preferred_cluster.cu) for increased occupancy.
- [GEMM with CLC based StreamK scheduler for load balancing](./examples/74_blackwell_gemm_streamk/blackwell_gemm_streamk.cu).
- Grouped GEMM for [vanilla FP8 data inputs](./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm.cu) and [NVFP4 block scaled inputs](./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm_block_scaled.cu).
- Convolution kernels for [fprop](./examples/76_blackwell_conv/76_blackwell_conv_fprop.cu), [dgrad](./examples/76_blackwell_conv/76_blackwell_conv_dgrad.cu), and [wgrad](./examples/76_blackwell_conv/76_blackwell_conv_wgrad.cu).
- [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
- A new BF16x9 GEMM [kernel](./examples/78_blackwell_emulated_bf16x9_gemm/78_blackwell_emulated_bf16x9_gemm.cu) that emulates FP32 GEMM (SGEMM) using BF16 operations.
* Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new [Hopper grouped GEMM kernels](./examples/69_hopper_mixed_dtype_grouped_gemm/) that support mixed A and B datatypes.
- A new [Hopper FP8 GEMM with groupwise scaling](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu).
* Documentation updates:
- [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
- A new [functionality documentation](./media/docs/functionality.md) specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x supported architectures.
- Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).
- Updates to [profiler documentation](./media/docs/profiler.md) for testing mixed input GEMM kernels on Hopper.
## [3.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.7.0) (2025-01-11)
- [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) uses a 2D scaling tensor, assigning one value per threadblock. This allows finer-grained scaling to be applied for each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are [here](https://github.com/NVIDIA/cutlass/pull/1932#issue-2645398439), and a plain C++ reference sketch of the scaling scheme follows this changelog excerpt.
- [Distributed GEMM](./examples/65_distributed_gemm/65_distributed_gemm.cu) is a new (experimental) API which can turn existing CUTLASS GEMM kernels into pipelined Tensor Parallel GEMMs that run efficiently on NVLink-based network of GPUs. Its pipelining schedules can hide most of the communication behind computation, and relies on point-to-point communication, which can simply use CUDA runtime's peer device access feature. It also utilizes remote TMA loads and memcopies with CUDA graphs to handle communication primarily through the Copy Engine, leaving all SMs free for Hopper's persistent kernels. For more details you can refer to the [DistGEMM blog post](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b).
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new `make_kernel_hardware_info` API as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
- Potential API breaking changes:
+ Fix `cute::UniversalCopy` for type safety.
+ No longer implicitly select `cute::SM80_CP_ASYNC_*` based on input tensors. This avoids implicit downstream synchronization requirements. To use `SM80_CP_ASYNC`, users must explicitly select the appropriate CopyAtom.
+ Fix `cute::SM80_CP_ASYNC_CACHEALWAYS`, `cute::SM80_CP_ASYNC_CACHEGLOBAL`, `cute::SM80_CP_ASYNC_CACHEALWAYS_ZFILL`, `cute::SM80_CP_ASYNC_CACHEGLOBAL_ZFILL` to avoid implicitly selecting `ZFILL` behavior on predication.
+ Remove `cute::copy_vec<T>` in favor of `cute::copy_aligned` and `cute::copy(AutoVectorizingCopyWithAssumedAlignment<NumBits>,...)`.
+ A refactor of default epilogue struct `DefaultEpilogue` [API](./include/cutlass/epilogue/collective/default_epilogue.hpp) to avoid reading non-void `ElementC` value for `ElementC = void` kernel.
- New CUTLASS profiler flags: `profiling-duration`, `min-iterations`, and `kernels-file` documented in [profiler.md](./media/docs/profiler.md#cutlass-profiler).
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit versions 12.6.
@@ -15,12 +81,7 @@
+ [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)
+ [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)
- A refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. Now the 3.x convolution API is no longer considered as a beta API.
- Improve [mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md).
+ Added a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
+ Added [layout pre-shuffling](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu#L50-55) to optimize memory loading.
+ Added [interleaved conversion](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu#L50-52) for `{INT4, UINT4, INT8}` x `{FP16, BF16}`.
+ Other general optimizations.
- The suffixes of the mixed input kernel schedules have been removed. Use `KernelTmaWarpSpecialized`, `KernelTmaWarpSpecializedPingpong` and `KernelTmaWarpSpecializedCooperative` instead.
- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and [GEMM example using those](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu).
- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](./media/docs/dependent_kernel_launch.md).
- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.
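As a hedged illustration of the blockwise-scaled FP8 GEMM described in the 3.7.0 note above, the sketch below spells out the reference semantics in plain C++: one scale value per threadblock-sized tile of each operand, applied to the partial product of each gemm-k iteration. Tile sizes, scale-tensor layouts, and all names are assumptions for illustration, not the CUTLASS kernel's API.

```
// Reference semantics only: one scale per (row-tile, k-tile) of A and per
// (k-tile, column-tile) of B, applied per gemm-k iteration. Assumes M, N, K
// are divisible by the tile sizes. All names here are illustrative.
#include <vector>

void blockwise_scaled_gemm_ref(int M, int N, int K, int TileM, int TileN, int TileK,
                               const std::vector<float>& A,        // M x K, row-major (dequantized FP8)
                               const std::vector<float>& B,        // K x N, row-major (dequantized FP8)
                               const std::vector<float>& scale_A,  // (M/TileM) x (K/TileK), row-major
                               const std::vector<float>& scale_B,  // (K/TileK) x (N/TileN), row-major
                               std::vector<float>& D) {            // M x N, row-major
  int kTiles = K / TileK;
  int nTiles = N / TileN;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int kt = 0; kt < kTiles; ++kt) {
        // Accumulate one gemm-k iteration's partial product ...
        float partial = 0.f;
        for (int k = kt * TileK; k < (kt + 1) * TileK; ++k) {
          partial += A[m * K + k] * B[k * N + n];
        }
        // ... then apply the per-tile scales for this k step.
        float sa = scale_A[(m / TileM) * kTiles + kt];
        float sb = scale_B[kt * nTiles + (n / TileN)];
        acc += sa * sb * partial;
      }
      D[m * N + n] = acc;
    }
  }
}
```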

CMakeLists.txt

@@ -114,6 +114,13 @@ set(CUTLASS_TEST_LEVEL "0" CACHE STRING "Level of tests to compile.")
find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)
################################################################################
include(customConfigs.cmake)
################################################################################
set(CUTLASS_ENABLE_HEADERS_ONLY OFF CACHE BOOL "Enable only the header library")
if(CUTLASS_ENABLE_HEADERS_ONLY)
@@ -143,14 +150,14 @@ set(CUTLASS_ENABLE_PERFORMANCE ${CUTLASS_ENABLE_PROFILER} CACHE BOOL "Enable CUT
set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests")
set(CUTLASS_ENABLE_GTEST_UNIT_TESTS ${CUTLASS_ENABLE_TESTS} CACHE BOOL "Enable CUTLASS GTest-based Unit Tests")
set(CUTLASS_USE_SYSTEM_GOOGLETEST OFF CACHE BOOL "Use system/external installation of GTest")
set(CUTLASS_USE_PACKED_TUPLE ON CACHE BOOL "If ON, make cute::tuple be new standard-layout tuple type; if OFF, use the original cute::tuple implementation that is _not_ standard-layout.")
if (CUTLASS_USE_PACKED_TUPLE)
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_USE_PACKED_TUPLE=1)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DCUTLASS_USE_PACKED_TUPLE=1")
message(STATUS "Make cute::tuple be the new standard-layout tuple type")
else()
message(STATUS "Use the original cute::tuple implementation that is _not_ standard-layout")
endif()
if (CUTLASS_ENABLE_TESTS AND CUTLASS_ENABLE_PROFILER)
set(CUTLASS_ENABLE_PROFILER_UNIT_TESTS_INIT ON)
else()
set(CUTLASS_ENABLE_PROFILER_UNIT_TESTS_INIT OFF)
endif()
set(CUTLASS_ENABLE_PROFILER_UNIT_TESTS ${CUTLASS_ENABLE_PROFILER_UNIT_TESTS_INIT} CACHE BOOL "Enable CUTLASS Profiler-based Unit Tests")
set(CUTLASS_ENABLE_SELF_CONTAINED_INCLUDES_CHECK ON CACHE BOOL "Enable CUTLASS check for self-contained header includes")
################################################################################
@@ -164,6 +171,11 @@ endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 12.0)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 90a)
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 12.8)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 100 100a)
endif()
set(CUTLASS_NVCC_ARCHS ${CUTLASS_NVCC_ARCHS_SUPPORTED} CACHE STRING "The SM architectures requested.")
set(CUTLASS_NVCC_ARCHS_ENABLED ${CUTLASS_NVCC_ARCHS} CACHE STRING "The SM architectures to build code for.")
@@ -383,9 +395,18 @@ endif()
###################################################################################################
#
# Blackwell features
#
###################################################################################################
# Warnings-as-error exceptions and warning suppressions for Clang builds
if (CUTLASS_CLANG_HOST_COMPILE)
set(FLAGS_TO_ADD
"-Wno-error=implicit-int-conversion"
"-Wno-error=pass-failed"
@@ -393,13 +414,13 @@ if (CUTLASS_CLANG_HOST_COMPILE)
"-Wno-sign-conversion"
"-Wno-unused-parameter"
)
foreach(FLAG ${FLAGS_TO_ADD})
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${FLAG}")
list(APPEND CUTLASS_CUDA_NVCC_FLAGS "${FLAG}")
list(APPEND CUTLASS_CUDA_CLANG_FLAGS "${FLAG}")
endforeach()
endif()
if (NOT MSVC AND CUTLASS_NVCC_KEEP)
@@ -465,7 +486,7 @@ if (CUTLASS_CLANG_DEVICE_COMPILE)
link_libraries(nvidia::cudart)
link_libraries(nvidia::cuda_driver)
endif()
#Report CUDA build flags
@@ -540,7 +561,7 @@ function(cutlass_apply_cuda_gencode_flags TARGET)
list(APPEND __CMAKE_CUDA_ARCHS ${ARCH}-real)
endif()
if(CUTLASS_NVCC_EMBED_PTX AND NOT CUTLASS_CLANG_DEVICE_COMPILE)
# If we're using clang for device compilation, the ptx is inserted
# via another command line option and the `-virtual` flags will cause an error.
list(APPEND __CMAKE_CUDA_ARCHS ${ARCH}-virtual)
endif()
@@ -901,7 +922,7 @@ function(cutlass_add_executable_tests NAME TARGET)
if (NOT __DO_NOT_LOWERCASE_TEST_NAME)
string(TOLOWER "${TESTCASE_NAME}" TESTCASE_NAME)
endif()
# The following rigmarole is needed to deal with spaces and possible quotes in
# command line arguments. The options are passed "by reference" as the actual
# variable names holding the real options. We then expand these in a way that
@@ -958,6 +979,99 @@ function(cutlass_add_executable_tests NAME TARGET)
endfunction()
function(cutlass_generate_profiler_tests NAME)
set(options)
set(oneValueArgs)
set(multiValueArgs DEPENDS DEPENDEES CUTLASS_PROFILER_EXTRA_OPTIONS)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
if (NOT CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS AND NOT CUTLASS_BUILD_FOR_PROFILER_PERFORMANCE_REGRESSIONS)
return()
endif()
install(
FILES ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}
DESTINATION ${CMAKE_INSTALL_INFODIR}/cutlass/
RENAME profiler_regressions.csv
)
# Generate cmake test targets for each entry in the testlist csv
if (NOT EXISTS "${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
message(SEND_ERROR "Profiler unit tests list path is invalid: CUTLASS_PROFILER_REGRESSION_LIST_FILE = ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
else()
message(STATUS "Using ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} to generate profiler-based tests.")
endif()
file(STRINGS ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} TEST_LIST)
foreach(TEST IN LISTS TEST_LIST)
set(TEMP_TEST ${TEST})
if ("${TEST}" MATCHES " *cutlass_profiler.*")
# Generate a flattened name for the test from the test command line.
string(REPLACE "," ";" TEST_NAME_LIST ${TEMP_TEST})
string(REGEX REPLACE "\\*" "_" TEST_NAME "${TEMP_TEST}")
string(REGEX REPLACE "\\\"\\{\\\"\\\"input_params.*\\{.*\\}\\}\\\"" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "\\\"\\{\\\"\\\"input_params.*\\{.*\\}\\}\\\"" "" TEST "${TEST}")
string(REGEX REPLACE "," ";" TEST "${TEST}")
string(REGEX MATCHALL "[a-zA-Z0-9_=]+" TEST_NAME "${TEST_NAME}")
list(FILTER TEST_NAME EXCLUDE REGEX "cutlass_profiler|mode=trace|providers=cutlass")
list(JOIN TEST_NAME "_" TEST_NAME)
string(REGEX REPLACE "_verification_required=(true|false)" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_verification_providers=device" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "batch_count=" "batch" TEST_NAME "${TEST_NAME}")
string(REPLACE "cluster_m=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_cluster_n=" "x" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "_cluster_k=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "cluster_m_fallback=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_cluster_n_fallback=" "x" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "_cluster_k_fallback=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "runtime_input_datatype_a=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "runtime_input_datatype_b=" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "verification_enabled=(true|false)" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "warmup_iterations=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "profiling_iterations=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "sleep_duration=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "profiling_enabled=(true|false)" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_error_on_no_match" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_error_if_nothing_is_profiled" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "kernels" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "operation" "" TEST_NAME "${TEST_NAME}")
if (NOT __DO_NOT_LOWERCASE_TEST_NAME)
string(TOLOWER "${TEST_NAME}" TEST_NAME)
endif()
# Munge the test command
string(REPLACE "cutlass_profiler" "" TEST "${TEST}")
set(TEST "${TEST}" ${__CUTLASS_PROFILER_EXTRA_OPTIONS} "--junit-output=${TEST_NAME}")
set(TEST_COMMAND_${TEST_NAME} "${TEST}")
list(APPEND TEST_COMMAND_VARS ${TEST_NAME})
endif()
endforeach()
cutlass_add_executable_tests(
${NAME} cutlass_profiler
DEPENDS ${__DEPENDS}
DEPENDEES ${__DEPENDEES}
TEST_COMMAND_OPTIONS ${TEST_COMMAND_VARS}
TEST_COMMAND_OPTIONS_PREFIX TEST_COMMAND_
DISABLE_EXECUTABLE_INSTALL_RULE
# Uncomment the following line when alloc/dealloc tracking
# is fixed for all configurations.
# TEST_SETS_SUPPORTED tmem_alloc_tracking
)
endfunction()
if (CUTLASS_ENABLE_TOOLS)
add_subdirectory(tools)
if (CUTLASS_ENABLE_PROFILER)
@@ -975,6 +1089,14 @@ if (CUTLASS_ENABLE_TESTS)
if (CUTLASS_ENABLE_GTEST_UNIT_TESTS)
add_dependencies(test_all test_unit)
endif()
if (CUTLASS_ENABLE_PROFILER_UNIT_TESTS AND CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS)
# Generate profiler based unit test
cutlass_generate_profiler_tests(
tup
DEPENDEES test_unit
)
endif()
endif()
if (CUTLASS_INSTALL_TESTS)

CONTRIBUTORS.md

@@ -2,51 +2,104 @@
[README](./README.md#documentation) > **Contributors**
# CUTLASS Developers and Contributors
# CUTLASS Developers **
This is the official list of CUTLASS developers and contributors.
## DEVELOPERS
Vijay Thakkar<br />
Pradeep Ramani<br />
Cris Cecka<br />
Aniket Shivam<br />
Jack Kosaian<br />
Mark Hoemmen<br />
Richard Cai<br />
Honghao Lu<br />
Ethan Yan<br />
Haicheng Wu<br />
Andrew Kerr<br />
Dustyn Blasig<br />
Fengqi Qiao<br />
Duane Merrill<br />
Yujia Zhai<br />
Rawn Henry<br />
Sergey Klevtsov<br />
Shang Zhang<br />
Piotr Majcher<br />
Paul Springer<br />
Markus Hohnerbach<br />
Jin Wang<br />
Dustyn Blasig<br />
Albert Xu<br />
Junkai Wu<br />
Xiuxia Zhang<br />
Haicheng Wu<br />
Jack Yang<br />
Pradeep Ramani<br />
Aditya Atluri<br />
Han Li<br />
Nick Zhao<br />
Ivan Yin<br />
Yu-Jung Chen<br />
Markus Hoehnerbach<br />
Honghao Lu<br />
Mihir Awatramani<br />
Hao Sheng<br />
Zekun Fan<br />
Aniket Shivam<br />
Siyu Liu<br />
Richard Cai<br />
Vikas Gupta<br />
Ethan Yan<br />
Vijay Thakkar<br />
Cris Cecka<br />
Lawrence Ryan<br />
Qun Song<br />
Daniel Ricketts<br />
dePaul Miller<br />
Yuhan Li<br />
Saman Ashkiani<br />
Jack Chen<br />
Shang Zhang<br />
Petrick Liu<br />
Questa Wang<br />
Pramod Shenoy<br />
Jack Kosaian<br />
Yujia Zhai<br />
Zhaodong Chen<br />
Manas Sahni<br />
Shunfan Shao<br />
Fengqi Qiao<br />
Serif Yesil<br />
Aragorn Guan<br />
Heidi He<br />
Xiao Song<br />
Sergey Klevtsov<br />
Jiang Shao<br />
Ruqing Xu<br />
Mengyu Guo<br />
Tao Xie<br />
Linfeng Zheng<br />
Harrison Barclay<br />
Wenfei Tang<br />
Diksha Gohlyan<br />
Alexander Zhurkevich<br />
Siyuan Fu<br />
Hua Huang<br />
Xiufan Liang<br />
Ian Tramble<br />
Ali Hassani<br />
Shreya Gaur<br />
** _The list is sorted in order of the author's first contribution to the CUTLASS project._
# CUTE Developers
## CuTe
Cris Cecka<br />
Vijay Thakkar<br />
## CUTLASS Product Manager
# CUTLASS Product Manager
Matthew Nicely<br />
## Former CUTLASS Developers
Manish Gupta<br />
Naila Farooqui<br />
David Tanner<br />
Manikandan Ananth<br />
Zhaodong Chen<br />
Chinmay Talegaonkar<br />
## CONTRIBUTORS
# Former CUTLASS Developers
Manish Gupta<br />
Duane Merrill<br />
Piotr Majcher<br />
Naila Farooqui<br />
Mark Hoemmen<br />
Rawn Henry<br />
Jin Wang<br />
Timmy Liu<br />
Manikandan Ananth<br />
David Tanner<br />
# Acknowledgements
Tri Dao<br />
Jay Shah<br />
Timothy Costa<br />
Julien Demouth<br />
Brian Fahs<br />
@@ -56,25 +109,15 @@ Mostafa Hagog<br />
Fei Hu<br />
Alan Kaatz<br />
Tina Li<br />
Timmy Liu<br />
Wei Liu<br />
Tim Martin<br />
Duane Merrill<br />
Kevin Siu<br />
Markus Tavenrath<br />
John Tran<br />
Vicki Wang<br />
Junkai Wu<br />
Fung Xie<br />
Albert Xu<br />
Yang Xu<br />
Jack Yang<br />
Scott Yokim<br />
Xiuxia Zhang<br />
Nick Zhao<br />
## ACKNOWLEDGEMENTS
Girish Bharambe<br />
Luke Durant<br />
Carter Edwards<br />

PUBLICATIONS.md

@@ -1,5 +1,9 @@
# Publications Using Cutlass
## 2025
- ["ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"](https://arxiv.org/abs/2502.02631). Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra. _arXiv_, February 2025.
## 2024
- ["ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"](https://arxiv.org/abs/2410.21465). Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. _arXiv_, October 2024.

README.md

@@ -1,8 +1,8 @@
![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# CUTLASS 3.7.0
# CUTLASS 3.8.0
_CUTLASS 3.7.0 - January 2025_
_CUTLASS 3.8.0 - January 2025_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
@@ -16,63 +16,103 @@ as building blocks within custom kernels and applications.
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32),
[FP32 emulation via tensor core instruction](./examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
CUTLASS demonstrates warp-synchronous matrix multiply operations
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
[FP32 emulation via tensor core instruction](./examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
8b floating point types (e5m2 and e4m3),
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
narrow integer types (4 and 8b signed and unsigned integers),
and binary 1b data types (where architectures allow for the
native support of such data types).
CUTLASS demonstrates optimal matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, Ampere, and Hopper architectures.
NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via
the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution
operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
See the [Quick Start Guide](./media/docs/quickstart.md) to get started quickly.
See the [functionality listing](./media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.
See the [functionality docs](./media/docs/functionality.md) for a more comprehensive
list of kernel level features, data types, instructions, and minimum CUDA toolkit versions supported by CUTLASS on each GPU
architecture.
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.
# What's New in CUTLASS 3.8
The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design
and improves code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](./media/docs/cute/00_quickstart.md).
* Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- [5th generation Blackwell Tensor Core instructions (TCGen05)](./include/cute/atom/mma_traits_sm100.hpp) via CuTe MMA atoms.
- Extensions to [Tensor Memory Accelerator](./include/cute/atom/copy_traits_sm100_tma.hpp) via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as [`tmem`](./include/cute/pointer.hpp) across CuTe as a first class data locale.
- Exposure of [`tmem->rmem`, `rmem->tmem` and `smem->tmem data movement instructions`](./include/cute/atom/copy_traits_sm100.hpp) as copy atoms in CuTe.
- [`make_tmem_copy()`](./include/cute/atom/copy_traits_sm100.hpp) utility method to ease creation of tiled copies for tmem copy atoms.
- Support for [new variants of LDSM on Blackwell](./include/cute/atom/copy_traits_sm100.hpp) via CuTe Copy atoms.
* Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision [FP4, FP6, and FP8](./include/cutlass/exmy_base.h) formats as well as their [block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8](./include/cutlass/float_subbyte.h)
- [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp).
- [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp).
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
* Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
+ Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
+ Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
+ Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
* [Non-block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_warpspecialized.hpp)
* [Non-block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_array_warpspecialized.hpp)
* [Block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp)
* [Block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp)
- Blackwell [collective mainloop for convolution kernels](./include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp) supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New [GEMM](./include/cutlass/gemm/dispatch_policy.hpp), [convolution](./include/cutlass/conv/dispatch_policy.hpp), and [epilogue](./include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell epilogue that supports loading accumulators from `tmem`](./include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp) and [full set of EVT fusions]().
* CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
- Support for dynamic datatypes via profiler command line argument parsing to set the dynamic datatype setting in TCGen05 MMA instruction descriptors.
- Support for mixed input GEMM kernels on Hopper in the profiler.
* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
* A new 3.x version of grouped GEMM added to the CUTLASS library, which generates kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
- [Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples](./examples/70_blackwell_gemm/), demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
- GEMM with [opt-in collective builder schedules showcasing available recipes](./examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu) for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
+ [NVFP4 inputs with BF16 output](./examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu)
+ [NVFP4 inputs with NVFP4 output](./examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu)
+ [Mixed MXFP8 and MXFP6 inputs with BF16 output](./examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm.cu)
- GEMM example demonstrating [Blackwell's new preferred cluster support via dynamic cluster shapes](./examples/73_blackwell_gemm_preferred_cluster/blackwell_gemm_preferred_cluster.cu) for increased occupancy.
- [GEMM with CLC based StreamK scheduler for load balancing](./examples/74_blackwell_gemm_streamk/blackwell_gemm_streamk.cu).
- Grouped GEMM for [vanilla FP8 data inputs](./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm.cu) and [NVFP4 block scaled inputs](./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm_block_scaled.cu).
- Convolution kernels for [fprop](./examples/76_blackwell_conv/76_blackwell_conv_fprop.cu), [dgrad](./examples/76_blackwell_conv/76_blackwell_conv_dgrad.cu), and [wgrad](./examples/76_blackwell_conv/76_blackwell_conv_wgrad.cu).
- [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
- A new BF16x9 GEMM [kernel](./examples/78_blackwell_emulated_bf16x9_gemm/78_blackwell_emulated_bf16x9_gemm.cu) that emulates FP32 GEMM (SGEMM) using BF16 operations.
* Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new [Hopper grouped GEMM kernels](./examples/69_hopper_mixed_dtype_grouped_gemm/) that support mixed A and B datatypes.
- A new [Hopper FP8 GEMM with groupwise scaling](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu).
* Documentation updates:
- [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
- A new [functionality documentation](./media/docs/functionality.md) specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x supported architectures.
- Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).
In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
Note: CUTLASS 3.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
# What's New in CUTLASS 3.7
CUTLASS 3.7.0 is an update to CUTLASS adding:
- A new [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) where the operands and block scaling tensor are staged via shared memory.
- [Distributed GEMM](./examples/65_distributed_gemm/65_distributed_gemm.cu) is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide the most of communication behind computation.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new `make_kernel_hardware_info` API as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
Minimum requirements:
- Architecture: Volta
- Compiler: Must support at least C++17
- CUDA Toolkit version: 11.4
Starting from CUTLASS 3.0, CUTLASS removed support for the following:
- Maxwell and Pascal GPU architectures
- Ubuntu 16.04
- CUDA 10.2
- C++ language versions less than 17.
**See the [CHANGELOG](CHANGELOG.md) for a detailed listing of releases and updates.**
**See the [CHANGELOG](CHANGELOG.md) for details of all past releases and updates.**
# Performance
<p align="center"><img src=media/images/cutlass-3.5.1-gemm-peak-performance.png></p>
<p align="center"><img src=media/images/cutlass-3.5.1-gemm-peak-performance-fp8.png></p>
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit peak performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows the continual CUTLASS performance improvements
they exhibit nearly optimal utilization of peak theoretical throughput. The figure below
shows CUTLASS 3.8's performance as a % of theoretical peak utilization
on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
<p align="center"><img src=media/images/cutlass-3.8-blackwell-gemm-peak-performance.svg></p>
The two figures below show the continual CUTLASS performance improvements
on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since
CUTLASS 3.1.
CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads).
@@ -80,20 +120,45 @@ Tensor Core operations are implemented using CUDA's
[mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and
[wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions.
<p align="center"><img src=media/images/cutlass-2.9-implicit-gemm-performance.png></p>
<p align="center"><img src=media/images/cutlass-3.5.1-gemm-peak-performance.png></p>
<p align="center"><img src=media/images/cutlass-3.5.1-gemm-peak-performance-fp8.png></p>
When using CUTLASS building blocks to construct device-wide implicit gemm (Fprop, Dgrad, and Wgrad)
kernels, CUTLASS performance is also comparable to cuDNN when running Resnet-50 layers on an [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/)
as shown in the above figure. Tensor Core operations are implemented using CUDA's
[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
# CuTe
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts
which can be composed with data arrays to represent tensors.
The representation of layouts is powerful enough to represent nearly
everything we need to implement efficient dense linear algebra.
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
[dedicated documentation directory](./media/docs/cute/00_quickstart.md).
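As a minimal sketch of the `Layout` and `Tensor` abstractions described above, the host-only snippet below uses a handful of basic CuTe calls (`make_shape`, `make_stride`, `make_layout`, `make_tensor`). It assumes a C++17 compiler with the CUTLASS `include/` directory on the include path; the concrete shape and the printed format are illustrative.

```
#include <cassert>
#include <vector>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A 4x8 column-major layout: shape (4,8) with strides (1,4).
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<1>{}, Int<4>{}));

  // A Layout is a function from logical coordinates to linear offsets:
  // (row 1, col 2) maps to 1*1 + 2*4 = 9.
  assert(layout(1, 2) == 9);

  // A Tensor pairs a data pointer with a layout and does the indexing for you.
  std::vector<float> storage(size(layout));
  auto tensor = make_tensor(storage.data(), layout);
  tensor(1, 2) = 3.5f;       // writes storage[9]
  assert(storage[9] == 3.5f);

  print(layout);             // prints the shape:stride pair, e.g. (_4,_8):(_1,_4)
  return 0;
}
```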
# Compatibility
Minimum requirements:
- Architecture: Volta (compute capability 7.0)
- Compiler: Must support at least C++17
- CUDA Toolkit version: 11.4
CUTLASS requires a C++17 host compiler and
performs best when built with the [**CUDA 12.4 Toolkit**](https://developer.nvidia.com/cuda-downloads).
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, CUDA 12.0, CUDA 12.1, CUDA 12.2.2, CUDA 12.3.1 and CUDA 12.3.2.
performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads).
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.
## Operating Systems
We have tested the following environments.
|**Operating System** | **Compiler** |
@@ -101,47 +166,74 @@ We have tested the following environments.
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.3.0 |
| Ubuntu 22.04 | GCC 11.2.0 |
| Ubuntu 22.04 | Clang 10.0.0 |
| Ubuntu 22.04 | Clang 14.0.6 |
| Ubuntu 22.04 | Clang 17.0.6 |
| Windows 10.0 | Visual Studio 2019 v16.11.27 |
Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
Note: CUTLASS 3.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
## Hardware
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
|---|---|---|
|NVIDIA V100 Tensor Core GPU |7.0|11.4|
|NVIDIA TitanV |7.0|11.4|
|NVIDIA GeForce RTX 2080 TI, 2080, 2070 |7.5|11.4|
|NVIDIA GeForce RTX 20x0 series |7.5|11.4|
|NVIDIA T4 |7.5|11.4|
|NVIDIA A100 Tensor Core GPU |8.0|11.4|
|NVIDIA A10 |8.6|11.4|
|NVIDIA GeForce RTX 3090 |8.6|11.4|
|NVIDIA GeForce RTX 4090 |8.9|11.8|
|NVIDIA GeForce RTX 30x0 series |8.6|11.4|
|NVIDIA GeForce RTX 40x0 series |8.9|11.8|
|NVIDIA L40 |8.9|11.8|
|NVIDIA H100 Tensor Core GPU |9.0|11.8|
|NVIDIA H200 Tensor Core GPU |9.0|11.8|
|NVIDIA B200 Tensor Core GPU |10.0|12.8|
## Target Architecture
In general, PTX code generated for one target architecture can be run on future architectures (i.e., it is forward compatible). However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose PTX does not have forward compatibility guarantees. Several Hopper PTX instructions fall under this category of architecture-accelerated features, and thus require a `sm_90a` target architecture (note the "a" appended). For more details on this and other architecture-accelerated instructions, please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
In general, PTX code generated for one target architecture can be run on future architectures
(i.e., it is forward compatible).
However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose
PTX does not have forward compatibility guarantees.
Several Hopper and Blackwell PTX instructions fall under this category of
architecture-accelerated features, and thus require a `sm_90a` or `sm100a` target architecture
(note the "a" appended). For more details on this and other architecture-accelerated instructions,
please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
The target architecture information is passed on to CUTLASS via the cmake flag `CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100, users are required to build CUTLASS with `90a` as the target architecture. If a user accidentally builds a kernel which uses SM90a features (e.g. Hopper Tensor Core Instructions), using the SM90 target (note the lack of "a"), with either CUDA Toolkit 12 or 11.8, the kernel is expected to fail with a runtime error.
The target architecture information is passed on to CUTLASS via the cmake flag
`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,
users are required to build CUTLASS with `90a` as the target architecture.
If a user accidentally builds a kernel which uses SM90a features
(e.g. Hopper Tensor Core Instructions), using the SM90 target
(note the lack of "a"), with either CUDA Toolkit 12 or 11.8,
the kernel is expected to fail with a runtime error.
```
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
```
Or
```
cmake .. -DCUTLASS_NVCC_ARCHS="100a"
```
Please refer to the [functionality documentation](./media/docs/functionality.md) for details on which kernels require which target architectures.
Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
products has a different compute capability than the one underpinning
NVIDIA Blackwell GeForce RTX 50 series GPUs. As a result, kernels
compiled for Blackwell SM100 architecture with arch conditional features
(using `sm100a`) are not compatible with RTX 50 series GPUs.
Please refer to the [functionality documentation](./media/docs/functionality.md)
for details on which kernels require which target architectures.
# Documentation
CUTLASS is described in the following documents and the accompanying
[Doxygen documentation](https://nvidia.github.io/cutlass).
- [Quick Start Guide](./media/docs/quickstart.md) - build and run CUTLASS
- [Quick Start Guide](./media/docs/quickstart.md) - basics of building and running CUTLASS
- [Functionality](./media/docs/functionality.md) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](./media/docs/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
- [CUTLASS 3.x Design](./media/docs/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
@@ -155,7 +247,7 @@ CUTLASS is described in the following documents and the accompanying
- [Layouts](./media/docs/layout.md) - describes layouts of matrices and tensors in memory
- [Tile Iterators](./media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Profiler](./media/docs/profiler.md) - command-line driven profiling application
- [CUTLASS Utilities](./media/docs/utilities.md) - additional templates used to facilate rapid development
- [CUTLASS Utilities](./media/docs/utilities.md) - additional templates used to facilitate rapid development
- [Dependent kernel launch](./media/docs/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent
kernels in the same stream, and how it is used in CUTLASS.
@@ -163,11 +255,11 @@ kernels in the same stream, and how it is used in CUTLASS.
We have also described the structure of an efficient GEMM in our talk at the
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
- [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)
# Building CUTLASS

customConfigs.cmake

@@ -0,0 +1,92 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Profiler based functional testing
set(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS OFF CACHE BOOL "Utilize profiler-based functional regressions")
set(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL ${CUTLASS_TEST_LEVEL} CACHE STRING "Profiler functional regression test level")
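# Illustrative configuration for these options (an assumption for documentation
# purposes, not part of the shipped build instructions); SM100a is required by
# the architecture check further below:
#   cmake .. -DCUTLASS_NVCC_ARCHS=100a \
#            -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON \
#            -DCUTLASS_PROFILER_REGRESSION_TEST_LEVEL=0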
find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)
function(cutlass_generate_kernel_filter_and_testlists_files)
set(options)
set(oneValueArgs TEST_SET_NAME)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
execute_process(
COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CUTLASS_LIBRARY_PACKAGE_DIR}
${Python3_EXECUTABLE} ${CUTLASS_SOURCE_DIR}/python/cutlass_library/generator.py
--generator-target=${__TEST_SET_NAME}
--cuda-version=${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}
--architectures=${CUTLASS_NVCC_ARCHS}
--kernels=\*
--disable-cutlass-package-imports
WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
RESULT_VARIABLE cutlass_FILTER_GENERATION_RESULT
OUTPUT_VARIABLE cutlass_FILTER_GENERATION_OUTPUT
OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
ERROR_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
)
if(NOT cutlass_FILTER_GENERATION_RESULT EQUAL 0)
message(FATAL_ERROR "Error generating kernel filters and testlists files. See ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log")
endif()
endfunction()
if(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS)
set(PROFILER_ARCH_LIST 100a)
foreach(ARCH IN LISTS CUTLASS_NVCC_ARCHS)
if(NOT (ARCH IN_LIST PROFILER_ARCH_LIST))
message(FATAL_ERROR "Only SM100a compute capability is supported with profiler-based unit tests")
endif()
endforeach()
if(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 0)
message(STATUS "Building for L0 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l0)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
elseif (CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 1)
message(STATUS "Building for L1 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l1)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
endif()
endif()

View File

@ -483,12 +483,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -540,6 +540,15 @@ int main(int argc, char const **args) {
<< "later (compute capability 90 or greater) and CUDA 12.0 or greater.\n";
return 0;
}
else if (__CUDACC_VER_MAJOR__ < 12 || props.major != 9 || props.minor != 0) {
std::cout
<< "This example requires a GPU of NVIDIA's Hopper Architecture "
<< "(compute capability 90) and CUDA 12.0 or greater.\n";
return 0;
}
//
// Parse options
//

View File

@ -356,6 +356,15 @@ int main(int argc, char const **args) {
<< "later (compute capability 90 or greater) and CUDA 12.0 or greater.\n";
return 0;
}
else if (__CUDACC_VER_MAJOR__ < 12 || props.major != 9 || props.minor != 0) {
std::cout
<< "This example requires a GPU of NVIDIA's Hopper Architecture "
<< "(compute capability 90) and CUDA 12.0 or greater.\n";
return 0;
}
//
// Parse options
//

View File

@ -626,6 +626,13 @@ int main(int argc, const char ** argv) {
std::cerr << "This example requires a device with compute capability 90 or higher.\n";
notSupported = true;
}
else if (props.major != 9 || props.minor != 0) {
std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
notSupported = true;
}
if (notSupported) {
return EXIT_SUCCESS; // Do not fail CI checks on unsupported systems
}

View File

@ -750,6 +750,13 @@ int main(int argc, char const **argv)
std::cerr << "This example requires a device with compute capability 90 or higher.\n";
notSupported = true;
}
else if (props.major != 9 || props.minor != 0) {
std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
notSupported = true;
}
if (notSupported) {
return EXIT_SUCCESS; // Do not fail CI checks on unsupported systems
}

View File

@ -566,12 +566,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -103,11 +103,10 @@
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "mixed_dtype_utils.hpp"
#include "packed_scale.hpp"
#include "reorder_utils.hpp"
using namespace cute;
@ -144,8 +143,8 @@ using StrideB = cutlass::detail::TagToStrideB_t<LayoutB>;
using ValueShuffle = Layout<Shape<_2,_4>, Stride<_4,_1>>; // order [0,2,4,6,1,3,5,7]
int constexpr NumShuffleAtoms = 1;
using MmaAtomShape = Layout<Shape<_1,Int<NumShuffleAtoms>>>;
using LayoutAtomQuant = decltype(compute_memory_reordering_atom<MmaType, MmaAtomShape, ValueShuffle>());
using LayoutB_Reordered = decltype(tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{}));
using LayoutAtomQuant = decltype(cutlass::compute_memory_reordering_atom<MmaType, MmaAtomShape, ValueShuffle>());
using LayoutB_Reordered = decltype(cute::tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{}));
using ElementScale = MmaType;
using ElementZero = ElementScale;
@ -438,14 +437,15 @@ void initialize(Options const& options) {
auto shape_scale_zero = cute::make_shape(options.n, scale_k, options.l);
stride_S = cutlass::make_cute_packed_stride(StrideS{}, cute::make_shape(options.n, scale_k, options.l));
stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l));
auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref);
auto layout_scale_zero = cute::make_layout(shape_scale_zero, stride_S_ref);
dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g);
cudaStream_t stream = cudaStreamDefault;
cutlass::dequantize(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g, stream);
if (options.shuffle) {
// Repeat the reorder layout atom to tile the whole tensor shape
layout_B_reordered = tile_to_shape(LayoutAtomQuant{}, shape_B);
reorder_tensor(block_B.get(), layout_B, layout_B_reordered);
layout_B_reordered = cute::tile_to_shape(LayoutAtomQuant{}, shape_B);
cutlass::reorder_tensor(block_B.get(), layout_B, layout_B_reordered);
print("Quantized tensor layout: ");
print(layout_B_reordered);
@ -613,12 +613,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -107,11 +107,10 @@
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "mixed_dtype_utils.hpp"
#include "packed_scale.hpp"
#include "reorder_utils.hpp"
using namespace cute;
@ -144,8 +143,8 @@ using StrideB = cutlass::detail::TagToStrideB_t<LayoutB>;
// Define the CuTe layout for the reordered quantized tensor B
// LayoutAtomQuant places values that will be read by the same thread in contiguous locations in global memory.
// It specifies the reordering within a single warp's fragment
using LayoutAtomQuant = decltype(compute_memory_reordering_atom<MmaType>());
using LayoutB_Reordered = decltype(tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{}));
using LayoutAtomQuant = decltype(cutlass::compute_memory_reordering_atom<MmaType>());
using LayoutB_Reordered = decltype(cute::tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{}));
using ElementScale = MmaType;
using ElementZero = ElementScale; // only for verify
@ -349,10 +348,10 @@ void initialize(Options const& options) {
initialize_tensor(block_A, seed + 2022);
initialize_quant_tensor(block_B, seed + 2021);
unify_quant_encoding(block_B, block_B_modified);
cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size());
initialize_tensor(block_C, seed + 2020);
initialize_scale(block_scale, options);
initialize_packed_scale(block_scale, block_scale_packed);
cutlass::pack_scale_fp8(block_scale.get(), block_scale_packed.get(), block_scale.size());
initialize_zero(block_zero, options);
auto shape_scale_zero = cute::make_shape(options.n, scale_k, options.l);
@ -360,12 +359,13 @@ void initialize(Options const& options) {
stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l));
auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref);
dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g);
cudaStream_t stream = cudaStreamDefault;
cutlass::dequantize(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g, stream);
if (options.shuffle) {
// Repeat the reorder layout atom to tile the whole tensor shape
layout_B_reordered = tile_to_shape(LayoutAtomQuant{}, shape_B);
reorder_tensor(block_B_modified.get(), layout_B, layout_B_reordered);
layout_B_reordered = cute::tile_to_shape(LayoutAtomQuant{}, shape_B);
cutlass::reorder_tensor(block_B_modified.get(), layout_B, layout_B_reordered);
print("Quantized tensor layout: ");
print(layout_B_reordered);
@ -518,12 +518,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -100,6 +100,7 @@
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "mixed_dtype_utils.hpp"
@ -322,9 +323,10 @@ void initialize(MixedDtypeOptions const& options) {
auto shape_scale_zero = cute::make_shape(options.n, scale_k, options.l);
stride_S = cutlass::make_cute_packed_stride(StrideS{}, cute::make_shape(options.n, scale_k, options.l));
stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l));
auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref);
auto layout_scale_zero = cute::make_layout(shape_scale_zero, stride_S_ref);
dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g);
cudaStream_t stream = cudaStreamDefault;
cutlass::dequantize(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g, stream);
}
/// Populates a Gemm::Arguments structure from the given commandline options
@ -483,12 +485,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -7,14 +7,18 @@ When relying on `KernelScheduleAuto`, the main loop supporting different A and B
This first version only supports mixed type GEMMs using TMA.
## Performance
While the example offers a harness for straightforward benchmarking, this initial implementation isn't optimized for performance in the majority of scenarios. We expect this implementation to be performant for `{fp16, bf16} x {int8, int4}` and `{fp8} x {int4}` for problems that are compute bound. Additionally, we expect good performance for `fp16, bf16` or `fp32` scales and zero-points. For best performance, it is ideal to have the scales and zero-points be the same type.
While the example offers a harness for straightforward benchmarking, this initial implementation isn't optimized for performance in the majority of scenarios. We expect this implementation to be performant for `{fp16, bf16} x {int8, int4, int2}` and `{fp8} x {int4}` for problems that are compute bound. Additionally, we expect good performance for `fp16`, `bf16` or `fp32` scales and zero-points. For best performance, it is ideal to have the scales and zero-points be the same type as the MMA type.
The scale-only mode for `fp8 x int4` is significantly slower than the direct conversion mode. There is a lookup-table workaround targeting this mode, as shown in `55_hopper_int4_fp8_gemm.cu`. To use this feature, use `cutlass::Array<ElementScale, 8>` as the scale type in the collective builder. However, it requires modifications to the encoding of quantized weights and scale factors. Also, the scale-with-zero-point mode is not supported for now.
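As a minimal sketch of those encoding changes (reusing the helper and buffer names that appear in `55_hopper_int4_fp8_gemm.cu`, and assuming the buffers are already allocated and initialized), the host-side preparation looks roughly like this:

```c++
// Sketch only: unify the int4 weight encoding so positive and negative values
// share the same magnitude bits, then pack the fp8 scales into the 8-entry
// lookup-table format consumed by the scale-only fast path.
cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size());
cutlass::pack_scale_fp8(block_scale.get(), block_scale_packed.get(), block_scale.size());
```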
Additionally, it's recommended to reorder the narrow data type tensor so that the elements read into the register file by the same thread are contiguous in global and shared memory. The user can use the helper functions `compute_memory_reordering_atom` and `reorder_tensor` to achieve this (see the sketch below). See `55_hopper_int4_fp8_gemm.cu` and `55_hopper_int4_bf16_gemm.cu` for more details.
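A minimal sketch of that reordering flow, assuming `MmaType`, `shape_B`, `layout_B`, and `block_B` are defined as in the 55 examples:

```c++
// Sketch only: build the reordering atom for the MMA type, tile it across the
// full B extent, then reorder the quantized weights in global memory in place.
using LayoutAtomQuant   = decltype(cutlass::compute_memory_reordering_atom<MmaType>());
auto layout_B_reordered = cute::tile_to_shape(LayoutAtomQuant{}, shape_B);
cutlass::reorder_tensor(block_B.get(), layout_B, layout_B_reordered);
```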
We are currently optimizing the following cases:
1. Memory bound cases for all types
2. `fp8 x {int2, uint2}` case
## Limitations

View File

@ -60,8 +60,8 @@ struct MixedDtypeOptions {
float alpha = 1.0f;
float beta = 0.0f;
int iterations = 1000;
int warmup = 1000;
int iterations = 100;
int warmup = 10;
int mode = 1;
int m = 5120, n = 4096, k = 4096;
int g = 128;
@ -151,16 +151,16 @@ void mixed_dtype_profiling(
runtimes.reserve(options.iterations);
for (int iter = 0; iter < options.warmup + options.iterations; ++iter) {
cudaEventRecord(start);
CUTLASS_CHECK(gemm.run());
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventRecord(start);
CUTLASS_CHECK(gemm.run());
cudaEventRecord(stop);
cudaEventSynchronize(stop);
if (iter >= options.warmup) {
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
runtimes.push_back(milliseconds);
}
if (iter >= options.warmup) {
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
runtimes.push_back(milliseconds);
}
}
cudaEventDestroy(start);
@ -228,22 +228,18 @@ bool initialize_scale(
MixedDtypeOptions const& options,
uint64_t seed = 2023) {
if (options.mode == MixedDtypeGemmMode::ConvertOnly) {
// No scales, so just initialize with 1 so we can use the same kernel to dequantize the data.
std::vector<Element> stage(block.size(), Element(1.0f));
block.copy_from_host(stage.data());
}
else {
// If no scales, initialize with 1 so we can use the same kernel to dequantize the data
float scope_max = 1.0f, scope_min = 1.0f;
if (options.mode != MixedDtypeGemmMode::ConvertOnly) {
float elt_max_f = float(cutlass::platform::numeric_limits<Element>::max());
const float max_dequant_val = 4.f;
const float min_dequant_val = 0.5f;
float scope_max(max_dequant_val / elt_max_f);
float scope_min(min_dequant_val / elt_max_f);
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, Element(scope_max), Element(scope_min));
scope_max = max_dequant_val / elt_max_f;
scope_min = min_dequant_val / elt_max_f;
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, Element(scope_max), Element(scope_min));
return true;
}
@ -253,139 +249,14 @@ bool initialize_zero(
MixedDtypeOptions const& options,
uint64_t seed = 2023) {
// If no bias, initialize with 0 so we can use the same kernel to dequantize the data
float scope_max = 0.0f, scope_min = 0.0f;
if (options.mode == MixedDtypeGemmMode::ScaleWithZeroPoint) {
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, Element(2.0f), Element(-2.0f));
} else {
// No bias, so just initialize with 1 so we can use the same kernel to dequantize the data.
std::vector<Element> stage(block.size(), Element(0.0f));
block.copy_from_host(stage.data());
scope_max = 2.0f;
scope_min = -2.0f;
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, Element(scope_max), Element(scope_min));
return true;
}
/// Dequantize the weights for verification
template <class QuantizedElement,
class DequantizedElement,
class OperandLayout,
class ElementScale,
class ElementZero,
class ScaleBroadCastLayout,
class ThrLayout>
__global__ void dequantize_weight_kernel(DequantizedElement* dq_buffer,
QuantizedElement const* q_buffer,
OperandLayout const operand_layout,
ElementScale const* scale_buffer,
ElementZero const* zero_buffer,
ScaleBroadCastLayout const broadcasted_scale_layout,
ThrLayout thr_layout) {
using namespace cute;
// Represent the full tensors to gmem elements.
// These are expected to have shape [MN, K, L]
cute::Tensor gmem_op_dq = cute::make_tensor(cute::make_gmem_ptr(dq_buffer), operand_layout);
auto init_quantized_iterator = [&]() {
if constexpr (cute::sizeof_bits_v<QuantizedElement> >= 8) {
return cute::make_gmem_ptr(q_buffer);
} else {
return cute::subbyte_iterator<const QuantizedElement>(q_buffer);
}
};
cute::Tensor gmem_op_q = cute::make_tensor(init_quantized_iterator(), operand_layout);
// While the scales are expected to have shape [MN, G, L] but with a stride to allow broadcasting
// It is expected that K % G == 0
cute::Tensor gmem_scale_broadcasted = cute::make_tensor(make_gmem_ptr(scale_buffer), broadcasted_scale_layout);
cute::Tensor gmem_zero_broadcasted = cute::make_tensor(make_gmem_ptr(zero_buffer), broadcasted_scale_layout);
// Assign 1 thread per element in the thread block
auto blk_shape = make_shape(size<0>(thr_layout), _1{}, _1{}); //
auto blk_coord = make_coord(_, blockIdx.x, blockIdx.y); // (MN, K, L)
// Tile across the block
auto gOp_dq = cute::local_tile(gmem_op_dq, blk_shape, blk_coord);
auto gScale = cute::local_tile(gmem_scale_broadcasted, blk_shape, blk_coord);
auto gZero = cute::local_tile(gmem_zero_broadcasted, blk_shape, blk_coord);
auto gOp_q = cute::local_tile(gmem_op_q, blk_shape, blk_coord);
auto tOpDq_gOpDq = cute::local_partition(gOp_dq, thr_layout, threadIdx.x);
auto tScale_gScale = cute::local_partition(gScale, thr_layout, threadIdx.x);
auto tZero_gZero = cute::local_partition(gZero, thr_layout, threadIdx.x);
auto tOpQ_gOpQ = cute::local_partition(gOp_q, thr_layout, threadIdx.x);
// Make a fragment of registers to hold gmem loads
cute::Tensor rmem_op_q = cute::make_fragment_like(tOpQ_gOpQ(_, _, _, 0));
cute::Tensor rmem_scale = cute::make_fragment_like(tScale_gScale(_, _, _, 0));
cute::Tensor rmem_zero = cute::make_fragment_like(tZero_gZero(_, _, _, 0));
cute::Tensor rmem_op_dq = cute::make_fragment_like(tOpDq_gOpDq(_, _, _, 0));
cute::Tensor rmem_op_scaled = cute::make_fragment_like<ElementScale>(rmem_op_dq);
cute::Tensor rmem_zero_buf = cute::make_fragment_like<ElementScale>(rmem_zero);
cute::Tensor pred_id = cute::make_identity_tensor(shape(operand_layout));
auto pred_blk_tile = cute::local_tile(pred_id, blk_shape, blk_coord);
auto pred_thr_partition = cute::local_partition(pred_blk_tile, thr_layout, threadIdx.x);
const auto num_iters = cute::size<3>(tOpDq_gOpDq);
for (int ii = 0; ii < num_iters; ++ii) {
const auto thread_offset = cute::get<0>(pred_thr_partition(0, 0, 0, ii));
if (thread_offset < cute::size<0>(operand_layout)) {
cute::copy(tOpQ_gOpQ(_, _, _, ii), rmem_op_q);
cute::copy(tScale_gScale(_, _, _, ii), rmem_scale);
cute::copy(tZero_gZero(_, _, _, ii), rmem_zero);
cute::transform(rmem_op_q, rmem_op_scaled, [] (const QuantizedElement& elt) { return ElementScale(elt); } );
cute::transform(rmem_zero, rmem_zero_buf, [] (const ElementZero& elt) { return ElementScale(elt); } );
cute::transform(rmem_op_scaled, rmem_scale, rmem_op_scaled, multiplies{});
cute::transform(rmem_op_scaled, rmem_zero_buf, rmem_op_scaled, plus{});
cute::transform(rmem_op_scaled, rmem_op_dq, [] (const ElementScale& elt) { return DequantizedElement(elt); } );
cute::copy(rmem_op_dq, tOpDq_gOpDq(_, _, _, ii));
}
}
}
template <class QuantizedElement,
class DequantizedElement,
class OperandLayout,
class ElementScale,
class ElementZero,
class ScaleLayout>
void dequantize_weight(DequantizedElement* dq_buffer,
QuantizedElement const* q_buffer,
OperandLayout const operand_layout,
ElementScale const* scale_buffer,
ElementZero const* zero_buffer,
ScaleLayout const scale_layout,
int const group_size) {
using namespace cute;
constexpr int tpb = 128;
auto thr_layout = make_layout(make_shape(Int<tpb>{}));
const auto num_rows = get<0>(shape(operand_layout));
const auto gemm_k = get<1>(shape(operand_layout)); // [MN, K, L]
const auto batches = get<2>(shape(operand_layout)); // [MN, K, L]
const auto scale_k = get<1>(shape(scale_layout)); // [MN, Scale_K, L]
if (num_rows != size<0>(scale_layout)) {
std::cerr << "Invalid first dimension for scales. Must match first dim for weights."
<< " But got shapes " << shape(operand_layout) << " " << shape(scale_layout)
<< std::endl;
exit(-1);
}
const auto scale_stride0 = get<0>(stride(scale_layout));
const auto scale_stride1 = get<1>(stride(scale_layout));
const auto scale_stride2 = get<2>(stride(scale_layout));
auto scale_shape_bcast = make_shape(num_rows, make_shape(group_size, scale_k), batches);
auto scale_stride_bcast = make_stride(scale_stride0, make_stride(0, scale_stride1), scale_stride2);
auto scale_layout_bcast = make_layout(scale_shape_bcast, scale_stride_bcast);
const auto blocks_x = gemm_k;
const auto blocks_y = batches;
dim3 blocks(blocks_x, blocks_y, 1);
dequantize_weight_kernel<<<blocks, tpb>>>(dq_buffer, q_buffer, operand_layout, scale_buffer, zero_buffer, scale_layout_bcast, thr_layout);
CUDA_CHECK(cudaDeviceSynchronize());
}

View File

@ -1,210 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
#include <cstdint>
#include "cutlass/float8.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cute/tensor.hpp"
#include "cute/util/type_traits.hpp"
namespace cutlass
{
template<typename T>
class packed_scale_t {
public:
static_assert(cute::is_same_v<T, cutlass::int8_t> ||
cute::is_same_v<T, cutlass::uint8_t> ||
cute::is_same_v<T, cutlass::float_e4m3_t> ||
cute::is_same_v<T, cutlass::float_e5m2_t>,
"only 8 bit arithmetic types are supported.");
CUTLASS_HOST_DEVICE
explicit packed_scale_t(T val) {
if constexpr (!cute::is_unsigned_v<T>) {
// Only pack negative values. The positive values are generated in flight in the mainloop.
storage[0] = pack4(T(float(val) * -8.f), T(float(val) * -7.f), T(float(val) * -6.f), T(float(val) * -5.f));
storage[1] = pack4(T(float(val) * -4.f), T(float(val) * -3.f), T(float(val) * -2.f), -val);
}
else {
storage[0] = pack4(T(float(val) * 8.f), T(float(val) * 7.f), T(float(val) * 6.f), T(float(val) * 5.f));
storage[1] = pack4(T(float(val) * 4.f), T(float(val) * 3.f), T(float(val) * 2.f), val);
}
}
CUTLASS_HOST_DEVICE
packed_scale_t() = default;
CUTLASS_HOST_DEVICE
explicit operator float() const {
return float(get());
}
CUTLASS_HOST_DEVICE
bool operator==(packed_scale_t const& rhs) const {
return storage[0] == rhs.storage[0] && storage[1] == rhs.storage[1];
}
CUTLASS_HOST_DEVICE
bool operator!=(packed_scale_t const& rhs) const {
return !(*this == rhs);
}
CUTLASS_HOST_DEVICE
friend packed_scale_t operator+(packed_scale_t const& lhs, packed_scale_t const& rhs) {
return packed_scale_t(lhs.get() + rhs.get());
}
CUTLASS_HOST_DEVICE
friend packed_scale_t operator-(packed_scale_t const& lhs, packed_scale_t const& rhs) {
return packed_scale_t(lhs.get() - rhs.get());
}
CUTLASS_HOST_DEVICE
friend packed_scale_t operator*(packed_scale_t const& lhs, packed_scale_t const& rhs) {
return packed_scale_t(lhs.get() * rhs.get());
}
CUTLASS_HOST_DEVICE
friend packed_scale_t operator/(packed_scale_t const& lhs, packed_scale_t const& rhs) {
return packed_scale_t(lhs.get() / rhs.get());
}
private:
using Storage = uint32_t;
using Stage = uint8_t;
Storage storage[2] {};
CUTLASS_HOST_DEVICE
static Storage pack4(T c1, T c2, T c3, T c4) {
Storage result = 0;
result |= (static_cast<Storage>(reinterpret_cast<Stage const&>(c4)) << 24);
result |= (static_cast<Storage>(reinterpret_cast<Stage const&>(c3)) << 16);
result |= (static_cast<Storage>(reinterpret_cast<Stage const&>(c2)) << 8);
result |= static_cast<Storage>(reinterpret_cast<Stage const&>(c1));
return result;
}
CUTLASS_HOST_DEVICE
T get() const {
auto stage = static_cast<Stage>(storage[0] >> 8);
#if defined(__CUDA_ARCH__)
return reinterpret_cast<T const&>(stage);
#else
T tmp;
std::memcpy(&tmp, &stage, sizeof(Stage));
return tmp;
#endif
}
CUTLASS_HOST_DEVICE
T get(int idx) const {
Stage stage;
if (idx < 4) stage = static_cast<Stage>(storage[0] >> (8 * idx));
else stage = static_cast<Stage>(storage[1] >> (8 * idx - 32));
#if defined(__CUDA_ARCH__)
return reinterpret_cast<T const&>(stage);
#else
T tmp;
std::memcpy(&tmp, &stage, sizeof(Stage));
return tmp;
#endif
}
};
}
/// Helpers to initialize scale lookup table
// In the mainloop, PRMT selects 1 byte from only 8 bytes so the sign bit is handled in an extra PRMT.
// Here the encodings of positive values and negative values are unified (except for the sign bit).
// For instance, 1 becomes 0b0111, which is the same encoding as -1 (0b1111).
bool unify_quant_encoding(
cutlass::DeviceAllocation<cutlass::int4b_t> const& block_in,
cutlass::DeviceAllocation<cutlass::int4b_t>& block_out) {
using StorageType = cutlass::int4b_t::Storage;
if (block_in.size() != block_out.size()) {
std::cerr << "block_in and block_out must have same size.\n";
return false;
}
constexpr int pack = cute::sizeof_bits_v<StorageType> / 4;
std::vector<StorageType> data(block_in.size() / pack);
cutlass::device_memory::copy_to_host(data.data(), (StorageType*)block_in.get(), block_in.size() / pack);
for (auto&& d : data) {
StorageType out = 0;
StorageType mask = 0x0f;
for (int i = 0; i < pack; ++i) {
cutlass::int4b_t curr;
curr.storage = (d >> (i * 4)) & 0x0f;
switch (curr) {
case 1: curr.storage = StorageType(0b0111); break; // 2's complement
case 2: curr.storage = StorageType(0b0110); break; // 2's complement
case 3: curr.storage = StorageType(0b0101); break; // 2's complement
case 4: curr.storage = StorageType(0b0100); break; // 2's complement
case 5: curr.storage = StorageType(0b0011); break; // 2's complement
case 6: curr.storage = StorageType(0b0010); break; // 2's complement
case 7: curr.storage = StorageType(0b0001); break; // 2's complement
default: break;
}
out |= (curr.storage << (4 * i)) & mask;
mask <<= 4;
}
d = out;
}
cutlass::device_memory::copy_to_device((StorageType*)block_out.get(), data.data(), block_out.size() / pack);
return true;
}
template <class ElementScale>
bool initialize_packed_scale(
cutlass::DeviceAllocation<ElementScale> const& block_in,
cutlass::DeviceAllocation<cutlass::Array<ElementScale, 8> > & block_out) {
std::vector<ElementScale> data_in(block_in.size());
std::vector<cutlass::Array<ElementScale, 8> > data_out(block_in.size());
try {
block_in.copy_to_host(data_in.data());
} catch (cutlass::cuda_exception const& e)
{
std::cerr << "CUDA Error: " << cudaGetErrorString(e.cudaError()) << std::endl;
return false;
}
for (size_t i = 0; i < block_in.size(); ++i)
{
cutlass::packed_scale_t<ElementScale> tmp(data_in[i]);
data_out[i] = reinterpret_cast<cutlass::Array<ElementScale, 8> const&>(tmp);
// std::cout << data_in[i] << ":" << std::hex << static_cast<uint16_t>(data_in[i].storage) << ",\t" << -data_in[i] << ":" << std::hex << static_cast<uint16_t>((-data_in[i]).storage) << std::endl;
}
try {
block_out.copy_from_host(data_out.data());
} catch (cutlass::cuda_exception const& e)
{
std::cerr << "CUDA Error: " << cudaGetErrorString(e.cudaError()) << std::endl;
return false;
}
return true;
}

View File

@ -1,162 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#include "cute/layout.hpp"
#include "cute/tensor.hpp"
#include "cute/arch/mma_sm90.hpp"
#include "cutlass/util/device_memory.h"
// Given a type of MMA instruction, compute a memory reordering atom that places all values
// owned by each thread in contiguous memory locations. This improves smem load vectorization,
// particularly for mixed dtype GEMMs where a narrow type is loaded in the thread/value order
// of the wider type and may result in inefficient sub-bank (8-bit or 16-bit) accesses.
// In addition, we can reorder the values across several MMA instructions to get even wider
// vectorization (AtomLayout parameter) and permute the values within each instruction to get
// more optimal conversion instruction sequences (ValLayout parameter).
template<class ElementMma,
class AtomLayout = cute::Layout<cute::_1>,
class ValLayout = cute::Layout<cute::_1>>
constexpr auto compute_memory_reordering_atom(AtomLayout atom_layout = {}, ValLayout val_layout = {})
{
using namespace cute;
static_assert(is_static_v<ValLayout>, "ValLayout must be static");
static_assert(is_static_v<AtomLayout>, "AtomLayout must be static");
// 1. Choose an MMA atom to access TV layout and MN shape
// Note: parameters like GMMA Major, TileShape, ElementC don't affect TV layout of A, use arbitrary
using MmaAtom = decltype(SM90::GMMA::rs_op_selector<ElementMma, ElementMma, float, Shape<_64,_16,_32>>());
using MmaTraits = MMA_Traits<MmaAtom>;
auto mk_shape_mma = select<0,2>(typename MmaTraits::Shape_MNK{});
auto tv_layout_mma = typename MmaTraits::ALayout{};
static_assert(size<1>(tv_layout_mma) % size(val_layout) == 0, "Value layout must evenly divide the MMA value layout");
// 2. Create a single warp's TV layout from that of the whole MMA and invert to get (m,k -> thr,val)
// Note: this assumes A is partitioned between warps along M mode
auto tv_tiler_warp = make_shape(Int<32>{}, size<1>(tv_layout_mma));
auto mk_shape_warp = shape_div(mk_shape_mma, size(typename MmaTraits::ThrID{}) / Int<32>{});
auto tv_layout_mma_warp = make_layout_like(composition(tv_layout_mma, tv_tiler_warp));
auto mk_layout_mma_warp = right_inverse(tv_layout_mma_warp).with_shape(mk_shape_warp);
// 3. Repeat the warp layout NumAtoms times along K mode to get wider vectorization
auto mk_layout_mma_trgt = blocked_product(mk_layout_mma_warp, atom_layout);
// 4. Compose with a contiguous layout of values in each thread (required for smem vectorization)
auto val_to_offset = logical_product(val_layout, size<1>(tv_layout_mma) / size(val_layout) * size(atom_layout));
auto thr_to_offset = make_layout(size<0>(tv_layout_mma_warp));
auto tv_to_offset = select<1,0>(logical_product(val_to_offset, thr_to_offset));
auto layout_atom = composition(tv_to_offset, mk_layout_mma_trgt);
return layout_atom;
}
template<class TileShape, class EngineSrc, class LayoutSrc, class EngineDst, class LayoutDst, class TiledCopy>
__global__ void reorder_tensor_kernel(
cute::Tensor<EngineSrc, LayoutSrc> S,
cute::Tensor<EngineDst, LayoutDst> D,
TiledCopy tiled_copy)
{
using namespace cute;
using T = typename EngineDst::value_type;
Tensor gS = local_tile(S, TileShape{}, make_coord(blockIdx.x, _, blockIdx.z));
Tensor gD = local_tile(D, TileShape{}, make_coord(blockIdx.x, _, blockIdx.z));
auto thread_copy = tiled_copy.get_slice(threadIdx.x);
Tensor tS = thread_copy.partition_S(gS);
Tensor tD = thread_copy.partition_D(gD);
copy(tiled_copy, tS, tD);
}
template<class EngineSrc, class LayoutSrc, class EngineDst, class LayoutDst>
void reorder_tensor(
cute::Tensor<EngineSrc, LayoutSrc> S,
cute::Tensor<EngineDst, LayoutDst> D)
{
using namespace cute;
using T = typename EngineDst::value_type;
static_assert(is_same_v<remove_const_t<typename EngineSrc::value_type>, T>, "Type mismatch");
// Construct a value layout that assigns at least 8 bits of contiguous elements in destination tensor to a thread
// This avoids a race condition when writing out subbyte types (e.g. int4b_t).
auto has_major_mode = [](auto s) {
return any_of(s, [](auto a){ return is_constant<1, decltype(a)>{}; });
};
static_assert(has_major_mode(stride<0>(LayoutDst{})) ^ has_major_mode(stride<1>(LayoutDst{})),
"Could not find stride-1 mode in destination layout");
constexpr int N = shape_div(Int<8>{}, sizeof_bits<T>{});
auto val_layout = conditional_return<has_major_mode(stride<0>(LayoutDst{}))>(
make_layout(make_shape(Int<N>{}, Int<1>{}), GenColMajor{}),
make_layout(make_shape(Int<1>{}, Int<N>{}), GenRowMajor{}));
// Make a tiled copy with a simple row-major thread order and above layout
int constexpr NumThreads = 128;
auto const thr_layout = make_layout(make_shape(Int<1>{}, Int<NumThreads>{}));
auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, T>{}, thr_layout, val_layout);
// Assign a group of 16 rows to a threadblock; this matches the shuffle atom size for Hopper
using TileShape = Shape<_16>;
auto tiled_D = group_modes<3,rank_v<LayoutDst>>(tiled_divide(D, TileShape{}));
dim3 blocks{unsigned(size<1>(tiled_D)), 1u, unsigned(size<3>(tiled_D))};
reorder_tensor_kernel<TileShape><<<blocks, NumThreads>>>(S, D, tiled_copy);
CUDA_CHECK(cudaDeviceSynchronize());
}
// Out-of-place version (reorders src into dst)
template<class T, class LayoutSrc, class LayoutDst>
void reorder_tensor(
T const* src,
LayoutSrc const& layout_src,
T * dst,
LayoutDst const& layout_dst)
{
using namespace cute;
reorder_tensor(make_tensor(make_gmem_ptr<T>(src), layout_src),
make_tensor(make_gmem_ptr<T>(dst), layout_dst));
}
// In-place version
template<class T, class LayoutSrc, class LayoutDst>
void reorder_tensor(
T * data,
LayoutSrc const& layout_src,
LayoutDst const& layout_dst)
{
using namespace cute;
cutlass::DeviceAllocation<T> temp(size(layout_src));
reorder_tensor(data, layout_src, temp.get(), layout_dst);
cutlass::device_memory::copy_device_to_device(data, temp.get(), static_cast<size_t>(size(layout_src)));
}

View File

@ -513,12 +513,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -731,12 +731,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -768,16 +768,26 @@ int main(int argc, char const** argv) {
return -1;
}
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 4) ||
(props.major != 8 && props.minor != 9)) {
bool satisfied;
if (props.major < 10) {
// Pre-Blackwell
satisfied = (__CUDACC_VER_MAJOR__ > 12) || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4);
satisfied &= (props.major > 8) || (props.major == 8 && props.minor == 9);
}
else {
satisfied = (__CUDACC_VER_MAJOR__ > 12) || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 8);
}
if (!satisfied) {
//
// This example requires an NVIDIA Ada-architecture GPU.
// This example requires an NVIDIA GPU with compute capability 8.9 or greater.
//
std::cout
<< "CUTLASS's FP8 SM89 example requires a GPU of NVIDIA's Ada architecture "
<< "and CUDA toolkit version 12.4 or later.\n";
<< "CUTLASS's FP8 SM89 example requires an NVIDIA GPU with compute capability 8.9 or greater "
<< "and CUDA toolkit version 12.4 or later"
<< " (12.8 or later needed for SM100+)"
<< std::endl;
return 0;
}

View File

@ -37,8 +37,11 @@
Those assumptions are as follows:
1. Fusion is over the N dimension.
2. Top-K is either 2 or 4 elements, and the value is static (meaning two kernels have to be
compiled to support both.)
2. Top-K value is static (meaning multiple kernels have to be compiled to support
different values.)
* NOTE: Only the K=2 and K=4 cases are performance-optimized and enabled by default.
There is also a generic sort that supports all K values greater than 1, but it can seriously degrade the performance of the underlying kernel.
If necessary, users can simply remove the `K == 2 || K == 4` assertion under cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp, and the generic sort will automatically be used for all other values of K.
3. The GEMM tile shape along N is greater than or equal to the problem size
along N.
@ -501,12 +504,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -570,12 +570,15 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -469,12 +469,13 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//

View File

@ -133,7 +133,8 @@ using namespace cute;
using TP = _8;
static constexpr int TP_ = TP{};
#if (defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && (__CUDACC_VER_MAJOR__ >= 12) && (__CUDACC_VER_MINOR__ >= 4))
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && \
(__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4))
// Distributed GEMM tiling/sharding schedule
// Choices:
@ -344,7 +345,8 @@ struct Result {
};
#if (defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && (__CUDACC_VER_MAJOR__ >= 12) && (__CUDACC_VER_MINOR__ >= 4))
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) && \
(__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 4))
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation

View File

@ -0,0 +1,712 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Grouped scale Hopper FP8 GEMM example using CUTLASS 3.0 APIs for NVIDIA Hopper architecture
This example demonstrates a grouped-scale FP8 GEMM using the new CUTLASS 3.0
APIs on the NVIDIA Hopper architecture. New features that will be showcased in this example are as follows:
1. NVIDIA Hopper architecture introduces a new series of tensor core instructions (GMMA)
which are more efficient than the Ampere tensor core instructions.
2. NVIDIA Hopper architecture includes new Tensor Memory Accelerator (TMA) unit to transfer large
blocks of data efficiently between global memory and shared memory. TMA also supports asynchronous
copies between thread blocks in a cluster.
3. This example uses the Warp Specialized kernel design (see /media/docs/efficient_gemm.md for details).
4. This example shows all important fusions used by FP8 GEMM kernels, i.e., a grouped scale factor along M for the
A tensor, a blocked scale factor along K for the A tensor, a blocked scale factor for the B tensor, and the abs_max value of the D tensor.
5. A simple way to tune the CTA rasterization direction and swizzle pattern of Hopper kernels. Both the
CTA rasterization direction and swizzle pattern impact cross-CTA locality of accesses. By tuning we can
improve performance.
Examples:
$ ./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_deepgemm \
--m=4096 --iterations=1000
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/host/tensor_fill.h"
// #include "cutlass/util/reference/host/tensor_copy.h"
// #include "cutlass/util/reference/host/tensor_compare.h"
// #include "cutlass/util/reference/host/tensor_norm.h"
// Includes from examples directory
#include "helper.h"
// #include "reference/host/gemm_with_groupwise_scaling.h"
#include "deep_gemm/include/deep_gemm/fp8_gemm.cuh"
// using namespace cute;
using namespace deep_gemm;
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
// Command line options parsing
struct Options {
bool help;
int iterations;
int m, n, k, num_groups;
Options():
help(false),
m(4096),
n(4096),
k(4096),
num_groups(4),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
Options defaults;
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m, defaults.m);
cmd.get_cmd_line_argument("num_groups", num_groups, defaults.num_groups);
cmd.get_cmd_line_argument("iterations", iterations, defaults.iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "67_hopper_fp8_deepgemm\n\n"
<< " Hopper FP8 DeepGEMM kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the m size\n"
<< " --num_groups=<int> Sets the number of groups\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
return out;
}
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
constexpr int cdiv(int a, int b) {
return (a + b - 1) / b;
}
// #if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_tensor(
cutlass::TensorView<Element, Layout> view,
cutlass::Distribution::Kind dist_kind,
uint64_t seed) {
if (dist_kind == cutlass::Distribution::Uniform) {
double scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
int bits_output = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = 2;
scope_min = 0;
} else if (bits_input <= 8) {
scope_max = 2;
scope_min = -2;
} else if (bits_output == 16) {
scope_max = 5;
scope_min = -5;
} else {
scope_max = 8;
scope_min = -8;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
}
else if (dist_kind == cutlass::Distribution::AllZeros) {
cutlass::reference::host::TensorFill(view);
}
else if (dist_kind == cutlass::Distribution::Identity) {
cutlass::reference::host::TensorFillIdentity(view);
}
else if (dist_kind == cutlass::Distribution::Gaussian) {
cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5);
}
else if (dist_kind == cutlass::Distribution::Sequential) {
cutlass::reference::host::BlockFillSequential(view.data(), view.capacity());
}
else {
throw std::runtime_error("Not implemented.");
}
return true;
}
/// Helper to initialize a block of device data (scale_tensors)
template <typename Element, typename Layout>
bool initialize_scale_tensor(
cutlass::TensorView<Element, Layout> view,
cutlass::Distribution::Kind dist_kind,
uint64_t seed) {
if (dist_kind == cutlass::Distribution::Uniform) {
double scope_max, scope_min;
scope_min = -1;
scope_max = 1;
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
}
else if (dist_kind == cutlass::Distribution::AllZeros) {
cutlass::reference::host::TensorFill(view);
}
else if (dist_kind == cutlass::Distribution::Identity) {
cutlass::reference::host::TensorFillIdentity(view);
}
else if (dist_kind == cutlass::Distribution::Gaussian) {
cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5);
}
else if (dist_kind == cutlass::Distribution::Sequential) {
cutlass::reference::host::BlockFillSequential(view.data(), view.capacity());
}
else {
throw std::runtime_error("Not implemented.");
}
return true;
}
/// Todo: add reference check
bool verify(const Options &options) {
//
// Compute reference output
//
return true;
}
struct TestGemm {
using Element = cutlass::float_e4m3_t;
using ElementScale = float;
using ElementAcc = float;
using ElementOut = cutlass::bfloat16_t;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_lhs;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_rhs;
cutlass::HostTensor<ElementScale, cutlass::layout::ColumnMajor> Tensor_lhs_scale;
cutlass::HostTensor<ElementScale, cutlass::layout::RowMajor> Tensor_rhs_scale;
cutlass::HostTensor<ElementOut, cutlass::layout::RowMajor> Tensor_out;
/// Initialize operands to be used in the GEMM
void initialize(
const Options &options,
uint64_t seed = 2025) {
Tensor_lhs.resize({options.m, options.k}); //[m, k]
Tensor_rhs.resize({options.n, options.k}); //[n, k]
Tensor_lhs_scale.resize({options.m, cdiv(options.k, 128)}); // [m, cdiv(k, 128)] column major
Tensor_rhs_scale.resize({cdiv(options.n, 128), cdiv(options.k, 128)}); // [cdiv(n, 128), cdiv(k, 128)]
Tensor_out.resize({options.m, options.n}); // [m, n]
initialize_tensor(Tensor_lhs.host_view(), cutlass::Distribution::Uniform, seed + 1);
initialize_tensor(Tensor_rhs.host_view(), cutlass::Distribution::Uniform, seed + 2);
initialize_scale_tensor(Tensor_lhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 3);
initialize_scale_tensor(Tensor_rhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 4);
Tensor_lhs.sync_device();
Tensor_rhs.sync_device();
Tensor_lhs_scale.sync_device();
Tensor_rhs_scale.sync_device();
Tensor_out.sync_device();
}
void run(Options &options)
{
cudaDeviceProp props;
int current_device;
CUDA_CHECK(cudaGetDevice(&current_device));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device));
initialize(options);
cudaStream_t stream{nullptr};
constexpr auto N = 4096;
constexpr auto K = 4096;
constexpr auto BLOCK_M = 128;
constexpr auto BLOCK_N = 128;
constexpr auto kNumStages = 5;
constexpr auto kNumTMAMulticast = 2;
const int num_sms = 132; // for H100
const int best_smem_size = 199376;
// Make a templated GEMM
using GemmKernel = Gemm<N, K, BLOCK_M, BLOCK_N, 128, 1, kNumStages, kNumTMAMulticast, GemmType::Normal>;
int m = options.m;
// DeepGEMM requires __nv_fp8_e4m3 input and __nv_bfloat16 output
__nv_fp8_e4m3* lhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_lhs.device_data());
__nv_fp8_e4m3* rhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_rhs.device_data());
float* lhs_scales = Tensor_lhs_scale.device_data();
float* rhs_scales = Tensor_rhs_scale.device_data();
__nv_bfloat16* out = reinterpret_cast<__nv_bfloat16*>(Tensor_out.device_data());
// Launch kernel
auto tma_a_desc = GemmKernel::make_2d_tma_a_desc(lhs, m);
auto tma_b_desc = GemmKernel::make_2d_tma_b_desc(rhs);
auto tma_scales_a_desc = GemmKernel::make_2d_tma_scales_a_desc(lhs_scales, m);
auto tma_d_desc = GemmKernel::make_2d_tma_d_desc(out, m);
GemmKernel::run(out, rhs_scales, nullptr,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaDeviceSynchronize());
std::cout << "run Gemm...\n";
// TODO: reference check
Result result;
// result.passed = verify(options, ScaleMsPerTile, ScaleNsPerTile);
// std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
// if (!result.passed) {
// exit(-1);
// }
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
// initialize(options);
GemmKernel::run(out, rhs_scales, nullptr,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Tile shape (M, N, K): (128, 128, 128)" << std::endl;
std::cout << " ScaleGranularityM: 1 (ScaleMsPerTile: 128)" << std::endl;
std::cout << " ScaleGranularityN: 128 (ScaleNsPerTile: 1)" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
fflush(stdout);
}
}
};
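// Editorial sketch (not part of the original diff): the reference check above is still a TODO.
// For GemmType::Normal, a naive host-side reference follows directly from the tensor shapes in
// initialize(): A is [m, k] row-major with one scale per row per 128-wide K block, B is [n, k]
// row-major with one scale per 128x128 (N, K) block, and D = A * B^T in bf16 (cdiv above is the
// ceiling-division helper defined earlier in this file). Because scaling is linear, the
// per-element scaling below matches the kernel's per-K-block promotion only up to floating-point
// rounding, so compare against Tensor_out.sync_host() with a small relative tolerance rather than
// exact equality. The helper name and signature are hypothetical.
template <class TensorA, class TensorB, class ScaleA, class ScaleB, class TensorD>
void reference_gemm_fp8_groupwise(int m, int n, int k,
                                  TensorA const &A,       // [m, k] fp8 e4m3  (Tensor_lhs.host_view())
                                  TensorB const &B,       // [n, k] fp8 e4m3  (Tensor_rhs.host_view())
                                  ScaleA const &scale_A,  // [m, cdiv(k, 128)]            (Tensor_lhs_scale.host_view())
                                  ScaleB const &scale_B,  // [cdiv(n, 128), cdiv(k, 128)] (Tensor_rhs_scale.host_view())
                                  TensorD &D) {           // [m, n] bf16 reference output
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int kk = 0; kk < k; ++kk) {
        float sa = float(scale_A.at({i, kk / 128}));
        float sb = float(scale_B.at({j / 128, kk / 128}));
        acc += sa * sb * float(A.at({i, kk})) * float(B.at({j, kk}));
      }
      D.at({i, j}) = cutlass::bfloat16_t(acc);
    }
  }
}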
struct TestGroupedGemm_Contiguous {
using Element = cutlass::float_e4m3_t;
using ElementScale = float;
using ElementAcc = float;
using ElementOut = cutlass::bfloat16_t;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_lhs;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_rhs;
cutlass::HostTensor<ElementScale, cutlass::layout::ColumnMajor> Tensor_lhs_scale;
cutlass::HostTensor<ElementScale, cutlass::layout::RowMajor> Tensor_rhs_scale;
cutlass::HostTensor<ElementOut, cutlass::layout::RowMajor> Tensor_out;
cutlass::HostTensor<int, cutlass::layout::RowMajor> Tensor_grouped_layout;
/// Initialize operands to be used in the GEMM
void initialize(
const Options &options,
uint64_t seed = 2025) {
Tensor_lhs.resize({options.m, options.k}); //[m, k]
Tensor_rhs.resize({options.num_groups * options.n, options.k}); //[num_groups, n, k]
Tensor_lhs_scale.resize({options.m, cdiv(options.k, 128)}); // [m, cdiv(k, 128)] column major
Tensor_rhs_scale.resize({options.num_groups * cdiv(options.n, 128), cdiv(options.k, 128)}); // [num_groups, cdiv(n, 128), cdiv(k, 128)]
Tensor_out.resize({options.m, options.n}); // [m, n]
Tensor_grouped_layout.resize({1,options.m}); // [m,] group index of each row of A
std::vector<int> group_start {0, options.m/4, 2*options.m/4, 3*options.m/4, options.m}; // covers all m rows; hard-coded for num_groups == 4
for (int i = 0; i < options.m; ++i) {
for(int j = 0; j < options.num_groups; ++j) {
if(i >= group_start[j] && i < group_start[j+1]) {
Tensor_grouped_layout.host_data()[i] = j;
break;
}
}
}
initialize_tensor(Tensor_lhs.host_view(), cutlass::Distribution::Uniform, seed + 1);
initialize_tensor(Tensor_rhs.host_view(), cutlass::Distribution::Uniform, seed + 2);
initialize_scale_tensor(Tensor_lhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 3);
initialize_scale_tensor(Tensor_rhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 4);
Tensor_lhs.sync_device();
Tensor_rhs.sync_device();
Tensor_lhs_scale.sync_device();
Tensor_rhs_scale.sync_device();
Tensor_out.sync_device();
Tensor_grouped_layout.sync_device();
}
void run(Options &options)
{
cudaDeviceProp props;
int current_device;
CUDA_CHECK(cudaGetDevice(&current_device));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device));
initialize(options);
cudaStream_t stream{nullptr};
constexpr auto N = 4096;
constexpr auto K = 4096;
constexpr auto BLOCK_M = 128;
constexpr auto BLOCK_N = 128;
constexpr auto num_groups = 4;
constexpr auto kNumStages = 5;
constexpr auto kNumTMAMulticast = 2;
const int num_sms = 132; // for H100
const int best_smem_size = 199376;
// Make a templated GEMM
using GemmKernel = Gemm<N, K, BLOCK_M, BLOCK_N, 128, num_groups, kNumStages, kNumTMAMulticast, GemmType::GroupedContiguous>;
int m = options.m;
// DeepGEMM requires __nv_fp8_e4m3 input and __nv_bfloat16 output
__nv_fp8_e4m3* lhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_lhs.device_data());
__nv_fp8_e4m3* rhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_rhs.device_data());
float* lhs_scales = Tensor_lhs_scale.device_data();
float* rhs_scales = Tensor_rhs_scale.device_data();
__nv_bfloat16* out = reinterpret_cast<__nv_bfloat16*>(Tensor_out.device_data());
int* grouped_layout = Tensor_grouped_layout.device_data();
// Launch kernel
auto tma_a_desc = GemmKernel::make_2d_tma_a_desc(lhs, m);
auto tma_b_desc = GemmKernel::make_2d_tma_b_desc(rhs);
auto tma_scales_a_desc = GemmKernel::make_2d_tma_scales_a_desc(lhs_scales, m);
auto tma_d_desc = GemmKernel::make_2d_tma_d_desc(out, m);
GemmKernel::run(out, rhs_scales, grouped_layout,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaDeviceSynchronize());
std::cout << "run GroupedGemm Contiguous...\n";
// TODO: reference check
Result result;
// result.passed = verify(options, ScaleMsPerTile, ScaleNsPerTile);
// std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
// if (!result.passed) {
// exit(-1);
// }
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
// initialize(options);
GemmKernel::run(out, rhs_scales, grouped_layout,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Number of groups: " << options.num_groups << std::endl;
std::cout << " Tile shape (M, N, K): (128, 128, 128)" << std::endl;
std::cout << " ScaleGranularityM: 1 (ScaleMsPerTile: 128)" << std::endl;
std::cout << " ScaleGranularityN: 128 (ScaleNsPerTile: 1)" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
fflush(stdout);
}
}
};
struct TestGroupedGemm_Masked {
using Element = cutlass::float_e4m3_t;
using ElementScale = float;
using ElementAcc = float;
using ElementOut = cutlass::bfloat16_t;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_lhs;
cutlass::HostTensor<Element, cutlass::layout::RowMajor> Tensor_rhs;
cutlass::HostTensor<ElementScale, cutlass::layout::ColumnMajor> Tensor_lhs_scale;
cutlass::HostTensor<ElementScale, cutlass::layout::RowMajor> Tensor_rhs_scale;
cutlass::HostTensor<ElementOut, cutlass::layout::RowMajor> Tensor_out;
cutlass::HostTensor<int, cutlass::layout::RowMajor> Tensor_masked_m;
/// Initialize operands to be used in the GEMM
void initialize(
const Options &options,
uint64_t seed = 2025) {
int m_max = options.m;
Tensor_lhs.resize({options.num_groups * m_max, options.k}); //[num_groups, m, k]
Tensor_rhs.resize({options.num_groups * options.n, options.k}); //[num_groups, n, k]
Tensor_lhs_scale.resize({options.num_groups * m_max, cdiv(options.k, 128)}); // [num_groups, m, cdiv(k, 128)] column major
Tensor_rhs_scale.resize({options.num_groups * cdiv(options.n, 128), cdiv(options.k, 128)}); // [num_groups, cdiv(n, 128), cdiv(k, 128)]
Tensor_out.resize({options.num_groups * m_max, options.n}); // [num_groups, m, n]
Tensor_masked_m.resize({1,options.num_groups}); // [num_groups,]
std::vector<int> masked_m {options.m/4, 2*options.m/4, 3*options.m/4, options.m}; // valid rows per group, max(masked_m) <= m_max; hard-coded for num_groups == 4
for (int i = 0; i < options.num_groups; ++i) {
Tensor_masked_m.host_data()[i] = masked_m[i];
}
initialize_tensor(Tensor_lhs.host_view(), cutlass::Distribution::Uniform, seed + 1);
initialize_tensor(Tensor_rhs.host_view(), cutlass::Distribution::Uniform, seed + 2);
initialize_scale_tensor(Tensor_lhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 3);
initialize_scale_tensor(Tensor_rhs_scale.host_view(), cutlass::Distribution::Uniform, seed + 4);
Tensor_lhs.sync_device();
Tensor_rhs.sync_device();
Tensor_lhs_scale.sync_device();
Tensor_rhs_scale.sync_device();
Tensor_out.sync_device();
Tensor_masked_m.sync_device();
}
void run(Options &options)
{
cudaDeviceProp props;
int current_device;
CUDA_CHECK(cudaGetDevice(&current_device));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device));
initialize(options);
cudaStream_t stream{nullptr};
constexpr auto N = 4096;
constexpr auto K = 4096;
constexpr auto BLOCK_M = 128;
constexpr auto BLOCK_N = 128;
constexpr auto num_groups = 4;
constexpr auto kNumStages = 5;
constexpr auto kNumTMAMulticast = 2;
const int num_sms = 132; // for H100
const int best_smem_size = 199376;
// Make a templated GEMM
using GemmKernel = Gemm<N, K, BLOCK_M, BLOCK_N, 128, num_groups, kNumStages, kNumTMAMulticast, GemmType::GroupedMasked>;
int m = options.m;
// DeepGEMM requires __nv_fp8_e4m3 input and __nv_bfloat16 output
__nv_fp8_e4m3* lhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_lhs.device_data());
__nv_fp8_e4m3* rhs = reinterpret_cast<__nv_fp8_e4m3*>(Tensor_rhs.device_data());
float* lhs_scales = Tensor_lhs_scale.device_data();
float* rhs_scales = Tensor_rhs_scale.device_data();
__nv_bfloat16* out = reinterpret_cast<__nv_bfloat16*>(Tensor_out.device_data());
int* masked_m = Tensor_masked_m.device_data();
// Launch kernel
auto tma_a_desc = GemmKernel::make_2d_tma_a_desc(lhs, m);
auto tma_b_desc = GemmKernel::make_2d_tma_b_desc(rhs);
auto tma_scales_a_desc = GemmKernel::make_2d_tma_scales_a_desc(lhs_scales, m);
auto tma_d_desc = GemmKernel::make_2d_tma_d_desc(out, m);
GemmKernel::run(out, rhs_scales, masked_m,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaDeviceSynchronize());
std::cout << "run GroupedGemm Contiguous...\n";
// TODO: reference check
Result result;
// result.passed = verify(options, ScaleMsPerTile, ScaleNsPerTile);
// std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
// if (!result.passed) {
// exit(-1);
// }
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
// initialize(options);
GemmKernel::run(out, rhs_scales, masked_m,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, best_smem_size);
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: M " << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Number of groups: " << options.num_groups << std::endl;
std::cout << " Number of masked rows: " ;
for (int i = 0; i < options.num_groups; ++i) {
std::cout << Tensor_masked_m.host_data()[i] << " ";
}
std::cout << std::endl;
std::cout << " Tile shape (M, N, K): (128, 128, 128)" << std::endl;
std::cout << " ScaleGranularityM: 1 (ScaleMsPerTile: 128)" << std::endl;
std::cout << " ScaleGranularityN: 128 (ScaleNsPerTile: 1)" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
fflush(stdout);
}
}
};
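// Editorial note (not part of the original diff): the two grouped modes encode group membership
// differently. GroupedContiguous passes grouped_layout, one group index per row of the contiguous
// A tensor; GroupedMasked passes masked_m, the number of valid rows per group inside a fixed m_max
// slab. With the initializations above and a toy m = 8, num_groups = 4:
//   group_start    = {0, 2, 4, 6, 8}
//   grouped_layout = {0, 0, 1, 1, 2, 2, 3, 3}   // run lengths sum to m
//   masked_m       = {2, 4, 6, 8}               // each entry <= m_max (= options.m here)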
#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example
// and must have compute capability at least 90.
if (__CUDACC_VER_MAJOR__ < 12) {
std::cerr << "This example requires CUDA 12 or newer.\n";
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 9) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper architecture "
<< "(compute capability 90).\n";
return 0;
}
//
// Parse options
//
#if defined (CUTLASS_ARCH_MMA_SM90_SUPPORTED)
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
TestGemm testgemm{};
testgemm.run(options);
TestGroupedGemm_Contiguous testgroupedgemm_contiguous{};
testgroupedgemm_contiguous.run(options);
TestGroupedGemm_Masked testgroupedgemm_masked{};
testgroupedgemm_masked.run(options);
#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -123,7 +123,7 @@ using ArchTag = cutlass::arch::Sm90; // T
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster
-using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum;
+using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum<>;
using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative;
using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto;

View File

@ -0,0 +1,847 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Groupwise-scaled Hopper FP8 GEMM example using CUTLASS 3.0 APIs for the NVIDIA Hopper architecture
This example demonstrates a groupwise-scaled FP8 GEMM using the new CUTLASS 3.0
APIs on the NVIDIA Hopper architecture. New features showcased in this example are as follows:
1. The NVIDIA Hopper architecture introduces a new series of tensor core instructions (GMMA)
which are more efficient than the Ampere tensor core instructions.
2. The NVIDIA Hopper architecture includes a new Tensor Memory Accelerator (TMA) unit to transfer large
blocks of data efficiently between global memory and shared memory. TMA also supports asynchronous
copies between thread blocks in a cluster.
3. This example uses the warp-specialized kernel design (see /media/docs/efficient_gemm.md for details).
4. This example shows the important fusions used by FP8 GEMM kernels: a groupwise scale factor along M for
the A tensor, blockwise scale factors along K for the A and B tensors, and the abs_max value of the D tensor.
5. A simple way to tune the CTA rasterization direction and swizzle pattern of Hopper kernels. Both the
CTA rasterization direction and swizzle pattern impact cross-CTA locality of accesses, so tuning them
can improve performance.
Examples:
$ ./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling \
--m=2816 --n=3072 --k=16384 \
--save_aux=false --save_amax=false \
--device_scale=false --raster=h --swizzle=2
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/epilogue/dispatch_policy.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_norm.h"
// Includes from examples directory
#include "helper.h"
#include "hopper_fp8_commandline.hpp"
#include "reference/host/gemm_with_groupwise_scaling.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C matrix configuration
using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// D matrix configuration
using ElementD = ElementC;
using LayoutD = LayoutC;
constexpr int AlignmentD = AlignmentC;
// Auxiliary matrix configuration and other fusion types
using ElementAux = ElementC;
using LayoutAux = LayoutC;
using ElementAmax = float;
using ElementBias = float;
// Core kernel configurations
using ElementAccumulator = float; // Element type for internal accumulation
using ElementBlockScale = float; // Element type for blockscaling during accumulation
using ElementCompute = float; // Element type for epilogue computation
using TileShape_ = Shape<_128,_128,_128>; // Duplicated at namespace scope so the non-templated verify() below can use it.
// ScaleGranularity{M,N}: number of {rows in A}/{columns in B} that share the same scaling factor
// Given TileShape = Shape<_128,_128,_128>:
// ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D (the shape of the scaling factor)
// ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling
// ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling
// ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling
template <int ScaleGranularityM_, int ScaleGranularityN_>
struct GroupScaleConfig {
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks in a cluster
static constexpr int ScaleGranularityM = ScaleGranularityM_;
static constexpr int ScaleGranularityN = ScaleGranularityN_;
static constexpr int ScaleMsPerTile = size<0>(TileShape{}) / ScaleGranularityM;
static constexpr int ScaleNsPerTile = size<1>(TileShape{}) / ScaleGranularityN;
static_assert(size<0>(TileShape{}) == ScaleGranularityM * ScaleMsPerTile,
"FP8 scaling granularity must evenly divide tile shape along M.");
static_assert(size<1>(TileShape{}) == ScaleGranularityN * ScaleNsPerTile,
"FP8 scaling granularity must evenly divide tile shape along N.");
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8BlockScaledAccum<ScaleGranularityM_, ScaleGranularityN_>;
using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative;
using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto;
using FusionOperation = cutlass::epilogue::fusion::ScaledLinCombPerRowBiasEltActAmaxAux<
LayoutAux, cutlass::epilogue::thread::ReLU, ElementD, ElementCompute, ElementAux, ElementAmax, ElementBias, ElementC>;
};
using GroupScale1D1DConfig = GroupScaleConfig< 1, 1>;
using GroupScale1D2DConfig = GroupScaleConfig< 1, size<1>(TileShape_{})>;
using GroupScale2D1DConfig = GroupScaleConfig<size<0>(TileShape_{}), 1>;
using GroupScale2D2DConfig = GroupScaleConfig<size<0>(TileShape_{}), size<1>(TileShape_{})>;
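// Editorial sanity check (not part of the original example): with the 128x128x128 tile above,
// the four configurations expand to the following per-tile scale counts.
static_assert(GroupScale1D1DConfig::ScaleMsPerTile == 128 && GroupScale1D1DConfig::ScaleNsPerTile == 128, "1Dx1D: per-row A scales and per-column B scales");
static_assert(GroupScale1D2DConfig::ScaleMsPerTile == 128 && GroupScale1D2DConfig::ScaleNsPerTile ==   1, "1Dx2D: per-row A scales, per-block B scales");
static_assert(GroupScale2D1DConfig::ScaleMsPerTile ==   1 && GroupScale2D1DConfig::ScaleNsPerTile == 128, "2Dx1D: per-block A scales, per-column B scales");
static_assert(GroupScale2D2DConfig::ScaleMsPerTile ==   1 && GroupScale2D2DConfig::ScaleNsPerTile ==   1, "2Dx2D: one scale per 128x128 block on each side");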
template <typename ScheduleConfig>
struct GroupScaleGemm {
using ArchTag = typename ScheduleConfig::ArchTag;
using OperatorClass = typename ScheduleConfig::OperatorClass;
using TileShape = typename ScheduleConfig::TileShape;
using ClusterShape = typename ScheduleConfig::ClusterShape;
using KernelSchedule = typename ScheduleConfig::KernelSchedule;
using EpilogueSchedule = typename ScheduleConfig::EpilogueSchedule;
using EpilogueTileType = typename ScheduleConfig::EpilogueTileType;
using FusionOperation = typename ScheduleConfig::FusionOperation;
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
TileShape, ClusterShape,
EpilogueTileType,
ElementAccumulator, ElementCompute,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD,
EpilogueSchedule,
FusionOperation
>::CollectiveOp;
using CollectiveMainloopWithGroupWiseScaling = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))
>,
KernelSchedule
>::CollectiveOp;
using GemmKernelDefault = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveMainloopWithGroupWiseScaling,
CollectiveEpilogue
>;
using GemmKernelStreamK = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveMainloopWithGroupWiseScaling,
CollectiveEpilogue,
cutlass::gemm::StreamKScheduler
>;
using GemmDefault = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelDefault>;
using GemmStreamK = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelStreamK>;
};
using GroupScale1D1DGemm = GroupScaleGemm<GroupScale1D1DConfig>;
using GroupScale1D2DGemm = GroupScaleGemm<GroupScale1D2DConfig>;
using GroupScale2D1DGemm = GroupScaleGemm<GroupScale2D1DConfig>;
using GroupScale2D2DGemm = GroupScaleGemm<GroupScale2D2DConfig>;
// Extract information from Gemm kernel.
using EpilogueOutputOp = typename GroupScale1D1DGemm::GemmDefault::EpilogueOutputOp;
using ElementScalar = typename EpilogueOutputOp::ElementScalar;
using ElementAmax = typename EpilogueOutputOp::ElementAmax;
using ActivationFunctor = typename EpilogueOutputOp::ActivationFn;
using StrideA = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideA;
using StrideB = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideB;
using StrideC = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideC;
using StrideD = typename GroupScale1D1DGemm::GemmDefault::GemmKernel::StrideD;
using StrideAux = StrideD;
constexpr bool IsDFp8 =
cute::is_same_v<ElementD, cutlass::float_e4m3_t> or
cute::is_same_v<ElementD, cutlass::float_e5m2_t>;
constexpr bool IsAuxFp8 =
cute::is_same_v<ElementAux, cutlass::float_e4m3_t> or
cute::is_same_v<ElementAux, cutlass::float_e5m2_t>;
static_assert(cute::is_same_v<ElementAccumulator, ElementBlockScale>,
"ElementAccumulator and ElementBlockScale should be same datatype");
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
StrideAux stride_aux;
uint64_t seed;
cutlass::HostTensor<ElementA , LayoutA > tensor_A;
cutlass::HostTensor<ElementB , LayoutB > tensor_B;
cutlass::HostTensor<ElementC , LayoutC > tensor_C;
cutlass::HostTensor<ElementD , LayoutD > tensor_D;
uint32_t mma_promotion_interval;
cutlass::HostTensor<ElementBlockScale, LayoutA> blockscale_tensor_A;
cutlass::HostTensor<ElementBlockScale, LayoutB> blockscale_tensor_B;
cutlass::HostTensor<ElementD , LayoutD > tensor_ref_D;
cutlass::HostTensor<ElementAux, LayoutAux> tensor_aux;
cutlass::HostTensor<ElementAux, LayoutAux> tensor_ref_aux;
using LayoutScalar = cutlass::layout::PackedVectorLayout;
cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_alpha;
cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_beta;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_A;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_B;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_C;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_D;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_aux;
cutlass::HostTensor<ElementAmax , LayoutScalar> abs_max_D;
cutlass::HostTensor<ElementAmax , LayoutScalar> reference_abs_max_D;
cutlass::HostTensor<ElementAmax , LayoutScalar> abs_max_aux;
cutlass::HostTensor<ElementAmax , LayoutScalar> reference_abs_max_aux;
#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Params::RasterOrderOptions;
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_tensor(
cutlass::TensorView<Element, Layout> view,
cutlass::Distribution::Kind dist_kind,
uint64_t seed) {
if (dist_kind == cutlass::Distribution::Uniform) {
double scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
int bits_output = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = 2;
scope_min = 0;
} else if (bits_input <= 8) {
scope_max = 2;
scope_min = -2;
} else if (bits_output == 16) {
scope_max = 5;
scope_min = -5;
} else {
scope_max = 8;
scope_min = -8;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
}
else if (dist_kind == cutlass::Distribution::AllZeros) {
cutlass::reference::host::TensorFill(view);
}
else if (dist_kind == cutlass::Distribution::Identity) {
cutlass::reference::host::TensorFillIdentity(view);
}
else if (dist_kind == cutlass::Distribution::Gaussian) {
cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5);
}
else if (dist_kind == cutlass::Distribution::Sequential) {
cutlass::reference::host::BlockFillSequential(view.data(), view.capacity());
}
else {
throw std::runtime_error("Not implemented.");
}
return true;
}
/// Helper to initialize a block of device data (scale_tensors)
template <typename Element, typename Layout>
bool initialize_scale_tensor(
cutlass::TensorView<Element, Layout> view,
cutlass::Distribution::Kind dist_kind,
uint64_t seed) {
if (dist_kind == cutlass::Distribution::Uniform) {
double scope_max, scope_min;
scope_min = -1;
scope_max = 1;
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
}
else if (dist_kind == cutlass::Distribution::AllZeros) {
cutlass::reference::host::TensorFill(view);
}
else if (dist_kind == cutlass::Distribution::Identity) {
cutlass::reference::host::TensorFillIdentity(view);
}
else if (dist_kind == cutlass::Distribution::Gaussian) {
cutlass::reference::host::TensorFillRandomGaussian(view, seed, 0, 0.5);
}
else if (dist_kind == cutlass::Distribution::Sequential) {
cutlass::reference::host::BlockFillSequential(view.data(), view.capacity());
}
else {
throw std::runtime_error("Not implemented.");
}
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
template <typename GroupScaleConfig>
void initialize(const Options<RasterOrderOptions> &options) {
using TileShape = typename GroupScaleConfig::TileShape;
const int ScaleMsPerTile = GroupScaleConfig::ScaleMsPerTile;
const int ScaleNsPerTile = GroupScaleConfig::ScaleNsPerTile;
// Find Group Scaling tensor shapes based on `ScaleGranularityM`, problem shape, and TileShape
auto gemm_problem_shape = cute::make_shape(options.m, options.n, options.k);
auto blockscale_shape = shape(get<1>(cute::zipped_divide(cute::make_layout(gemm_problem_shape), TileShape{})));
auto groupscale_m = cute::get<0>(blockscale_shape) * ScaleMsPerTile; // Pad along M in the scale tensor of A to prevent illegal memory access.
auto groupscale_n = cute::get<1>(blockscale_shape) * ScaleNsPerTile; // Pad along N in the scale tensor of B to prevent illegal memory access.
auto blockscale_k = cute::get<2>(blockscale_shape);
stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l));
stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.m, options.n, options.l));
stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l));
stride_aux = stride_D;
auto a_coord = cutlass::make_Coord(options.m * options.l, options.k);
auto c_coord = cutlass::make_Coord(options.m * options.l, options.n);
auto b_coord = cutlass::make_Coord(options.k, options.n * options.l);
auto groupscale_a_coord = cutlass::make_Coord(groupscale_m * options.l, blockscale_k);
auto groupscale_b_coord = cutlass::make_Coord(groupscale_n * options.l, blockscale_k);
tensor_A.resize(a_coord);
tensor_B.resize(b_coord);
blockscale_tensor_A.resize(groupscale_a_coord);
blockscale_tensor_B.resize(groupscale_b_coord);
tensor_C.resize(c_coord);
tensor_D.resize(c_coord);
tensor_ref_D.resize(c_coord);
cutlass::Distribution::Kind dist_A = cutlass::Distribution::Uniform;
cutlass::Distribution::Kind dist_B = cutlass::Distribution::Uniform;
cutlass::Distribution::Kind dist_C = cutlass::Distribution::Uniform;
cutlass::Distribution::Kind dist_scaleA = cutlass::Distribution::Uniform;
cutlass::Distribution::Kind dist_scaleB = cutlass::Distribution::Uniform;
initialize_tensor(tensor_A.host_view(), dist_A, seed + 2022);
initialize_tensor(tensor_B.host_view(), dist_B, seed + 2023);
initialize_tensor(tensor_C.host_view(), dist_C, seed + 2024);
initialize_scale_tensor(blockscale_tensor_A.host_view(), dist_scaleA, seed + 2025);
initialize_scale_tensor(blockscale_tensor_B.host_view(), dist_scaleB, seed + 2026);
#if 0 // Dump blockscaled tensors
std::cout << "blockscale_tensor_A: " << groupscale_a_coord << std::endl;
std::cout << blockscale_tensor_A.host_view() << "\n";
std::cout << "blockscale_tensor_B: " << groupscale_b_coord << std::endl;
std::cout << blockscale_tensor_B.host_view() << "\n";
#endif
// Copy host-side operands and scale tensors to device memory.
tensor_A.sync_device();
tensor_B.sync_device();
tensor_C.sync_device();
tensor_D.sync_device();
blockscale_tensor_A.sync_device();
blockscale_tensor_B.sync_device();
mma_promotion_interval = 4;
if (options.save_aux) {
tensor_aux.resize(c_coord);
tensor_aux.sync_device();
tensor_ref_aux.resize(c_coord);
}
if (options.device_scale) {
scalar_alpha.resize(cutlass::make_Coord(1));
scalar_beta.resize(cutlass::make_Coord(1));
scale_A.resize(cutlass::make_Coord(1));
scale_B.resize(cutlass::make_Coord(1));
scale_C.resize(cutlass::make_Coord(1));
scale_D.resize(cutlass::make_Coord(1));
scale_aux.resize(cutlass::make_Coord(1));
cutlass::reference::host::TensorFill(scalar_alpha.host_view(), options.alpha);
cutlass::reference::host::TensorFill(scalar_beta.host_view(), options.beta);
cutlass::reference::host::TensorFill(scale_A.host_view(), options.scale_a);
cutlass::reference::host::TensorFill(scale_B.host_view(), options.scale_b);
cutlass::reference::host::TensorFill(scale_C.host_view(), options.scale_c);
cutlass::reference::host::TensorFill(scale_D.host_view(), options.scale_d);
cutlass::reference::host::TensorFill(scale_aux.host_view(), options.scale_aux);
scalar_alpha.sync_device();
scalar_beta.sync_device();
scale_A.sync_device();
scale_B.sync_device();
scale_C.sync_device();
scale_D.sync_device();
scale_aux.sync_device();
}
if (IsDFp8 && options.save_amax) {
abs_max_D.resize(cutlass::make_Coord(1));
initialize_tensor(abs_max_D.host_view(), cutlass::Distribution::AllZeros, 0);
abs_max_D.sync_device();
reference_abs_max_D.resize(cutlass::make_Coord(1));
initialize_tensor(reference_abs_max_D.host_view(), cutlass::Distribution::AllZeros, 0);
}
if (IsAuxFp8 && options.save_aux && options.save_amax) {
abs_max_aux.resize(cutlass::make_Coord(1));
initialize_tensor(abs_max_aux.host_view(), cutlass::Distribution::AllZeros, 0);
abs_max_aux.sync_device();
reference_abs_max_aux.resize(cutlass::make_Coord(1));
initialize_tensor(reference_abs_max_aux.host_view(), cutlass::Distribution::AllZeros, 0);
}
}
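// Editorial worked example (not part of the original file), using the sample command line from the
// header (--m=2816 --n=3072 --k=16384), assuming the default l = 1 and TileShape = 128x128x128:
//   blockscale_shape   = (2816/128, 3072/128, 16384/128) = (22, 24, 128)
//   For the 1Dx2D config (ScaleMsPerTile = 128, ScaleNsPerTile = 1):
//     groupscale_a_coord = (22 * 128, 128) = (2816, 128)   // one scale per row of A per 128-wide K block
//     groupscale_b_coord = (24 *   1, 128) = (  24, 128)   // one scale per 128x128 block of B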
/// Populates a Gemm::Arguments structure from the given commandline options
template<typename GemmArguments>
GemmArguments args_from_options(const Options<RasterOrderOptions> &options)
{
GemmArguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, options.l},
{tensor_A.device_data(),
stride_A,
tensor_B.device_data(),
stride_B,
mma_promotion_interval,
blockscale_tensor_A.device_data(),
blockscale_tensor_B.device_data()
},
{
{}, // epilogue.thread
tensor_C.device_data(), stride_C,
tensor_D.device_data(), stride_D
}
};
auto &fusion_args = arguments.epilogue.thread;
fusion_args.alpha = options.alpha;
fusion_args.beta = options.beta;
fusion_args.alpha_ptr = scalar_alpha.device_data();
fusion_args.beta_ptr = scalar_beta.device_data();
fusion_args.scale_a = options.scale_a;
fusion_args.scale_b = options.scale_b;
fusion_args.scale_c = options.scale_c;
fusion_args.scale_a_ptr = scale_A.device_data();
fusion_args.scale_b_ptr = scale_B.device_data();
fusion_args.scale_c_ptr = scale_C.device_data();
// ignored if tensor types are not fp8
fusion_args.scale_d = options.scale_d;
fusion_args.scale_aux = options.scale_aux;
fusion_args.scale_d_ptr = scale_D.device_data();
fusion_args.scale_aux_ptr = scale_aux.device_data();
// leaving/setting these as nullptr disables the fusion at runtime
fusion_args.bias_ptr = nullptr;
if (options.save_aux) {
fusion_args.aux_ptr = tensor_aux.device_data();
fusion_args.dAux = stride_aux;
if (options.save_amax) {
fusion_args.amax_aux_ptr = abs_max_aux.device_data();
}
}
if (options.save_amax) {
fusion_args.amax_D_ptr = abs_max_D.device_data();
}
arguments.scheduler.raster_order = options.raster;
// The tile scheduler supports swizzle sizes of 1, 2, 4, or 8; the requested value is clamped to at most 8 and mapped to one of these powers of two
arguments.scheduler.max_swizzle_size = options.swizzle;
return arguments;
}
/// NOTE: verify() is kept non-templated (templating it ran into compiler issues), so the per-tile scale counts are passed in explicitly.
bool verify(const Options<RasterOrderOptions> &options, const int ScaleMsPerTile, const int ScaleNsPerTile) {
//
// Compute reference output
//
// Group scaling tensor shapes are based on `ScaleGranularityM`/`ScaleGranularityN`, the CTA tile (TileShape), and the GEMM problem shape
auto gemm_problem_shape = cute::make_shape(options.m, options.n, options.k);
auto blockscale_shape = shape(get<1>(cute::zipped_divide(cute::make_layout(gemm_problem_shape), TileShape_{})));
auto blockscale_m = cute::get<0>(blockscale_shape);
auto blockscale_n = cute::get<1>(blockscale_shape);
auto blockscale_k = cute::get<2>(blockscale_shape);
// Create instantiation for device reference gemm kernel
auto A = cute::make_tensor(tensor_A.host_data(),
cute::make_layout(
cute::make_shape(options.m, options.k, options.l),
stride_A
)
);
auto B = cute::make_tensor(tensor_B.host_data(),
cute::make_layout(
cute::make_shape(options.n, options.k, options.l),
stride_B
)
);
auto C = cute::make_tensor(tensor_C.host_data(),
cute::make_layout(
cute::make_shape(options.m, options.n, options.l),
stride_C
)
);
auto D = cute::make_tensor(tensor_ref_D.host_data(),
cute::make_layout(
cute::make_shape(options.m, options.n, options.l),
stride_D
)
);
auto Aux = cute::make_tensor(tensor_ref_aux.host_data(),
cute::make_layout(
cute::make_shape(options.m, options.n, options.l),
stride_aux
)
);
auto blockscale_A = cute::make_tensor(blockscale_tensor_A.host_data(),
cute::make_layout(
cute::make_shape(blockscale_m, ScaleMsPerTile, blockscale_k, options.l),
cute::make_stride(blockscale_k * ScaleMsPerTile, 1, ScaleMsPerTile, blockscale_m * blockscale_k * ScaleMsPerTile)
)
);
auto blockscale_B = cute::make_tensor(blockscale_tensor_B.host_data(),
cute::make_layout(
cute::make_shape(blockscale_n, ScaleNsPerTile, blockscale_k, options.l),
cute::make_stride(blockscale_k * ScaleNsPerTile, 1, ScaleNsPerTile, blockscale_n * blockscale_k * ScaleNsPerTile)
)
);
using unused_t = decltype(D);
cutlass::reference::host::GettMainloopParams<ElementAccumulator,
decltype(A), decltype(B),
decltype(blockscale_A), decltype(blockscale_B),
TileShape_> mainloop_params{
A, B, // Operand Tensors
blockscale_A, blockscale_B // Groupwise scaling Tensors
};
cutlass::reference::host::GettEpilogueParams<
ElementScalar,
ElementScalar,
ElementAccumulator,
ElementCompute,
decltype(C),
decltype(D),
unused_t, // bias
decltype(Aux),
unused_t, // valpha
unused_t, // vbeta
ActivationFunctor
> epilogue_params;
epilogue_params.C = C;
epilogue_params.D = D;
epilogue_params.Aux = Aux;
epilogue_params.alpha = options.alpha;
epilogue_params.beta = options.beta;
epilogue_params.scale_a = options.scale_a;
epilogue_params.scale_b = options.scale_b;
epilogue_params.scale_c = options.scale_c;
epilogue_params.scale_d = options.scale_d;
epilogue_params.scale_aux = options.scale_aux;
epilogue_params.abs_max_D = reference_abs_max_D.host_data();
epilogue_params.abs_max_Aux = reference_abs_max_aux.host_data();
// get reference result
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
// compare_reference
tensor_D.sync_host();
bool passed = cutlass::reference::host::TensorEquals(tensor_ref_D.host_view(), tensor_D.host_view());
if (false) {
std::cout << "tensor_ref_D.host_view() {" << std::endl
<< tensor_ref_D.host_view() << std::endl
<< "}" << std::endl;
std::cout << "tensor_D.host_view() {" << std::endl
<< tensor_D.host_view() << std::endl
<< "}" << std::endl;
}
if (IsDFp8 && options.save_amax) {
abs_max_D.sync_host();
passed &= abs_max_D.at(cutlass::make_Coord(0)) == reference_abs_max_D.at(cutlass::make_Coord(0));
}
if (options.save_aux) {
tensor_aux.sync_host();
passed &= cutlass::reference::host::TensorEquals(tensor_ref_aux.host_view(), tensor_aux.host_view());
if (IsAuxFp8 && options.save_amax) {
abs_max_aux.sync_host();
passed &= abs_max_aux.at(cutlass::make_Coord(0)) == reference_abs_max_aux.at(cutlass::make_Coord(0));
}
}
return passed;
}
/// Execute a given example GEMM computation
template <typename GroupScaleConfig, typename Gemm>
int run(Options<RasterOrderOptions> &options)
{
using TileShape = typename GroupScaleConfig::TileShape;
const int ScaleGranularityM = GroupScaleConfig::ScaleGranularityM;
const int ScaleGranularityN = GroupScaleConfig::ScaleGranularityN;
const int ScaleMsPerTile = GroupScaleConfig::ScaleMsPerTile;
const int ScaleNsPerTile = GroupScaleConfig::ScaleNsPerTile;
initialize<GroupScaleConfig>(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<typename Gemm::Arguments>(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options, ScaleMsPerTile, ScaleNsPerTile);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
// if (!result.passed) {
// exit(-1);
// }
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::string raster = "Heuristic";
if (options.raster == RasterOrderOptions::AlongN) {
raster = "Along N";
}
else if (options.raster == RasterOrderOptions::AlongM) {
raster = "Along M";
}
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl;
std::cout << " Tile shape (M, N, K): " << size<0>(TileShape{}) << ", " << size<1>(TileShape{}) << ", " << size<2>(TileShape{}) << std::endl;
std::cout << " ScaleGranularityM: " << ScaleGranularityM << " (ScaleMsPerTile: " << ScaleMsPerTile << ")" << std::endl;
std::cout << " ScaleGranularityN: " << ScaleGranularityN << " (ScaleNsPerTile: " << ScaleNsPerTile << ")" << std::endl;
std::cout << " Rasterization: " << raster << " with a maximum CTA swizzle of " << options.swizzle << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
fflush(stdout);
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example
// and must have compute capability at least 90.
if (__CUDACC_VER_MAJOR__ < 12) {
std::cerr << "This example requires CUDA 12 or newer.\n";
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 9) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper architecture "
<< "(compute capability 90).\n";
return 0;
}
//
// Parse options
//
Options<RasterOrderOptions> options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
std::cout << "Basic split-K GEMM kernel" << std::endl;
run<GroupScale1D1DConfig, GroupScale1D1DGemm::GemmDefault>(options);
std::cout << std::endl;
run<GroupScale1D2DConfig, GroupScale1D2DGemm::GemmDefault>(options);
std::cout << std::endl;
run<GroupScale2D1DConfig, GroupScale2D1DGemm::GemmDefault>(options);
std::cout << std::endl;
run<GroupScale2D2DConfig, GroupScale2D2DGemm::GemmDefault>(options);
std::cout << std::endl;
std::cout << std::endl;
std::cout << "StreamK GEMM kernel" << std::endl;
run<GroupScale1D1DConfig, GroupScale1D1DGemm::GemmStreamK>(options);
std::cout << std::endl;
run<GroupScale1D2DConfig, GroupScale1D2DGemm::GemmStreamK>(options);
std::cout << std::endl;
run<GroupScale2D1DConfig, GroupScale2D1DGemm::GemmStreamK>(options);
std::cout << std::endl;
run<GroupScale2D2DConfig, GroupScale2D2DGemm::GemmStreamK>(options);
std::cout << std::endl;
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -30,3 +30,13 @@ cutlass_example_add_executable(
67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling
67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu
)
cutlass_example_add_executable(
67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling
67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu
)
cutlass_example_add_executable(
67_hopper_fp8_deepgemm
67_hopper_fp8_deepgemm.cu
)

View File

@ -0,0 +1,13 @@
import torch
from . import jit
from .jit_kernels import (
gemm_fp8_fp8_bf16_nt,
m_grouped_gemm_fp8_fp8_bf16_nt_contiguous,
m_grouped_gemm_fp8_fp8_bf16_nt_masked,
cell_div,
set_num_sms, get_num_sms,
get_col_major_tma_aligned_tensor,
get_m_alignment_for_contiguous_layout
)
from .utils import bench, bench_kineto, calc_diff

View File

@ -0,0 +1,444 @@
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunknown-attributes"
#pragma once
#include <cutlass/arch/barrier.h>
#include <cutlass/arch/reg_reconfig.h>
#include <cute/arch/cluster_sm90.hpp>
#include <cute/arch/copy_sm90_desc.hpp>
#include <cute/arch/copy_sm90_tma.hpp>
#include "mma_utils.cuh"
#include "scheduler.cuh"
#include "tma_utils.cuh"
#include "utils.cuh"
namespace deep_gemm {
enum class Layout {
RowMajor,
ColMajor
};
template <uint32_t kNumTMAThreads, uint32_t kNumMathThreadsPerGroup>
__device__ __host__ constexpr int get_num_threads_per_sm(int block_m) {
DG_STATIC_ASSERT(kNumMathThreadsPerGroup == 128, "Only support 128 threads per math group");
return (block_m == 64 ? 1 : 2) * kNumMathThreadsPerGroup + kNumTMAThreads;
}
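// Editorial note: the thread count depends on kNumTMAThreads and kNumMathThreadsPerGroup, which are
// fixed at the kernel launch site (not shown in this diff). Assuming the usual DeepGEMM values of
// kNumTMAThreads = 128 and kNumMathThreadsPerGroup = 128, BLOCK_M = 128 gives 2 * 128 + 128 = 384
// threads per CTA (one persistent CTA per SM), and BLOCK_M = 64 gives 1 * 128 + 128 = 256.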
template <uint32_t SHAPE_N, uint32_t SHAPE_K,
uint32_t BLOCK_M, uint32_t BLOCK_N, uint32_t BLOCK_K,
uint32_t kNumGroups, uint32_t kNumStages,
uint32_t kNumTMAThreads, uint32_t kNumMathThreadsPerGroup,
uint32_t kNumTMAMulticast,
GemmType kGemmType>
__global__ void __launch_bounds__(get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M), 1)
fp8_gemm_kernel(__nv_bfloat16* gmem_d, float* scales_b, int* grouped_layout,
uint32_t shape_m,
const __grid_constant__ CUtensorMap tensor_map_a,
const __grid_constant__ CUtensorMap tensor_map_b,
const __grid_constant__ CUtensorMap tensor_map_scales_a,
const __grid_constant__ CUtensorMap tensor_map_d) {
#if (defined(__CUDA_ARCH__) and (__CUDA_ARCH__ >= 900)) or defined(__CLION_IDE__)
// Scaling checks
DG_STATIC_ASSERT(BLOCK_K == 128, "Only support per-128-channel FP8 scaling");
DG_STATIC_ASSERT(cell_div(BLOCK_N, BLOCK_K) == 1, "Too many B scales in a single block");
// Types
using WGMMA = typename FP8MMASelector<BLOCK_N>::type;
using Barrier = cutlass::arch::ClusterTransactionBarrier;
// Shared memory
static constexpr uint32_t SMEM_D_SIZE = BLOCK_M * BLOCK_N * sizeof(__nv_bfloat16);
static constexpr uint32_t SMEM_A_SIZE_PER_STAGE = BLOCK_M * BLOCK_K * sizeof(__nv_fp8_e4m3);
static constexpr uint32_t SMEM_B_SIZE_PER_STAGE = BLOCK_N * BLOCK_K * sizeof(__nv_fp8_e4m3);
static constexpr uint32_t SMEM_SCALES_A_SIZE_PER_STAGE = BLOCK_M * sizeof(float);
static constexpr uint32_t SHAPE_K_SCALES = cell_div(SHAPE_K, BLOCK_K);
static constexpr int kMustUseUniformedScaleB = (BLOCK_K % BLOCK_N == 0);
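// When BLOCK_K % BLOCK_N == 0, the BLOCK_N columns handled by a CTA always fall inside a single
// 128-wide B-scale block, so only SHAPE_K_SCALES B scales are staged in shared memory; otherwise a
// second set is staged as well (see the smem_scales_b sizing and num_scales_b below).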
// Configs
constexpr uint32_t kFullKOfAllStages = kNumStages * BLOCK_K;
constexpr uint32_t kNumThreads = get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M);
constexpr uint32_t kNumMathThreads = kNumThreads - kNumTMAThreads;
constexpr uint32_t kNumIterations = cell_div(SHAPE_K, kFullKOfAllStages);
const uint32_t warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0);
const uint32_t lane_idx = get_lane_id();
// Prefetch TMA descriptors at very beginning
if (threadIdx.x == kNumMathThreads) {
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_a));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_b));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_scales_a));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_d));
}
__syncwarp();
// Align to 1024 bytes for swizzle-128B
extern __shared__ __align__(1024) uint8_t smem_buffer[];
DG_STATIC_ASSERT(SMEM_D_SIZE % 1024 == 0, "Shared memory of A/B must be aligned to 1024 bytes");
// Data on shared memory
auto smem_d = reinterpret_cast<__nv_bfloat16*>(smem_buffer);
__nv_fp8_e4m3* smem_a[kNumStages];
__nv_fp8_e4m3* smem_b[kNumStages];
float* smem_scales_a[kNumStages];
float* smem_scales_b;
// TMA Barrier for both divisible and non-divisible cases
Barrier* full_barriers[kNumStages];
Barrier* empty_barriers[kNumStages];
// Fill shared memory pointers
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
smem_a[i] = reinterpret_cast<__nv_fp8_e4m3*>(smem_buffer + SMEM_D_SIZE + i * SMEM_A_SIZE_PER_STAGE);
smem_b[i] = reinterpret_cast<__nv_fp8_e4m3*>(smem_buffer + SMEM_D_SIZE + kNumStages * SMEM_A_SIZE_PER_STAGE + i * SMEM_B_SIZE_PER_STAGE);
smem_scales_a[i] = reinterpret_cast<float*>(smem_buffer + SMEM_D_SIZE + kNumStages * (SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE) + i * SMEM_SCALES_A_SIZE_PER_STAGE);
}
smem_scales_b = reinterpret_cast<float*>(smem_buffer + SMEM_D_SIZE + kNumStages * (SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE + SMEM_SCALES_A_SIZE_PER_STAGE));
// Fill barriers
DG_STATIC_ASSERT(sizeof(Barrier) % sizeof(float) == 0, "Misaligned barriers");
DG_STATIC_ASSERT(not kMustUseUniformedScaleB or SHAPE_K_SCALES % (sizeof(Barrier) / sizeof(float)) == 0, "Misaligned barriers");
auto barrier_start_ptr = reinterpret_cast<Barrier*>(smem_scales_b + SHAPE_K_SCALES * (kMustUseUniformedScaleB ? 1 : 2));
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
full_barriers[i] = barrier_start_ptr + i;
empty_barriers[i] = barrier_start_ptr + kNumStages + i;
}
// Initialize barriers
DG_STATIC_ASSERT(kNumTMAMulticast <= 32, "Too many TMA multicasts");
if (threadIdx.x == kNumMathThreads) {
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
full_barriers[i]->init(1);
empty_barriers[i]->init(kNumTMAMulticast * kNumMathThreads / 32);
}
// Make initialized barrier visible in async proxy
cutlass::arch::fence_view_async_shared();
(kNumTMAMulticast > 1) ? cutlass::arch::fence_barrier_init() : void();
}
// Synchronize all threads to make barrier visible in normal memory model
(kNumTMAMulticast > 1) ? cute::cluster_sync() : __syncthreads();
// For pipeline unrolling
struct DivisibleK {};
struct NotDivisibleK {};
auto launch_k_iterations = [](const auto& func) {
if constexpr (SHAPE_K % kFullKOfAllStages == 0) {
for (int k_iter = 0; k_iter < kNumIterations; ++ k_iter)
func(k_iter, DivisibleK{});
} else {
for (int k_iter = 0; k_iter < kNumIterations - 1; ++ k_iter)
func(k_iter, DivisibleK{});
func(kNumIterations - 1, NotDivisibleK{});
}
};
// Register reconfigurations
constexpr int kNumTMARegisters = 40;
constexpr int kNumMathRegisters = 232;
// Block scheduler
uint32_t m_block_idx, n_block_idx;
auto scheduler = Scheduler<kGemmType, SHAPE_N, BLOCK_M, BLOCK_N, kNumGroups, kNumTMAMulticast>(shape_m, grouped_layout);
if (threadIdx.x >= kNumMathThreads) {
// TMA warp-group for loading data
cutlass::arch::warpgroup_reg_dealloc<kNumTMARegisters>();
// NOTES: only one thread (or warp) will be used
if (threadIdx.x == kNumMathThreads) {
// Persistently schedule over blocks
while (scheduler.get_next_block(m_block_idx, n_block_idx)) {
launch_k_iterations([&](int k_iter, auto type) {
constexpr bool kHasDivisibleStages = std::is_same_v<decltype(type), DivisibleK>;
constexpr int kNumInnerStages = kHasDivisibleStages ? kNumStages : (SHAPE_K % kFullKOfAllStages) / BLOCK_K;
DG_STATIC_ASSERT(kNumInnerStages != 0, "Invalid number of inner stages");
#pragma unroll
for (uint32_t s = 0; s < kNumInnerStages; ++ s) {
// Wait consumer release
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter + 1) & 1);
// Issue TMA A with broadcasting
auto& full_barrier = *full_barriers[s];
int k_idx = k_iter * kFullKOfAllStages + s * BLOCK_K;
tma_copy<kNumTMAMulticast>(&tensor_map_a, reinterpret_cast<uint64_t*>(&full_barrier),
smem_a[s], k_idx, scheduler.get_global_idx(shape_m, BLOCK_M, m_block_idx));
tma_copy<kNumTMAMulticast>(&tensor_map_scales_a, reinterpret_cast<uint64_t*>(&full_barrier),
smem_scales_a[s], m_block_idx * BLOCK_M,
scheduler.get_global_idx(SHAPE_K_SCALES, 1, k_idx / BLOCK_K));
// Issue TMA B without broadcasting
tma_copy(&tensor_map_b, reinterpret_cast<uint64_t*>(&full_barrier),
smem_b[s], k_idx, scheduler.get_global_idx<false>(SHAPE_N, BLOCK_N, n_block_idx, m_block_idx));
full_barrier.arrive_and_expect_tx(SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE + SMEM_SCALES_A_SIZE_PER_STAGE);
}
// Wait unaligned cases
#pragma unroll
for (uint32_t s = kNumInnerStages; s < kNumStages; ++ s) {
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter + 1) & 1);
full_barriers[s]->arrive();
}
});
}
// To safely tear down the distributed shared barriers, we need another round of empty waits
if constexpr (kNumTMAMulticast > 1) {
#pragma unroll
for (uint32_t s = 0; s < kNumStages; ++ s)
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + 1) & 1);
}
}
} else {
// Math warp-groups for WGMMA
cutlass::arch::warpgroup_reg_alloc<kNumMathRegisters>();
// NOTES: use `__shfl_sync` to encourage NVCC to use unified registers
const auto math_wg_idx = __shfl_sync(0xffffffff, threadIdx.x / kNumMathThreadsPerGroup, 0);
const auto r_0 = warp_idx * 16 + lane_idx / 4, r_1 = r_0 + 8;
// Persistently schedule over blocks
while (scheduler.get_next_block(m_block_idx, n_block_idx)) {
// Decide the number of scales B to load
DG_STATIC_ASSERT(SHAPE_N % 8 == 0, "Invalid shape N");
uint32_t num_former_iters = BLOCK_N / 8, num_full_iters = num_former_iters;
if constexpr (not kMustUseUniformedScaleB) {
num_former_iters = min(BLOCK_N, BLOCK_K - n_block_idx * BLOCK_N % BLOCK_K) / 8;
num_full_iters = min(SHAPE_N - n_block_idx * BLOCK_N, BLOCK_N) / 8;
}
uint32_t num_scales_b = SHAPE_K_SCALES * (num_former_iters >= num_full_iters ? 1 : 2);
// Load B scales with math warp-groups
// NOTES: except the first warp, we want to overlap loading B scales with TMA stores between tasks
if (threadIdx.x >= 32) {
auto num_previous_lines = scheduler.get_global_idx<false>(cell_div(SHAPE_N, BLOCK_K), 0, 0, m_block_idx);
auto local_scales_b = scales_b + (num_previous_lines + ((n_block_idx * BLOCK_N) / BLOCK_K)) * SHAPE_K_SCALES;
#pragma unroll
for (uint32_t i = threadIdx.x - 32; i < num_scales_b; i += kNumMathThreads - 32)
st_shared(smem_scales_b + i, __ldg(local_scales_b + i));
}
cutlass::arch::NamedBarrier(kNumMathThreads).sync();
// Accumulation for WGMMA or CUDA promotion
float accum[WGMMA::kNumAccum], final_accum[WGMMA::kNumAccum] = {0};
// Empty barrier arrival
auto empty_barrier_arrive = [&](int s) {
if constexpr (kNumTMAMulticast == 1) {
lane_idx == 0 ? empty_barriers[s]->arrive() : void();
} else {
lane_idx < kNumTMAMulticast ? empty_barriers[s]->arrive(lane_idx) : void();
}
};
// Launch MMAs
launch_k_iterations([&](int k_iter, auto type) {
constexpr bool kHasDivisibleStages = std::is_same_v<decltype(type), DivisibleK>;
constexpr int kNumInnerStages = kHasDivisibleStages ? kNumStages : (SHAPE_K % kFullKOfAllStages) / BLOCK_K;
DG_STATIC_ASSERT(kNumInnerStages != 0, "Invalid number of inner stages");
#pragma unroll
for (int s = 0; s < kNumInnerStages; ++ s) {
// Read B scales
float scale_b_0 = ld_shared(smem_scales_b + k_iter * kNumStages + s), scale_b_1;
// NOTES: even if some blocks do not need to read the second row, we still load one to stay aligned with the other blocks
if constexpr (not kMustUseUniformedScaleB)
scale_b_1 = ld_shared(smem_scales_b + k_iter * kNumStages + s + SHAPE_K_SCALES);
// Wait TMA arrivals
full_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter) & 1);
// Read A scales
// NOTES: all shared memory read must be prior to `warpgroup_arrive` to avoid next scheduled block polluting the results
auto scale_a_0 = ld_shared(smem_scales_a[s] + r_0), scale_a_1 = ld_shared(smem_scales_a[s] + r_1);
// Commit WGMMA instructions
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum; ++ i)
warpgroup_fence_operand(accum[i]);
warpgroup_arrive();
#pragma unroll
for (int k = 0; k < BLOCK_K / WGMMA::K; ++ k) {
auto desc_a = make_smem_desc(smem_a[s] + math_wg_idx * WGMMA::M * BLOCK_K + k * WGMMA::K, 1);
auto desc_b = make_smem_desc(smem_b[s] + k * WGMMA::K, 1);
WGMMA::wgmma(desc_a, desc_b, accum, k);
}
warpgroup_commit_batch();
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum; ++ i)
warpgroup_fence_operand(accum[i]);
warpgroup_wait<0>();
// Notify barrier arrival
empty_barrier_arrive(s);
// Promote with scales
float scale_0_0 = scale_a_0 * scale_b_0, scale_1_0 = scale_a_1 * scale_b_0;
float scale_0_1, scale_1_1;
if constexpr (not kMustUseUniformedScaleB)
scale_0_1 = scale_a_0 * scale_b_1, scale_1_1 = scale_a_1 * scale_b_1;
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum / 4; ++ i) {
bool predicate = kMustUseUniformedScaleB or i < num_former_iters;
final_accum[i * 4 + 0] += (predicate ? scale_0_0 : scale_0_1) * accum[i * 4 + 0];
final_accum[i * 4 + 1] += (predicate ? scale_0_0 : scale_0_1) * accum[i * 4 + 1];
final_accum[i * 4 + 2] += (predicate ? scale_1_0 : scale_1_1) * accum[i * 4 + 2];
final_accum[i * 4 + 3] += (predicate ? scale_1_0 : scale_1_1) * accum[i * 4 + 3];
}
}
// Wait for the unaligned cases
#pragma unroll
for (uint32_t s = kNumInnerStages; s < kNumStages; ++ s) {
full_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter) & 1);
empty_barrier_arrive(s);
}
});
// Write back to shared memory using STSM
DG_STATIC_ASSERT(WGMMA::kNumAccum % 4 == 0, "Invalid STSM x2 vectorization");
#pragma unroll
for (auto i = 0; i < WGMMA::kNumAccum / 8; ++ i) {
SM90_U32x4_STSM_N<nv_bfloat162>::copy(
__float22bfloat162_rn({final_accum[i * 8 + 0], final_accum[i * 8 + 1]}),
__float22bfloat162_rn({final_accum[i * 8 + 2], final_accum[i * 8 + 3]}),
__float22bfloat162_rn({final_accum[i * 8 + 4], final_accum[i * 8 + 5]}),
__float22bfloat162_rn({final_accum[i * 8 + 6], final_accum[i * 8 + 7]}),
smem_d + (warp_idx * 16 + lane_idx % 16) * BLOCK_N + i * 16 + 8 * (lane_idx / 16)
);
}
if constexpr (WGMMA::kNumAccum % 8 != 0) {
SM90_U32x2_STSM_N<nv_bfloat162>::copy(
__float22bfloat162_rn({final_accum[WGMMA::kNumAccum / 8 * 8 + 0], final_accum[WGMMA::kNumAccum / 8 * 8 + 1]}),
__float22bfloat162_rn({final_accum[WGMMA::kNumAccum / 8 * 8 + 2], final_accum[WGMMA::kNumAccum / 8 * 8 + 3]}),
smem_d + (warp_idx * 16 + lane_idx % 16) * BLOCK_N + WGMMA::kNumAccum / 8 * 16
);
}
cute::tma_store_fence();
cutlass::arch::NamedBarrier(kNumMathThreads).sync();
// Use TMA store to write back to global memory
if (threadIdx.x == 0) {
cute::SM90_TMA_STORE_2D::copy(&tensor_map_d, smem_d, n_block_idx * BLOCK_N,
scheduler.get_global_idx(shape_m, BLOCK_M, m_block_idx));
cute::tma_store_arrive();
cute::tma_store_wait<0>();
}
__syncwarp();
}
}
#else
if (blockIdx.x == 0 and threadIdx.x == 0)
DG_DEVICE_ASSERT(false and "This kernel only supports sm_90a");
#endif
}
template <uint32_t SHAPE_N, uint32_t SHAPE_K,
uint32_t BLOCK_M, uint32_t BLOCK_N, uint32_t BLOCK_K,
uint32_t kNumGroups, uint32_t kNumStages,
uint32_t kNumTMAMulticast,
GemmType kGemmType>
class Gemm {
private:
using Barrier = cuda::barrier<cuda::thread_scope_block>;
public:
Gemm() = default;
static void run(__nv_bfloat16* gmem_d, float* scales_b, int* grouped_layout,
uint32_t shape_m,
const CUtensorMap& tma_a_desc,
const CUtensorMap& tma_b_desc,
const CUtensorMap& tma_scales_a_desc,
const CUtensorMap& tma_d_desc,
cudaStream_t stream,
int num_sms, uint32_t smem_size) {
// NOTES: we must use 4 warps to do TMA, because `setmaxnreg.aligned` requires 4 warps
constexpr uint32_t kNumTMAThreads = 128;
constexpr uint32_t kNumMathThreadsPerGroup = 128;
auto kernel = fp8_gemm_kernel<SHAPE_N, SHAPE_K, BLOCK_M, BLOCK_N, BLOCK_K,
kNumGroups, kNumStages, kNumTMAThreads, kNumMathThreadsPerGroup,
kNumTMAMulticast, kGemmType>;
DG_HOST_ASSERT(cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess);
// Cluster launch
cudaLaunchConfig_t config;
config.gridDim = num_sms;
config.blockDim = get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M);
config.dynamicSmemBytes = smem_size;
config.stream = stream;
// Clusters for TMA multicast
// NOTES: a cluster size `>= 4` causes performance degradation
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim = {kNumTMAMulticast, 1, 1};
config.attrs = &attr;
config.numAttrs = 1;
// Launch
auto status = cudaLaunchKernelEx(&config, kernel,
gmem_d, scales_b, grouped_layout,
shape_m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc);
DG_HOST_ASSERT(status == cudaSuccess);
}
template <typename T>
static CUtensorMap make_2d_tma_a_desc(T* global_address, uint32_t shape_m) {
return make_2d_tma_desc(global_address, Layout::RowMajor,
shape_m * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), SHAPE_K, BLOCK_M, BLOCK_K);
}
template <typename T>
static CUtensorMap make_2d_tma_b_desc(T* global_address) {
return make_2d_tma_desc(global_address, Layout::ColMajor,
SHAPE_K, SHAPE_N * (kGemmType != GemmType::Normal ? kNumGroups : 1), BLOCK_K, BLOCK_N);
}
template <typename T>
static CUtensorMap make_2d_tma_d_desc(T* global_address, uint32_t shape_m) {
return make_2d_tma_desc(global_address, Layout::RowMajor,
shape_m * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), SHAPE_N, BLOCK_M, BLOCK_N,
CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE);
}
template <typename T>
static CUtensorMap make_2d_tma_scales_a_desc(T* global_address, uint32_t shape_m) {
// Make TMA aligned to 16 bytes
constexpr uint32_t kAlignment = 16 / sizeof(T);
shape_m = cell_div(shape_m, kAlignment) * kAlignment;
return make_2d_tma_desc(global_address, Layout::ColMajor,
shape_m, cell_div(SHAPE_K, BLOCK_K) * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), BLOCK_M, 1,
CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE);
}
template <typename T>
static CUtensorMap make_2d_tma_desc(
T* global_address, Layout layout,
uint32_t gmem_rows, uint32_t gmem_cols,
uint32_t smem_rows, uint32_t smem_cols,
CUtensorMapSwizzle swizzle_type = CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_128B) {
if (layout == Layout::RowMajor) {
uint64_t gmem_dim[2] = {gmem_cols, gmem_rows};
uint32_t smem_dim[2] = {smem_cols, smem_rows};
return make_2d_tma_copy_desc(global_address, gmem_dim, gmem_cols * sizeof(T), smem_dim, swizzle_type);
} else {
uint64_t gmem_dim[2] = {gmem_rows, gmem_cols};
uint32_t smem_dim[2] = {smem_rows, smem_cols};
return make_2d_tma_copy_desc(global_address, gmem_dim, gmem_rows * sizeof(T), smem_dim, swizzle_type);
}
}
};
}; // namespace deep_gemm
#pragma clang diagnostic pop
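
To make the host-side contract of the `Gemm` wrapper above easier to follow, here is a minimal, hypothetical launch sketch. The tile shapes, buffer names and the `num_sms`/`smem_size` values are illustrative assumptions; in this project they are chosen by the JIT layer rather than hard-coded.

```
// Illustrative sketch only (assumes fp8_gemm.cuh is included); shapes and names are assumptions.
using ExampleGemm = deep_gemm::Gemm<
    /*SHAPE_N=*/4096, /*SHAPE_K=*/7168,
    /*BLOCK_M=*/128, /*BLOCK_N=*/128, /*BLOCK_K=*/128,
    /*kNumGroups=*/1, /*kNumStages=*/5,
    /*kNumTMAMulticast=*/1, deep_gemm::GemmType::Normal>;

void launch_example(__nv_bfloat16* d, __nv_fp8_e4m3* a, __nv_fp8_e4m3* b,
                    float* scales_a, float* scales_b,
                    uint32_t shape_m, int num_sms, uint32_t smem_size,
                    cudaStream_t stream) {
    // Build the four TMA descriptors from raw device pointers with the static helpers above.
    auto tma_a = ExampleGemm::make_2d_tma_a_desc(a, shape_m);
    auto tma_b = ExampleGemm::make_2d_tma_b_desc(b);
    auto tma_scales_a = ExampleGemm::make_2d_tma_scales_a_desc(scales_a, shape_m);
    auto tma_d = ExampleGemm::make_2d_tma_d_desc(d, shape_m);
    // Persistent cluster launch over `num_sms` thread blocks.
    ExampleGemm::run(d, scales_b, /*grouped_layout=*/nullptr, shape_m,
                     tma_a, tma_b, tma_scales_a, tma_d,
                     stream, num_sms, smem_size);
}
```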


@@ -0,0 +1,885 @@
#pragma once
#include <cuda.h>
#include "utils.cuh"
namespace deep_gemm {
struct SM90_64x16x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %10, 0;\n"
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7},"
" %8,"
" %9,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 16;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x24x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %14, 0;\n"
"wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11},"
" %12,"
" %13,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 24;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x32x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16,"
" %17,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 32;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x40x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %22, 0;\n"
"wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19},"
" %20,"
" %21,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 40;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x48x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %26, 0;\n"
"wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23},"
" %24,"
" %25,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 48;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x56x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %30, 0;\n"
"wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27}, "
" %28,"
" %29,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 56;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x64x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %34, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31}, "
" %32,"
" %33,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 64;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x72x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %38, 0;\n"
"wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35}, "
" %36,"
" %37,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 72;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x80x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %42, 0;\n"
"wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39}, "
" %40,"
" %41,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 80;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x88x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %46, 0;\n"
"wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43}, "
" %44,"
" %45,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 88;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x96x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %50, 0;\n"
"wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47}, "
" %48,"
" %49,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 96;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x104x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %54, 0;\n"
"wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51}, "
" %52,"
" %53,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 104;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x112x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %58, 0;\n"
"wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55}, "
" %56,"
" %57,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 112;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x120x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %62, 0;\n"
"wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59}, "
" %60,"
" %61,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 120;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x128x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59, float& d60, float& d61, float& d62, float& d63,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %66, 0;\n"
"wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59, %60, %61, %62, %63}, "
" %64,"
" %65,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59], d[60], d[61], d[62], d[63],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 128;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x192x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59, float& d60, float& d61, float& d62, float& d63,
float& d64, float& d65, float& d66, float& d67, float& d68, float& d69, float& d70, float& d71,
float& d72, float& d73, float& d74, float& d75, float& d76, float& d77, float& d78, float& d79,
float& d80, float& d81, float& d82, float& d83, float& d84, float& d85, float& d86, float& d87,
float& d88, float& d89, float& d90, float& d91, float& d92, float& d93, float& d94, float& d95,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %98, 0;\n"
"wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59, %60, %61, %62, %63, "
" %64, %65, %66, %67, %68, %69, %70, %71, "
" %72, %73, %74, %75, %76, %77, %78, %79, "
" %80, %81, %82, %83, %84, %85, %86, %87, "
" %88, %89, %90, %91, %92, %93, %94, %95}, "
" %96,"
" %97,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
"+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
"+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
"+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
"+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59], d[60], d[61], d[62], d[63],
d[64], d[65], d[66], d[67], d[68], d[69], d[70], d[71],
d[72], d[73], d[74], d[75], d[76], d[77], d[78], d[79],
d[80], d[81], d[82], d[83], d[84], d[85], d[86], d[87],
d[88], d[89], d[90], d[91], d[92], d[93], d[94], d[95],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 192;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
template <typename dtype_t>
struct SM90_U32x2_STSM_N {
__device__ __forceinline__ static void
copy(dtype_t src_0, dtype_t src_1, void* smem_dst) {
const uint32_t src[2] = {*reinterpret_cast<uint32_t*>(&src_0), *reinterpret_cast<uint32_t*>(&src_1)};
asm volatile("stmatrix.sync.aligned.x2.m8n8.shared.b16 [%0], {%1, %2};\n"
:: "l"(smem_dst), "r"(src[0]), "r"(src[1]));
}
};
template <typename dtype_t>
struct SM90_U32x4_STSM_N {
__device__ __forceinline__ static void
copy(dtype_t src_0, dtype_t src_1, dtype_t src_2, dtype_t src_3, void* smem_dst) {
const uint32_t src[4] = {*reinterpret_cast<uint32_t*>(&src_0), *reinterpret_cast<uint32_t*>(&src_1),
*reinterpret_cast<uint32_t*>(&src_2), *reinterpret_cast<uint32_t*>(&src_3)};
asm volatile("stmatrix.sync.aligned.x4.m8n8.shared.b16 [%0], {%1, %2, %3, %4};\n"
:: "l"(smem_dst), "r"(src[0]), "r"(src[1]), "r"(src[2]), "r"(src[3]));
}
};
__device__ void warpgroup_arrive() {
asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");
}
__device__ void warpgroup_commit_batch() {
asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory");
}
__device__ void warpgroup_fence_operand(float& reg) {
asm volatile("" : "+f"(reg) :: "memory");
}
__forceinline__ __device__ uint32_t get_lane_id() {
uint32_t lane_id;
asm("mov.u32 %0, %laneid;" : "=r"(lane_id));
return lane_id;
}
__device__ __forceinline__ uint32_t ld_shared(const uint32_t* __restrict__ ptr) {
uint32_t ret;
asm volatile("ld.shared.u32 %0, [%1];" : "=r"(ret) : "l"(ptr));
return ret;
}
__device__ __forceinline__ int4 ld_shared(const int4* __restrict__ ptr) {
int4 ret;
asm volatile("ld.shared.v4.s32 {%0, %1, %2, %3}, [%4];" : "=r"(ret.x), "=r"(ret.y), "=r"(ret.z), "=r"(ret.w) : "l"(ptr));
return ret;
}
__device__ __forceinline__ float ld_shared(const float* __restrict__ ptr) {
float ret;
asm volatile("ld.shared.f32 %0, [%1];" : "=f"(ret) : "l"(ptr));
return ret;
}
__device__ __forceinline__ void st_shared(const float* ptr, float val) {
asm volatile("st.shared.f32 [%0], %1;" :: "l"(ptr), "f"(val));
}
__device__ __forceinline__ void st_shared(const uint32_t* ptr, uint32_t val) {
asm volatile("st.shared.u32 [%0], %1;" :: "l"(ptr), "r"(val));
}
template <int N>
__device__ void warpgroup_wait() {
DG_STATIC_ASSERT(N >= 0 and N <= 7, "WGMMA wait: N must be in range [0, 7]");
asm volatile("wgmma.wait_group.sync.aligned %0;\n" :: "n"(N) : "memory");
}
union GmmaDescriptor {
__host__ __device__ constexpr GmmaDescriptor() noexcept: desc_(0) {}
__host__ __device__ constexpr GmmaDescriptor(uint64_t desc) noexcept: desc_(desc) {}
__host__ __device__ constexpr GmmaDescriptor(GmmaDescriptor const &t) noexcept: desc_(t.desc_) {}
__host__ __device__ constexpr GmmaDescriptor(GmmaDescriptor &&t) noexcept: desc_(t.desc_) {}
__host__ __device__ constexpr GmmaDescriptor &operator=(GmmaDescriptor const &t) noexcept {
desc_ = t.desc_;
return *this;
}
__host__ __device__ constexpr GmmaDescriptor &operator=(GmmaDescriptor &&t) noexcept {
desc_ = t.desc_;
return *this;
}
uint64_t desc_;
uint32_t reg32_[2];
uint16_t reg16_[4];
struct {
uint16_t start_address_: 14, : 2;
uint16_t leading_byte_offset_: 14, : 2;
uint16_t stride_byte_offset_: 14, : 2;
uint8_t : 1, base_offset_: 3, : 4;
uint8_t : 6, layout_type_: 2;
} bitfield;
// Decay to a `uint64_t`
__host__ __device__ constexpr operator uint64_t() const noexcept { return desc_; }
};
template <class PointerType>
__device__ GmmaDescriptor make_smem_desc(PointerType smem_ptr, int layout_type,
int leading_byte_offset = 0,
int stride_byte_offset = 1024) {
GmmaDescriptor desc;
auto uint_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
desc.bitfield.start_address_ = uint_ptr >> 4;
desc.bitfield.layout_type_ = layout_type;
desc.bitfield.leading_byte_offset_ = leading_byte_offset >> 4;
desc.bitfield.stride_byte_offset_ = stride_byte_offset >> 4;
desc.bitfield.base_offset_ = 0;
return desc;
}
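// Worked example (illustrative, not part of the original source): for a shared-memory
// pointer whose __cvta_generic_to_shared() offset is 0x4000, layout_type = 1 and the
// defaults above, the packed descriptor ends up with start_address_ = 0x4000 >> 4 = 0x400,
// leading_byte_offset_ = 0, stride_byte_offset_ = 1024 >> 4 = 64 and base_offset_ = 0.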
template <int N>
struct FP8MMASelector {
static constexpr auto select_type() {
if constexpr (N == 16) return SM90_64x16x32_F32E4M3E4M3_SS();
if constexpr (N == 24) return SM90_64x24x32_F32E4M3E4M3_SS();
if constexpr (N == 32) return SM90_64x32x32_F32E4M3E4M3_SS();
if constexpr (N == 40) return SM90_64x40x32_F32E4M3E4M3_SS();
if constexpr (N == 48) return SM90_64x48x32_F32E4M3E4M3_SS();
if constexpr (N == 56) return SM90_64x56x32_F32E4M3E4M3_SS();
if constexpr (N == 64) return SM90_64x64x32_F32E4M3E4M3_SS();
if constexpr (N == 72) return SM90_64x72x32_F32E4M3E4M3_SS();
if constexpr (N == 80) return SM90_64x80x32_F32E4M3E4M3_SS();
if constexpr (N == 88) return SM90_64x88x32_F32E4M3E4M3_SS();
if constexpr (N == 96) return SM90_64x96x32_F32E4M3E4M3_SS();
if constexpr (N == 104) return SM90_64x104x32_F32E4M3E4M3_SS();
if constexpr (N == 112) return SM90_64x112x32_F32E4M3E4M3_SS();
if constexpr (N == 120) return SM90_64x120x32_F32E4M3E4M3_SS();
if constexpr (N == 128) return SM90_64x128x32_F32E4M3E4M3_SS();
if constexpr (N == 192) return SM90_64x192x32_F32E4M3E4M3_SS();
}
using type = decltype(select_type());
};
} // namespace deep_gemm
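
Each tile above keeps `M * N / 128` single-precision accumulators per thread, because the 128 threads of a warp-group collectively own the 64xN output tile. A minimal compile-time check of that bookkeeping, assuming only this header:

```
// Sanity checks for kNumAccum = M * N / 128 (assumes this header is included).
using MMA128 = deep_gemm::FP8MMASelector<128>::type;  // SM90_64x128x32_F32E4M3E4M3_SS
static_assert(MMA128::kNumAccum == 64, "64x128 tile -> 64 accumulators per thread");
using MMA16 = deep_gemm::FP8MMASelector<16>::type;    // SM90_64x16x32_F32E4M3E4M3_SS
static_assert(MMA16::kNumAccum == 8, "64x16 tile -> 8 accumulators per thread");
```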


@@ -0,0 +1,103 @@
#include "utils.cuh"
namespace deep_gemm {
enum class GemmType {
Normal,
GroupedContiguous,
GroupedMasked
};
#pragma clang diagnostic push
#pragma ide diagnostic ignored "cppcoreguidelines-pro-type-member-init"
template <GemmType kGemmType,
uint32_t SHAPE_N, uint32_t BLOCK_M, uint32_t BLOCK_N,
uint32_t kNumGroups, uint32_t kNumTMAMulticast,
uint32_t kNumNBlocks = cell_div(SHAPE_N, BLOCK_N),
uint32_t kNumNBlocksPerGroup = 16>
struct Scheduler {
int current_iter = -1;
uint32_t num_aligned_m_blocks;
// For normal GEMM
// Maybe not used in the masked grouped GEMM
uint32_t num_blocks;
// For grouped GEMM
int* grouped_layout;
// Only used for masked layout
uint32_t curr_group_idx, curr_cumsum;
__device__ __forceinline__ explicit Scheduler(const uint32_t shape_m,
int* grouped_layout = nullptr) {
num_aligned_m_blocks = cell_div(shape_m, BLOCK_M);
if constexpr (kGemmType == GemmType::Normal) {
num_blocks = num_aligned_m_blocks * kNumNBlocks;
} else if (kGemmType == GemmType::GroupedContiguous) {
num_blocks = num_aligned_m_blocks * kNumNBlocks;
this->grouped_layout = grouped_layout;
} else if (kGemmType == GemmType::GroupedMasked) {
curr_group_idx = curr_cumsum = 0;
this->grouped_layout = grouped_layout;
}
}
__device__ __forceinline__ void get_swizzled_block_idx(const uint32_t num_m_blocks, int block_idx, uint32_t& m_block_idx, uint32_t& n_block_idx) {
DG_STATIC_ASSERT(kNumNBlocksPerGroup % kNumTMAMulticast == 0, "Invalid group size");
// Swizzle for better L2 usage
auto num_blocks_per_group = num_m_blocks * kNumNBlocksPerGroup;
auto group_idx = block_idx / num_blocks_per_group;
auto first_n_block_idx = group_idx * kNumNBlocksPerGroup;
auto num_n_blocks_in_group = min(kNumNBlocksPerGroup, kNumNBlocks - first_n_block_idx);
auto in_group_idx = block_idx % num_blocks_per_group;
m_block_idx = in_group_idx / num_n_blocks_in_group;
n_block_idx = first_n_block_idx + in_group_idx % num_n_blocks_in_group;
}
template <bool kIgnoreGroupedForGroupedContiguous=true>
__device__ __forceinline__ uint32_t get_global_idx(const uint32_t shape_dim, const uint32_t block_size,
const uint32_t& block_idx, const uint32_t& m_block_idx=0) {
if constexpr (kGemmType == GemmType::Normal) {
return block_idx * block_size;
} else if (kGemmType == GemmType::GroupedContiguous) {
auto offset = kIgnoreGroupedForGroupedContiguous ? 0 : __ldg(grouped_layout + m_block_idx * BLOCK_M);
return offset * shape_dim + block_idx * block_size;
} else if (kGemmType == GemmType::GroupedMasked) {
return curr_group_idx * shape_dim + block_idx * block_size;
}
}
__device__ __forceinline__ bool get_next_block(uint32_t& m_block_idx, uint32_t& n_block_idx) {
const auto next_block_idx = (++ current_iter) * gridDim.x + blockIdx.x;
if constexpr (kGemmType == GemmType::GroupedMasked) {
uint32_t num_m_blocks;
while (true) {
// End of the task
if (curr_group_idx == kNumGroups)
return false;
// Within current group
num_m_blocks = cell_div(static_cast<uint32_t>(__ldg(grouped_layout + curr_group_idx)), BLOCK_M);
auto current_m_block_cumsum = curr_cumsum + num_m_blocks;
if (next_block_idx < current_m_block_cumsum * kNumNBlocks)
break;
// Move to check the next group
curr_group_idx ++, curr_cumsum = current_m_block_cumsum;
}
get_swizzled_block_idx(num_m_blocks, next_block_idx - curr_cumsum * kNumNBlocks, m_block_idx, n_block_idx);
} else {
if (next_block_idx >= num_blocks)
return false;
get_swizzled_block_idx(num_aligned_m_blocks, next_block_idx, m_block_idx, n_block_idx);
}
return true;
}
};
#pragma clang diagnostic pop
} // namespace deep_gemm
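
To make the L2 swizzling in `get_swizzled_block_idx` concrete, here is a small host-side mirror of the same arithmetic with example numbers; the shapes below are assumptions chosen purely for illustration.

```
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Host-side mirror of Scheduler::get_swizzled_block_idx (illustration only).
// Assumed example: SHAPE_N = 4096, BLOCK_N = 128 -> kNumNBlocks = 32, grouped in chunks
// of kNumNBlocksPerGroup = 16 N-blocks so consecutive blocks reuse B tiles in L2.
int main() {
    const uint32_t kNumNBlocks = 32, kNumNBlocksPerGroup = 16;
    const uint32_t num_m_blocks = 8;        // e.g. shape_m = 1024 with BLOCK_M = 128
    const uint32_t block_idx = 200;         // linear index assigned by the persistent scheduler

    const uint32_t num_blocks_per_group = num_m_blocks * kNumNBlocksPerGroup;              // 128
    const uint32_t group_idx = block_idx / num_blocks_per_group;                           // 1
    const uint32_t first_n_block_idx = group_idx * kNumNBlocksPerGroup;                    // 16
    const uint32_t num_n_blocks_in_group = std::min(kNumNBlocksPerGroup,
                                                    kNumNBlocks - first_n_block_idx);      // 16
    const uint32_t in_group_idx = block_idx % num_blocks_per_group;                        // 72
    const uint32_t m_block_idx = in_group_idx / num_n_blocks_in_group;                     // 4
    const uint32_t n_block_idx = first_n_block_idx + in_group_idx % num_n_blocks_in_group; // 24
    std::printf("block %u -> (m_block %u, n_block %u)\n", block_idx, m_block_idx, n_block_idx);
    return 0;
}
```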


@@ -0,0 +1,96 @@
#pragma once
#include <cassert>
#include <cuda.h>
#include <cudaTypedefs.h>
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cuda/barrier>
#include "utils.cuh"
namespace deep_gemm {
template <class T>
constexpr CUtensorMapDataType get_CUtensorMapDataType() {
if constexpr (std::is_same<T, uint8_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, __nv_fp8_e4m3>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, __nv_fp8_e5m2>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, uint16_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT16;
} else if constexpr (std::is_same<T, uint32_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT32;
} else if constexpr (std::is_same<T, uint64_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT64;
} else if constexpr (std::is_same<T, int32_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_INT32;
} else if constexpr (std::is_same<T, int64_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_INT64;
} else if constexpr (std::is_same<T, __half>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT16;
} else if constexpr (std::is_same<T, float>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT32;
} else if constexpr (std::is_same<T, __nv_bfloat16>::value) {
return CU_TENSOR_MAP_DATA_TYPE_BFLOAT16;
} else if constexpr (std::is_same<T, double>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT64;
}
}
PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled() {
// Get pointer to `cuTensorMapEncodeTiled`
cudaDriverEntryPointQueryResult driver_status;
void* cuTensorMapEncodeTiled_ptr = nullptr;
#if CUDA_VERSION >= 12050
cudaGetDriverEntryPointByVersion("cuTensorMapEncodeTiled", &cuTensorMapEncodeTiled_ptr, 12000,
cudaEnableDefault, &driver_status);
#else
cudaGetDriverEntryPoint("cuTensorMapEncodeTiled", &cuTensorMapEncodeTiled_ptr,
cudaEnableDefault, &driver_status);
#endif
if (driver_status != cudaDriverEntryPointSuccess)
throw std::runtime_error("driver_status != cudaDriverEntryPointSuccess");
return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
}
template <typename T>
CUtensorMap make_2d_tma_copy_desc(T* global_address, uint64_t gmem_dim[2],
uint64_t stride_in_bytes, uint32_t smem_dim[2],
CUtensorMapSwizzle swizzle_type,
PFN_cuTensorMapEncodeTiled encode_func = nullptr) {
CUtensorMap tensor_map{};
constexpr uint32_t rank = 2;
uint64_t global_stride[rank - 1] = {stride_in_bytes};
uint32_t elem_strides[rank] = {1, 1};
if (encode_func == nullptr)
encode_func = get_cuTensorMapEncodeTiled();
auto result = encode_func(
&tensor_map, get_CUtensorMapDataType<typename std::remove_cv<T>::type>(), rank,
global_address, gmem_dim, global_stride, smem_dim, elem_strides,
CUtensorMapInterleave::CU_TENSOR_MAP_INTERLEAVE_NONE, swizzle_type,
CUtensorMapL2promotion::CU_TENSOR_MAP_L2_PROMOTION_L2_256B,
CUtensorMapFloatOOBfill::CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
DG_HOST_ASSERT(result == CUDA_SUCCESS);
return tensor_map;
}
template <uint32_t kNumTMAMulticast = 1>
__device__ __forceinline__ void
tma_copy(void const* desc_ptr, uint64_t* barrier_ptr, void* smem_ptr,
int32_t const& crd_0, int32_t const& crd_1) {
constexpr auto cache_hint = static_cast<uint64_t>(cute::TMA::CacheHintSm90::EVICT_NORMAL);
if constexpr (kNumTMAMulticast == 1) {
cute::SM90_TMA_LOAD_2D::copy(desc_ptr, barrier_ptr, cache_hint, smem_ptr, crd_0, crd_1);
} else if (cute::block_rank_in_cluster() == 0) {
cute::SM90_TMA_LOAD_MULTICAST_2D::copy(desc_ptr, barrier_ptr, (1 << kNumTMAMulticast) - 1, cache_hint, smem_ptr, crd_0, crd_1);
}
}
} // namespace deep_gemm
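
A hedged host-side sketch of calling `make_2d_tma_copy_desc` for a row-major bf16 output matrix, mirroring the un-swizzled D descriptor built by the GEMM wrapper; the 1024x4096 shape and the 128x128 shared-memory box are example assumptions.

```
#include <cuda_bf16.h>
// Sketch only (assumes this header is included): descriptor for a 1024 x 4096
// row-major __nv_bfloat16 matrix staged through 128 x 128 shared-memory boxes.
CUtensorMap make_example_d_desc(__nv_bfloat16* gmem_d) {
    uint64_t gmem_dim[2] = {4096, 1024};   // {columns, rows} for a row-major layout
    uint32_t smem_dim[2] = {128, 128};     // {BLOCK_N, BLOCK_M}
    return deep_gemm::make_2d_tma_copy_desc(
        gmem_d, gmem_dim, /*stride_in_bytes=*/4096 * sizeof(__nv_bfloat16),
        smem_dim, CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE);
}
```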


@@ -0,0 +1,48 @@
#pragma once
#include <cstdio>
#include <exception>
#include <string>
#ifdef __CLION_IDE__
__host__ __device__ __forceinline__ void host_device_printf(const char* format, ...) { asm volatile("trap;"); }
#define printf host_device_printf
#endif
class AssertionException : public std::exception {
private:
std::string message{};
public:
explicit AssertionException(const std::string& message) : message(message) {}
const char *what() const noexcept override { return message.c_str(); }
};
#ifndef DG_HOST_ASSERT
#define DG_HOST_ASSERT(cond) \
do { \
if (not (cond)) { \
printf("Assertion failed: %s:%d, condition: %s\n", \
__FILE__, __LINE__, #cond); \
throw AssertionException("Assertion failed: " #cond); \
} \
} while (0)
#endif
#ifndef DG_DEVICE_ASSERT
#define DG_DEVICE_ASSERT(cond) \
do { \
if (not (cond)) { \
printf("Assertion failed: %s:%d, condition: %s\n", __FILE__, __LINE__, #cond); \
asm("trap;"); \
} \
} while (0)
#endif
#ifndef DG_STATIC_ASSERT
#define DG_STATIC_ASSERT(cond, reason) static_assert(cond, reason)
#endif
template <typename T>
__device__ __host__ constexpr T cell_div(T a, T b) {
return (a + b - 1) / b;
}
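
`cell_div` is plain ceiling division; a few compile-time checks make the rounding behaviour explicit (assuming this header is included):

```
// cell_div rounds up to the next whole block.
static_assert(cell_div(10, 4) == 3, "10 elements need 3 blocks of 4");
static_assert(cell_div(12, 4) == 3, "exact multiples are unchanged");
static_assert(cell_div(1, 128) == 1, "any remainder still costs one block");
```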


@@ -0,0 +1,3 @@
from .compiler import get_nvcc_compiler, build
from .template import cpp_format, generate
from .runtime import Runtime


@@ -0,0 +1,146 @@
import hashlib
import functools
import os
import re
import subprocess
import uuid
from torch.utils.cpp_extension import CUDA_HOME
from typing import Tuple
from .runtime import Runtime, RuntimeCache
from .template import typename_map
runtime_cache = RuntimeCache()
def hash_to_hex(s: str) -> str:
md5 = hashlib.md5()
md5.update(s.encode('utf-8'))
return md5.hexdigest()[0:12]
@functools.lru_cache(maxsize=None)
def get_jit_include_dir() -> str:
return f'{os.path.dirname(os.path.abspath(__file__))}/../include'
@functools.lru_cache(maxsize=None)
def get_deep_gemm_version() -> str:
# Hash all CUDA headers under the include directory to derive a version tag
include_dir = f'{get_jit_include_dir()}/deep_gemm'
assert os.path.exists(include_dir), f'Cannot find GEMM include directory {include_dir}'
md5 = hashlib.md5()
for filename in filter(lambda x: x.endswith('.cuh'), sorted(os.listdir(include_dir))):
with open(f'{include_dir}/{filename}', 'rb') as f:
md5.update(f.read())
return md5.hexdigest()[0:12]
@functools.lru_cache(maxsize=None)
def get_nvcc_compiler() -> Tuple[str, str]:
paths = []
if os.getenv('DG_NVCC_COMPILER'):
paths.append(os.getenv('DG_NVCC_COMPILER'))
paths.append(f'{CUDA_HOME}/bin/nvcc')
# Try to find the first available NVCC compiler
least_version_required = '12.3'
version_pattern = re.compile(r'release (\d+\.\d+)')
for path in paths:
if os.path.exists(path):
match = version_pattern.search(os.popen(f'{path} --version').read())
assert match, f'Cannot get the version of NVCC compiler {path}'
version = match.group(1)
assert version >= least_version_required, f'NVCC {path} version {version} is lower than {least_version_required}'
return path, version
raise RuntimeError('Cannot find any available NVCC compiler')
@functools.lru_cache(maxsize=None)
def get_default_user_dir():
if 'DG_CACHE_DIR' in os.environ:
path = os.getenv('DG_CACHE_DIR')
os.makedirs(path, exist_ok=True)
return path
return os.path.expanduser('~') + '/.deep_gemm'
@functools.lru_cache(maxsize=None)
def get_tmp_dir():
return f'{get_default_user_dir()}/tmp'
@functools.lru_cache(maxsize=None)
def get_cache_dir():
return f'{get_default_user_dir()}/cache'
def make_tmp_dir():
tmp_dir = get_tmp_dir()
os.makedirs(tmp_dir, exist_ok=True)
return tmp_dir
def put(path, data, is_binary=False):
# Write and do POSIX atomic replace
tmp_file_path = f'{make_tmp_dir()}/file.tmp.{str(uuid.uuid4())}.{hash_to_hex(path)}'
with open(tmp_file_path, 'wb' if is_binary else 'w') as f:
f.write(data)
os.replace(tmp_file_path, path)
def build(name: str, arg_defs: tuple, code: str) -> Runtime:
# Compiler flags
nvcc_flags = ['-std=c++17', '-shared', '-O3', '--expt-relaxed-constexpr', '--expt-extended-lambda',
'-gencode=arch=compute_90a,code=sm_90a',
'--ptxas-options=--register-usage-level=10' + (',--verbose' if 'DG_PTXAS_VERBOSE' in os.environ else ''),
# Suppress some unnecessary warnings, such as unused variables for certain `constexpr` branch cases
'--diag-suppress=177,174,940']
cxx_flags = ['-fPIC', '-O3', '-Wno-deprecated-declarations', '-Wno-abi']
flags = [*nvcc_flags, f'--compiler-options={",".join(cxx_flags)}']
include_dirs = [get_jit_include_dir()]
# Build signature
enable_sass_opt = get_nvcc_compiler()[1] <= '12.8' and int(os.getenv('DG_DISABLE_FFMA_INTERLEAVE', 0)) == 0
signature = f'{name}$${get_deep_gemm_version()}$${code}$${get_nvcc_compiler()}$${flags}$${enable_sass_opt}'
name = f'kernel.{name}.{hash_to_hex(signature)}'
path = f'{get_cache_dir()}/{name}'
# Check runtime cache or file system hit
global runtime_cache
if runtime_cache[path] is not None:
if os.getenv('DG_JIT_DEBUG', None):
print(f'Using cached JIT runtime {name} during build')
return runtime_cache[path]
# Write the code
os.makedirs(path, exist_ok=True)
args_path = f'{path}/kernel.args'
src_path = f'{path}/kernel.cu'
put(args_path, ', '.join([f"('{arg_def[0]}', {typename_map[arg_def[1]]})" for arg_def in arg_defs]))
put(src_path, code)
# Compile into a temporary SO file
so_path = f'{path}/kernel.so'
tmp_so_path = f'{make_tmp_dir()}/nvcc.tmp.{str(uuid.uuid4())}.{hash_to_hex(so_path)}.so'
# Compile
command = [get_nvcc_compiler()[0],
src_path, '-o', tmp_so_path,
*flags,
*[f'-I{d}' for d in include_dirs]]
if os.getenv('DG_JIT_DEBUG', None) or os.getenv('DG_JIT_PRINT_NVCC_COMMAND', False):
print(f'Compiling JIT runtime {name} with command {command}')
assert subprocess.check_call(command) == 0, f'Failed to compile {src_path}'
# Interleave FFMA reuse
if enable_sass_opt:
pass
# Atomic replace SO file
os.replace(tmp_so_path, so_path)
# Put cache and return
runtime_cache[path] = Runtime(path)
return runtime_cache[path]
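
For orientation, here is a minimal sketch of driving this build path end to end with a trivial host-only "kernel". It assumes the `deep_gemm.jit` package path implied by the imports elsewhere in this PR, an NVCC >= 12.3, and the packaged JIT include directory (with CUTLASS headers); the kernel body is a placeholder, not a real DeepGEMM kernel:

```
from deep_gemm.jit import build, generate   # assumed package path

arg_defs = (('m', int), ('n', int))                # built-in types only, so no tensors are needed
body = 'std::cout << m * n << std::endl;\n__return_code = 0;\n'
code = generate((), arg_defs, body)                # wraps the body in an `extern "C" launch(...)`
runtime = build('debug_mul', arg_defs, code)       # compiled once, then cached under the cache dir
assert runtime(16, 8) == 0                         # dispatches via ctypes and returns `__return_code`
```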


@@ -0,0 +1,66 @@
import ctypes
import os
import torch
from typing import Optional
from .template import map_ctype
class Runtime:
def __init__(self, path: str) -> None:
self.path = path
self.lib = None
self.args = None
assert self.is_path_valid(self.path)
@staticmethod
def is_path_valid(path: str) -> bool:
# Exists and is a directory
if not os.path.exists(path) or not os.path.isdir(path):
return False
# Contains all necessary files
files = ['kernel.cu', 'kernel.args', 'kernel.so']
return all(os.path.exists(os.path.join(path, file)) for file in files)
def __call__(self, *args) -> int:
# Load SO file
if self.lib is None or self.args is None:
self.lib = ctypes.CDLL(os.path.join(self.path, 'kernel.so'))
with open(os.path.join(self.path, 'kernel.args'), 'r') as f:
self.args = eval(f.read())
# Check args and launch
assert len(args) == len(self.args), f'Expected {len(self.args)} arguments, got {len(args)}'
cargs = []
for arg, (name, dtype) in zip(args, self.args):
if isinstance(arg, torch.Tensor):
assert arg.dtype == dtype, f'Expected tensor dtype `{dtype}` for `{name}`, got `{arg.dtype}`'
else:
assert isinstance(arg, dtype), f'Expected built-in type `{dtype}` for `{name}`, got `{type(arg)}`'
cargs.append(map_ctype(arg))
return_code = ctypes.c_int(0)
self.lib.launch(*cargs, ctypes.byref(return_code))
return return_code.value
class RuntimeCache:
def __init__(self) -> None:
self.cache = {}
def __getitem__(self, path: str) -> Optional[Runtime]:
# In Python runtime
if path in self.cache:
return self.cache[path]
# Already compiled
if os.path.exists(path) and Runtime.is_path_valid(path):
runtime = Runtime(path)
self.cache[path] = runtime
return runtime
return None
def __setitem__(self, path, runtime) -> None:
self.cache[path] = runtime
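
A short sketch of how a previously built runtime is picked up again; the `path` below is a hypothetical directory produced by an earlier `build` call, and the module path is assumed:

```
from deep_gemm.jit import Runtime   # `RuntimeCache` is defined alongside `Runtime`

path = '/path/to/.deep_gemm/cache/kernel.gemm_fp8_fp8_bf16_nt.0123456789ab'   # hypothetical
if Runtime.is_path_valid(path):     # kernel.cu, kernel.args and kernel.so are all present
    runtime = Runtime(path)
    # The first call lazily loads kernel.so, checks each argument against kernel.args,
    # maps it with `map_ctype`, and returns the kernel's integer return code.
```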


@@ -0,0 +1,93 @@
import copy
import ctypes
import os
import torch
from typing import Any, Iterable, Dict, Tuple
# Name map for Python `eval`
typename_map: Dict[Any, str] = {
**{t: t.__name__ for t in (bool, int, float)},
torch.int: 'torch.int',
torch.float: 'torch.float',
torch.bfloat16: 'torch.bfloat16',
torch.float8_e4m3fn: 'torch.float8_e4m3fn',
torch.cuda.Stream: 'torch.cuda.Stream',
}
# `ctype` map for Python casting
ctype_map: Dict[Any, Any] = {
**{t: getattr(ctypes, f'c_{t.__name__}') for t in (bool, int, float)},
**{t: ctypes.c_void_p for t in (torch.int, torch.float, torch.bfloat16, torch.float8_e4m3fn, torch.cuda.Stream)},
}
# Type map for both Python API and source code usages
genc_map = {
bool: ('bool', 'bool'),
int: ('int', 'int'),
float: ('float', 'float'),
torch.int: ('void*', 'int*'),
torch.float: ('void*', 'float*'),
torch.bfloat16: ('void*', '__nv_bfloat16*'),
torch.float8_e4m3fn: ('void*', '__nv_fp8_e4m3*'),
torch.cuda.Stream: ('void*', 'cudaStream_t'),
}
def map_ctype(value: Any) -> Any:
ctype = ctype_map[value.dtype if isinstance(value, torch.Tensor) else type(value)]
if isinstance(value, torch.Tensor):
return ctype(value.data_ptr())
if isinstance(value, torch.cuda.Stream):
return ctype(value.cuda_stream)
return ctype(value)
def cpp_format(template: str, keys: Dict[str, Any]) -> str:
# We don't use `str.format` because it's not safe for C++ {} braces
new_template = copy.deepcopy(template)
for key, value in keys.items():
new_template = new_template.replace(f'{{{key}}}', f'{value}')
return new_template
def generate(includes: Iterable[str], arg_defs: Iterable[Tuple], body: str) -> str:
# Common prefix
code = '// DeepGEMM auto-generated JIT CUDA source file\n\n'
# Includes
preload_sys_includes = ['<cuda.h>', '<cuda_fp8.h>', '<cuda_runtime.h>', '<iostream>']
preload_package_includes = ['"cutlass/cutlass.h"']
assert isinstance(includes, list) or isinstance(includes, tuple)
sys_includes = sorted(list(set(preload_sys_includes + [include for include in includes if include.startswith('<')])))
package_includes = sorted(list(set(preload_package_includes + [include for include in includes if include.startswith('"')])))
code += '\n'.join(f'#include {include}' for include in sys_includes) + '\n\n'
code += '\n'.join(f'#include {include}' for include in package_includes) + '\n\n'
# Function signature
raw = '__raw_'
get_def = lambda n, t: f'{genc_map[t][0]} ' + (raw if genc_map[t][0] != genc_map[t][1] else '') + n
code += f'extern "C" void launch('
code += ', '.join([get_def(*arg_def) for arg_def in arg_defs] + ['int& __return_code', ])
code += ') {\n'
# Cast raw types
code += ' // Cast raw types (if needed)\n'
for arg_name, arg_type in arg_defs:
if genc_map[arg_type][0] != genc_map[arg_type][1]:
code += f' auto {arg_name} = reinterpret_cast<{genc_map[arg_type][1]}>({raw}{arg_name});\n'
# Function body
code += '\n'.join([((' ' if line else '') + line) for line in body.split('\n')])
# End the function
code += '}\n\n'
# Debug print
if os.getenv('DG_JIT_DEBUG', None):
print(f'Generated code:\n{code}')
return code
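
A small sketch of what `generate` produces for a mixed tensor/scalar signature (module path assumed; `torch` is required because `genc_map` is keyed by torch dtypes):

```
import torch
from deep_gemm.jit import generate   # assumed package path

src = generate(includes=(),
               arg_defs=(('out', torch.bfloat16), ('m', int)),
               body='// kernel launch would go here\n')
print(src)
# The emitted wrapper is roughly:
#   extern "C" void launch(void* __raw_out, int m, int& __return_code) {
#     auto out = reinterpret_cast<__nv_bfloat16*>(__raw_out);
#     ...
#   }
```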


@@ -0,0 +1,10 @@
from .gemm import gemm_fp8_fp8_bf16_nt
from .m_grouped_gemm import (
m_grouped_gemm_fp8_fp8_bf16_nt_contiguous,
m_grouped_gemm_fp8_fp8_bf16_nt_masked
)
from .utils import (
cell_div, set_num_sms, get_num_sms,
get_col_major_tma_aligned_tensor,
get_m_alignment_for_contiguous_layout
)


@@ -0,0 +1,171 @@
import torch
from typing import Tuple
from .tuner import jit_tuner
from .utils import get_num_sms, cell_div, get_col_major_tma_aligned_tensor, get_m_alignment_for_contiguous_layout
# C++ code templates
includes = ('"deep_gemm/fp8_gemm.cuh"', )
template = """
using namespace deep_gemm;
// Templated args from Python JIT call
constexpr auto N = {N}, K = {K};
constexpr auto BLOCK_M = {BLOCK_M};
constexpr auto BLOCK_N = {BLOCK_N};
constexpr auto kNumStages = {NUM_STAGES};
constexpr auto kNumTMAMulticast = {NUM_TMA_MULTICAST};
// Make a templated GEMM
using GemmType = Gemm<N, K, BLOCK_M, BLOCK_N, 128, 1, kNumStages, kNumTMAMulticast, GemmType::Normal>;
// Launch kernel
auto tma_a_desc = GemmType::make_2d_tma_a_desc(lhs, m);
auto tma_b_desc = GemmType::make_2d_tma_b_desc(rhs);
auto tma_scales_a_desc = GemmType::make_2d_tma_scales_a_desc(lhs_scales, m);
auto tma_d_desc = GemmType::make_2d_tma_d_desc(out, m);
GemmType::run(out, rhs_scales, nullptr,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, smem_size);
"""
def is_tma_multicast_legal(n: int, block_n: int, num_tma_multicast: int, num_sms: int) -> bool:
if num_tma_multicast == 1:
return True
return (n % (block_n * num_tma_multicast) == 0) and num_sms % num_tma_multicast == 0
def get_smem_size(num_stages: int, k: int, block_m: int, block_n: int, block_k: int = 128) -> int:
smem_d = block_m * block_n * 2
smem_a_per_stage = block_m * block_k
smem_scales_a_per_stage = block_m * 4
smem_b_per_stage = block_n * block_k
smem_scales_b = cell_div(k, block_k) * 4
smem_barrier = num_stages * 8 * 2
smem_size = 0
smem_size += smem_d
smem_size += num_stages * smem_a_per_stage
smem_size += num_stages * smem_scales_a_per_stage
smem_size += num_stages * smem_b_per_stage
smem_size += smem_scales_b * (1 if block_k % block_n == 0 else 2)
smem_size += smem_barrier
return smem_size
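
As a concrete instance of the budget above: with `k = 7168`, `block_m = block_n = 128` and `num_stages = 5`, the terms are 32768 (D) + 5*16384 (A) + 5*512 (A scales) + 5*16384 (B) + 224 (B scales) + 80 (barriers) = 199472 bytes, which fits the 232448-byte SM90 capacity used in `get_best_configs` below:

```
assert get_smem_size(num_stages=5, k=7168, block_m=128, block_n=128) == 199472
```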
def get_best_configs(m: int, n: int, k: int, num_groups: int, num_sms: int,
is_grouped_contiguous: bool = False) -> Tuple[int, int, int, int, int]:
if not is_grouped_contiguous:
# TODO: for some cases, smaller M block is better, add them into tuning space
block_ms = (64 if m <= 64 else 128, )
else:
block_ms = (get_m_alignment_for_contiguous_layout(), )
block_ns = tuple(range(16, 129, 8))
fix_wave_saturate = lambda x: num_sms if x == 0 else x
get_num_waves = lambda bm, bn: (cell_div(cell_div(m, bm) * cell_div(n, bn) * num_groups, num_sms) if bm else None)
get_last_wave_util = lambda bm, bn: fix_wave_saturate((cell_div(m, bm) * cell_div(n, bn) * num_groups) % num_sms)
# Decide block sizes by waves
best_block_m, best_block_n = None, None
for block_m in block_ms:
for block_n in block_ns:
success = False
num_waves, best_num_waves = get_num_waves(block_m, block_n), get_num_waves(best_block_m, best_block_n)
if best_block_m is None or best_block_n is None:
success = True
elif num_waves < best_num_waves:
success = True
elif num_waves == best_num_waves:
# Check last wave utilization
util = get_last_wave_util(block_m, block_n)
best_util = get_last_wave_util(best_block_m, best_block_n)
success = util > best_util or (util == best_util and (block_n >= best_block_n and block_m <= best_block_m))
best_block_m, best_block_n = (block_m, block_n) if success else (best_block_m, best_block_n)
assert best_block_m is not None and best_block_n is not None
# Always pick the largest number of stages that still fits in shared memory
# NOTES: for double B scales, the best number of stages may be reduced
best_num_stages, best_smem_size, sm90_capacity = None, None, 232448
for num_stages in (6, 5, 4) if 128 % best_block_n != 0 else (8, 7, 6, 5, 4):
best_smem_size = get_smem_size(num_stages, k, best_block_m, best_block_n)
if best_smem_size <= sm90_capacity:
best_num_stages = num_stages
break
assert best_num_stages is not None
# Decide the number of TMA multicast
best_num_tma_multicast = 1
if m >= 1024 and is_tma_multicast_legal(n, best_block_n, 2, num_sms) and num_groups == 1:
best_num_tma_multicast = 2
return best_block_m, best_block_n, best_num_stages, best_num_tma_multicast, best_smem_size
def gemm_fp8_fp8_bf16_nt(lhs: Tuple[torch.Tensor, torch.Tensor],
rhs: Tuple[torch.Tensor, torch.Tensor],
out: torch.Tensor) -> None:
"""
Do a normal GEMM with FP8 inputs and BF16 output, with 1x128 LHS scaling and 128x128 RHS scaling.
LHS, RHS, RHS scaling factors, and output tensors must be in contiguous format.
RHS and RHS scaling factors are required to be transposed.
The LHS scaling tensor requires a TMA-aligned, transposed format; if your input does not match this requirement,
this function will transpose it with a set of slow PyTorch operations.
Arguments:
lhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[m, k]`,
the second element is an FP32 1x128 scaling tensor for LHS of shape `[m, ⌈k / 128⌉]`.
rhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[n, k]`.
the second element is an FP32 128x128 scaling tensor for RHS of shape `[⌈n / 128⌉, ⌈k / 128⌉]`.
out: the BF16 output tensor of shape `[m, n]`, representing the result.
"""
lhs, lhs_scales = lhs
rhs, rhs_scales = rhs
m, k = lhs.shape
n, k_ = rhs.shape
m_, n_ = out.shape
assert n % 64 == 0 and k % 128 == 0
# Type and shape checks
assert m == m_ and n == n_ and k == k_
assert n > 0 and k > 0
assert lhs_scales.shape == (m, (k + 127) // 128)
assert rhs_scales.shape == ((n + 127) // 128, (k + 127) // 128)
assert lhs.dtype == torch.float8_e4m3fn and lhs_scales.dtype == torch.float32
assert rhs.dtype == torch.float8_e4m3fn and rhs_scales.dtype == torch.float32
assert out.dtype == torch.bfloat16
assert lhs.is_contiguous() and rhs.is_contiguous() and out.is_contiguous()
# LHS scales must be transposed for TMA load, but not for RHS scales
# NOTES: `get_col_major_tma_aligned_tensor` may launch a kernel if the scales were not already processed by previous kernels
lhs_scales = get_col_major_tma_aligned_tensor(lhs_scales)
assert rhs_scales.is_contiguous()
# Do nothing if `m` is zero
if m == 0:
return
# Auto-tuning with compilation
global includes, template
num_sms = get_num_sms()
block_m, block_n, num_stages, num_tma_multicast, smem_size = get_best_configs(m, n, k, 1, num_sms)
args = (lhs, lhs_scales, rhs, rhs_scales, out, m, torch.cuda.current_stream(), num_sms, smem_size)
runtime = jit_tuner.compile_and_tune(
name='gemm_fp8_fp8_bf16_nt',
keys={'N': n, 'K': k, 'BLOCK_M': block_m, 'BLOCK_N': block_n,
'NUM_STAGES': num_stages, 'NUM_TMA_MULTICAST': num_tma_multicast},
space=(),
includes=includes,
arg_defs=(('lhs', torch.float8_e4m3fn), ('lhs_scales', torch.float),
('rhs', torch.float8_e4m3fn), ('rhs_scales', torch.float),
('out', torch.bfloat16), ('m', int),
('stream', torch.cuda.Stream), ('num_sms', int), ('smem_size', int)),
template=template,
args=args
)
# Run the kernel
runtime(*args)
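
A minimal shape-only usage sketch for this API, assuming an SM90 GPU, a PyTorch build with `float8_e4m3fn`, and a top-level `deep_gemm` re-export; the casts and all-ones scales below are placeholders for a real per-block quantization routine:

```
import torch
from deep_gemm import gemm_fp8_fp8_bf16_nt   # assumed top-level re-export

m, k, n = 128, 512, 256                      # k % 128 == 0 and n % 64 == 0, as asserted above
lhs = torch.randn(m, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
rhs = torch.randn(n, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
lhs_scales = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)
rhs_scales = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)
gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)
```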


@@ -0,0 +1,182 @@
import torch
from typing import Tuple
from .gemm import get_best_configs
from .tuner import jit_tuner
from .utils import get_col_major_tma_aligned_tensor, get_num_sms
# C++ code templates
includes = ('"deep_gemm/fp8_gemm.cuh"', )
template = """
using namespace deep_gemm;
// Templated args from Python JIT call
constexpr auto N = {N}, K = {K};
constexpr auto BLOCK_M = {BLOCK_M};
constexpr auto BLOCK_N = {BLOCK_N};
constexpr auto kNumStages = {NUM_STAGES};
constexpr auto kNumTMAMulticast = {NUM_TMA_MULTICAST};
// Make a templated grouped GEMM
using GemmType = Gemm<N, K, BLOCK_M, BLOCK_N, 128, {NUM_GROUPS}, kNumStages, kNumTMAMulticast, GemmType::{GEMM_TYPE}>;
// Launch kernel
auto tma_a_desc = GemmType::make_2d_tma_a_desc(lhs, m);
auto tma_b_desc = GemmType::make_2d_tma_b_desc(rhs);
auto tma_scales_a_desc = GemmType::make_2d_tma_scales_a_desc(lhs_scales, m);
auto tma_d_desc = GemmType::make_2d_tma_d_desc(out, m);
GemmType::run(out, rhs_scales, grouped_layout,
m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc,
stream, num_sms, smem_size);
"""
def m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(lhs: Tuple[torch.Tensor, torch.Tensor],
rhs: Tuple[torch.Tensor, torch.Tensor],
out: torch.Tensor, m_indices: torch.Tensor) -> None:
"""
Do a grouped GEMM (contiguous format) with FP8 inputs and BF16 output, with 1x128 LHS scaling and 128x128 RHS scaling.
LHS, RHS, RHS scaling factors, and output tensors must be in contiguous format.
RHS and RHS scaling factors are required to be transposed.
The LHS scaling tensor requires a TMA-aligned, transposed format; if your input does not match this requirement,
this function will transpose it with a set of slow PyTorch operations.
On the M axis, inputs are grouped into several batches whose batch sizes are aligned to
`get_m_alignment_for_contiguous_layout()` (128).
Arguments:
lhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[m_sum, k]`,
the second element is an FP32 1x128 scaling tensor for LHS of shape `[m_sum, ⌈k / 128⌉]`.
rhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[num_groups, n, k]`.
the second element is an FP32 128x128 scaling tensor for RHS of shape `[num_groups, ⌈n / 128⌉, ⌈k / 128⌉]`.
out: the BF16 output tensor of shape `[m_sum, n]`, representing the result.
m_indices: a tensor of shape `[m_sum]` with type `torch.int`.
`m_indices[i]` records the group to which the i-th row of the LHS belongs,
which means that the i-th row of the LHS matrix will be multiplied by `rhs[m_indices[i]]`.
Values of `m_indices` within each M-alignment block must also be the same.
`-1` in this tensor indicates that no RHS matrix is selected, and the kernel will skip the computation for that aligned block.
"""
lhs, lhs_scales = lhs
rhs, rhs_scales = rhs
m, k = lhs.shape
num_groups, n, k_ = rhs.shape
m_, n_ = out.shape
m__ = m_indices.numel()
# Type and shape checks
assert m == m_ == m__ and k == k_ and n == n_
assert lhs_scales.shape == (m, (k + 127) // 128)
assert rhs_scales.shape == (num_groups, (n + 127) // 128, (k + 127) // 128)
assert lhs.dtype == torch.float8_e4m3fn and lhs_scales.dtype == torch.float32
assert rhs.dtype == torch.float8_e4m3fn and rhs_scales.dtype == torch.float32
assert out.dtype == torch.bfloat16
assert m_indices.dtype == torch.int32
assert lhs.is_contiguous() and rhs.is_contiguous()
assert out.is_contiguous() and m_indices.is_contiguous()
# LHS scales must be transposed for TMA load, but not for RHS scales
lhs_scales = get_col_major_tma_aligned_tensor(lhs_scales)
assert rhs_scales.is_contiguous()
# Do nothing if `m` is zero
if m == 0:
return
# Auto-tuning with compilation
global includes, template
num_sms = get_num_sms()
block_m, block_n, num_stages, num_tma_multicast, smem_size = get_best_configs(m, n, k, 1, num_sms,
is_grouped_contiguous=True)
args = (lhs, lhs_scales, rhs, rhs_scales, out,
m_indices, m, num_groups,
torch.cuda.current_stream(), num_sms, smem_size)
runtime = jit_tuner.compile_and_tune(
name='m_grouped_gemm_fp8_fp8_bf16_nt',
keys={'N': n, 'K': k, 'BLOCK_M': block_m, 'BLOCK_N': block_n, 'NUM_GROUPS': num_groups,
'NUM_STAGES': num_stages, 'NUM_TMA_MULTICAST': num_tma_multicast, 'GEMM_TYPE': 'GroupedContiguous'},
space=(),
includes=includes,
arg_defs=(('lhs', torch.float8_e4m3fn), ('lhs_scales', torch.float),
('rhs', torch.float8_e4m3fn), ('rhs_scales', torch.float),
('out', torch.bfloat16),
('grouped_layout', torch.int32), ('m', int), ('num_groups', int),
('stream', torch.cuda.Stream), ('num_sms', int), ('smem_size', int)),
template=template,
args=args
)
# Run the kernel
runtime(*args)
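
A shape-only sketch of the contiguous grouped layout (same assumptions and placeholder quantization as the plain GEMM sketch above):

```
import torch
from deep_gemm import m_grouped_gemm_fp8_fp8_bf16_nt_contiguous   # assumed re-export

num_groups, k, n = 2, 512, 256
m_sum = num_groups * 128                     # each group padded to the 128-row alignment
lhs = torch.randn(m_sum, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
rhs = torch.randn(num_groups, n, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
lhs_scales = torch.ones(m_sum, k // 128, device='cuda', dtype=torch.float32)
rhs_scales = torch.ones(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(m_sum, n, device='cuda', dtype=torch.bfloat16)
# Rows [0, 128) belong to group 0, rows [128, 256) to group 1
m_indices = torch.arange(num_groups, device='cuda', dtype=torch.int32).repeat_interleave(128)
m_grouped_gemm_fp8_fp8_bf16_nt_contiguous((lhs, lhs_scales), (rhs, rhs_scales), out, m_indices)
```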
def m_grouped_gemm_fp8_fp8_bf16_nt_masked(lhs: Tuple[torch.Tensor, torch.Tensor],
rhs: Tuple[torch.Tensor, torch.Tensor],
out: torch.Tensor, masked_m: torch.Tensor, expected_m: int) -> None:
"""
Do a grouped GEMM (masked format) with FP8 inputs and BF16 output, with 1x128 LHS scaling and 128x128 RHS scaling.
LHS, RHS, RHS scaling factors, and output tensors must be in contiguous format.
RHS and RHS scaling factors are required to be transposed.
The LHS scaling tensor requires a TMA-aligned, transposed format; if your input does not match this requirement,
this function will transpose it with a set of slow PyTorch operations.
Moreover, this alignment requirement is different from the contiguous-format kernel, as we require each batch
to be transposed separately.
Arguments:
lhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[num_groups, m_max, k]`,
the second element is an FP32 1x128 scaling tensor for LHS of shape `[num_groups, m_max, ⌈k / 128⌉]`.
rhs: the first element is an FP8 tensor (typed `torch.float8_e4m3fn`) of shape `[num_groups, n, k]`.
the second element is an FP32 128x128 scaling tensor for RHS of shape `[num_groups, ⌈n / 128⌉, ⌈k / 128⌉]`.
out: the BF16 output tensor of shape `[num_groups, m_max, n]`, representing the result.
masked_m: a tensor of shape `[num_groups]`; `masked_m[i]` records the actual number of rows of `lhs[i]` to compute
in the i-th group.
expected_m: a CPU-side hint for the expected M of each batch;
setting this value correctly may lead to better performance.
"""
lhs, lhs_scales = lhs
rhs, rhs_scales = rhs
num_groups, m, k = lhs.shape
num_groups_, n, k_ = rhs.shape
num_groups__, m_, n_ = out.shape
num_groups___ = masked_m.numel()
# Type and shape checks
assert num_groups == num_groups_ == num_groups__ == num_groups___
assert m == m_ and n == n_ and k == k_
assert expected_m > 0 and m > 0 and n > 0 and k > 0 and num_groups > 0
assert lhs_scales.shape == (num_groups, m, (k + 127) // 128)
assert rhs_scales.shape == (num_groups, (n + 127) // 128, (k + 127) // 128)
assert lhs.dtype == torch.float8_e4m3fn and lhs_scales.dtype == torch.float32
assert rhs.dtype == torch.float8_e4m3fn and rhs_scales.dtype == torch.float32
assert out.dtype == torch.bfloat16
assert masked_m.dtype == torch.int32
assert lhs.is_contiguous() and rhs.is_contiguous()
assert out.is_contiguous() and masked_m.is_contiguous()
# LHS scales must be transposed for TMA load, but not for RHS scales
lhs_scales = get_col_major_tma_aligned_tensor(lhs_scales)
assert rhs_scales.is_contiguous()
# Auto-tuning with compilation
global includes, template
num_sms = get_num_sms()
block_m, block_n, num_stages, num_tma_multicast, smem_size = get_best_configs(expected_m, n, k, num_groups, num_sms)
args = (lhs, lhs_scales, rhs, rhs_scales, out,
masked_m, m,
torch.cuda.current_stream(), num_sms, smem_size)
runtime = jit_tuner.compile_and_tune(
name='m_grouped_gemm_fp8_fp8_bf16_nt',
keys={'N': n, 'K': k, 'BLOCK_M': block_m, 'BLOCK_N': block_n, 'NUM_GROUPS': num_groups,
'NUM_STAGES': num_stages, 'NUM_TMA_MULTICAST': num_tma_multicast, 'GEMM_TYPE': 'GroupedMasked'},
space=(),
includes=includes,
arg_defs=(('lhs', torch.float8_e4m3fn), ('lhs_scales', torch.float),
('rhs', torch.float8_e4m3fn), ('rhs_scales', torch.float),
('out', torch.bfloat16),
('grouped_layout', torch.int32), ('m', int),
('stream', torch.cuda.Stream), ('num_sms', int), ('smem_size', int)),
template=template,
args=args
)
# Run the kernel
runtime(*args)
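
And a shape-only sketch of the masked layout (same assumptions as above):

```
import torch
from deep_gemm import m_grouped_gemm_fp8_fp8_bf16_nt_masked   # assumed re-export

num_groups, m_max, k, n = 2, 256, 512, 256
lhs = torch.randn(num_groups, m_max, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
rhs = torch.randn(num_groups, n, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
lhs_scales = torch.ones(num_groups, m_max, k // 128, device='cuda', dtype=torch.float32)
rhs_scales = torch.ones(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(num_groups, m_max, n, device='cuda', dtype=torch.bfloat16)
masked_m = torch.tensor([192, 64], device='cuda', dtype=torch.int32)   # valid rows per group
m_grouped_gemm_fp8_fp8_bf16_nt_masked((lhs, lhs_scales), (rhs, rhs_scales), out,
                                      masked_m, expected_m=128)
```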


@@ -0,0 +1,81 @@
import copy
import os
import torch
from typing import Any, Dict
from ..jit import build, cpp_format, generate, Runtime
class JITTuner:
def __init__(self) -> None:
self.tuned = {}
def compile_and_tune(self, name: str, keys: Dict[str, Any], space: tuple,
includes: tuple, arg_defs: tuple, template: str, args: tuple) -> Runtime:
# NOTES: we always assume the space and template will not change
# We also assume the GPU device will not be changed
# NOTES: the function must have no accumulated side effects
keys = {k: keys[k] for k in sorted(keys.keys())}
signature = (name, f'{keys}')
if signature in self.tuned:
if os.getenv('DG_JIT_DEBUG', None):
print(f'Using cached JIT kernel {name} with keys {keys}')
return self.tuned[signature]
if os.getenv('DG_JIT_DEBUG', None):
print(f'Auto-tuning JIT kernel {name} with keys {keys}')
assert signature not in self.tuned
assert args is not None
space = (dict(), ) if len(space) == 0 else space
kernels = []
for tuned_keys in space:
assert isinstance(tuned_keys, dict)
full_keys = copy.deepcopy(keys)
full_keys.update(tuned_keys)
code = generate(includes, arg_defs, cpp_format(template, full_keys))
# Illegal build must raise errors
kernels.append((build(name, arg_defs, code), tuned_keys))
best_runtime, best_time, best_keys = None, None, None
for runtime, tuned_keys in kernels:
if len(space) > 1:
# Check kernel validity
return_code = runtime(*args)
if return_code != 0:
# Skip illegal kernels, e.g. those with insufficient shared memory capacity
if os.getenv('DG_JIT_DEBUG', None):
print(f'Illegal JIT kernel {name} with keys {keys} and tuned keys {tuned_keys}: error code {return_code}')
continue
# Measure performance with an L2 flush, and run a large GEMM kernel beforehand to reduce overhead between kernels
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda').zero_()
torch.randn((8192, 8192), dtype=torch.float, device='cuda') @ torch.randn((8192, 8192), dtype=torch.float, device='cuda')
start_event.record()
for i in range(20):
assert runtime(*args) == 0
end_event.record()
end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
else:
elapsed_time = 0
# Compare if better
if best_time is None or elapsed_time < best_time:
best_runtime, best_time, best_keys = runtime, elapsed_time, tuned_keys
if os.getenv('DG_JIT_DEBUG', None):
print(f'Tuned JIT kernel {name} with keys {keys} and tuned keys {tuned_keys} has time {elapsed_time}')
assert best_runtime is not None, f'Failed to tune JIT kernel {name} with keys {keys}'
# Cache the best runtime and return
if os.getenv('DG_JIT_DEBUG', None) or os.getenv('DG_PRINT_AUTOTUNE', None):
print(f'Best JIT kernel {name} with keys {keys} has tuned keys {best_keys} and time {best_time}')
self.tuned[signature] = best_runtime
return best_runtime
jit_tuner = JITTuner()
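
The callers in this PR always pass an empty `space`, so only one candidate is built. Below is a hedged sketch of what a non-empty space would look like; the `UNROLL`/`M` keys, the toy template, and the module path are made up for illustration, and an SM90-capable device plus the packaged JIT include directory are still required because every candidate is compiled, launched, and timed:

```
from deep_gemm.jit_kernels.tuner import jit_tuner   # assumed module path

toy_template = """
constexpr int kM = {M}, kUnroll = {UNROLL};
int acc = 0;
for (int i = 0; i < kUnroll; ++ i) acc += kM + m;
__return_code = 0;
"""
runtime = jit_tuner.compile_and_tune(
    name='toy_kernel', keys={'M': 128},
    space=({'UNROLL': 2}, {'UNROLL': 4}),    # each dict is one compiled-and-timed candidate
    includes=(), arg_defs=(('m', int), ),
    template=toy_template, args=(128, ))
```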


@@ -0,0 +1,105 @@
import torch
_num_sms = None
def set_num_sms(num_sms: int) -> None:
"""
Set the maximum SM count for all GEMM kernels to use.
Arguments:
num_sms: the desired maximum SM count for all GEMM kernels to use.
"""
global _num_sms
assert 0 < num_sms <= torch.cuda.get_device_properties(device='cuda').multi_processor_count
_num_sms = num_sms
def get_num_sms() -> int:
"""
Get the current maximum limit of SM count for all GEMM kernels to use.
If the count is never specified, the function will return the number of device SMs.
Returns:
Current maximum limit of SM count for all GEMM kernels to use.
"""
global _num_sms
if _num_sms is None:
_num_sms = torch.cuda.get_device_properties(device='cuda').multi_processor_count
return _num_sms
def cell_div(x: int, y: int) -> int:
"""
Perform ceiling division of two integers.
Arguments:
x: the dividend.
y: the divisor.
Returns:
The result of the ceiling division.
"""
return (x + y - 1) // y
def get_m_alignment_for_contiguous_layout():
"""
When we do a grouped GEMM in contiguous format, the LHS is grouped into several batches along the M axis.
Since we deal with exactly one RHS sub-matrix for each GEMM block, the batch sizes above must align well
with the GEMM block shape.
Returns:
Group-level alignment requirement for grouped contiguous layout, which is always 128.
"""
return 128
def get_tma_aligned_size(x: int, element_size: int) -> int:
"""
Global memory addresses used by TMA must be 16-byte aligned.
Since we use a column-major layout for the LHS scaling tensor,
its M axis needs to be padded to a multiple of 16 bytes.
Arguments:
x: original M-axis shape of the LHS scaling tensor.
element_size: element size of the LHS scaling tensor.
Returns:
M-axis shape of the LHS scaling tensor after padding.
"""
tma_alignment_bytes = 16
assert tma_alignment_bytes % element_size == 0
alignment = tma_alignment_bytes // element_size
return cell_div(x, alignment) * alignment
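
For example, FP32 scales have an element size of 4 bytes, so the alignment is 16 // 4 = 4 elements and an M of 130 is padded up to 132:

```
assert get_tma_aligned_size(130, 4) == 132
```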
def get_col_major_tma_aligned_tensor(x: torch.Tensor) -> torch.Tensor:
"""
Returns a TMA-aligned, transposed version of the input tensor. `torch.transpose` will be called if necessary.
If the input tensor is already in column-major layout and 16-byte aligned along the M axis
(and thus meets the requirement for the LHS scaling tensor in DeepGEMM), this function will do nothing.
Arguments:
x: usually the LHS scaling tensor in GEMM.
Returns:
The LHS scaling tensor of TMA-aligned transposed format.
"""
# NOTES: for extreme performance, you may rewrite/fuse this function in CUDA
assert x.dim() in (2, 3)
remove_dim = False
if x.dim() == 2:
x, remove_dim = x.unsqueeze(0), True
b, m, n = x.shape
aligned_m = get_tma_aligned_size(m, x.element_size())
# The last kernel gives a column-major TMA aligned layout
if x.stride(0) == aligned_m * n and x.stride(1) == 1 and x.stride(2) == aligned_m:
return x.squeeze(0) if remove_dim else x
# Normal layout requires transposing
aligned_x = torch.transpose(torch.empty((b, n, aligned_m), device=x.device, dtype=x.dtype), 1, 2)
aligned_x[:, :m, :] = x
return aligned_x.squeeze(0) if remove_dim else aligned_x


@@ -0,0 +1,154 @@
import os
import sys
import torch
import torch.distributed as dist
def bench(fn, num_warmups: int = 5, num_tests: int = 10,
high_precision: bool = False):
# Flush L2 cache with 256 MB data
torch.cuda.synchronize()
cache = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')
cache.zero_()
# Warmup
for _ in range(num_warmups):
fn()
# Add a large kernel to eliminate the CPU launch overhead
if high_precision:
x = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
y = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
x @ y
# Testing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for i in range(num_tests):
fn()
end_event.record()
torch.cuda.synchronize()
return start_event.elapsed_time(end_event) / num_tests
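
Typical usage (assumes a CUDA device; the returned value is the average time per call in milliseconds, since `torch.cuda.Event.elapsed_time` reports milliseconds):

```
import torch

x = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
y = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
avg_ms = bench(lambda: x @ y)
print(f'{avg_ms:.3f} ms per matmul')
```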
class empty_suppress:
def __enter__(self):
return self
def __exit__(self, *_):
pass
class suppress_stdout_stderr:
def __enter__(self):
self.outnull_file = open(os.devnull, 'w')
self.errnull_file = open(os.devnull, 'w')
self.old_stdout_fileno_undup = sys.stdout.fileno()
self.old_stderr_fileno_undup = sys.stderr.fileno()
self.old_stdout_fileno = os.dup(sys.stdout.fileno())
self.old_stderr_fileno = os.dup(sys.stderr.fileno())
self.old_stdout = sys.stdout
self.old_stderr = sys.stderr
os.dup2(self.outnull_file.fileno(), self.old_stdout_fileno_undup)
os.dup2(self.errnull_file.fileno(), self.old_stderr_fileno_undup)
sys.stdout = self.outnull_file
sys.stderr = self.errnull_file
return self
def __exit__(self, *_):
sys.stdout = self.old_stdout
sys.stderr = self.old_stderr
os.dup2(self.old_stdout_fileno, self.old_stdout_fileno_undup)
os.dup2(self.old_stderr_fileno, self.old_stderr_fileno_undup)
os.close(self.old_stdout_fileno)
os.close(self.old_stderr_fileno)
self.outnull_file.close()
self.errnull_file.close()
def bench_kineto(fn, kernel_names, num_tests: int = 30, suppress_kineto_output: bool = False,
trace_path: str = None, barrier_comm_profiling: bool = False, flush_l2: bool = False):
# Conflict with Nsight Systems
using_nsys = os.environ.get('DG_NSYS_PROFILING', False)
# For some auto-tuning kernels with prints
fn()
# Profile
suppress = suppress_stdout_stderr if suppress_kineto_output and not using_nsys else empty_suppress
with suppress():
schedule = torch.profiler.schedule(wait=0, warmup=1, active=1, repeat=1) if not using_nsys else None
profiler = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA], schedule=schedule) if not using_nsys else empty_suppress()
with profiler:
for i in range(2):
# NOTES: use a large kernel and a barrier to eliminate the unbalanced CPU launch overhead
if barrier_comm_profiling:
lhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
rhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
lhs @ rhs
dist.all_reduce(torch.ones(1, dtype=torch.float, device='cuda'))
for _ in range(num_tests):
if flush_l2:
torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda').zero_()
fn()
if not using_nsys:
profiler.step()
# Return 1 if using Nsight Systems
if using_nsys:
return 1
# Parse the profiling table
assert isinstance(kernel_names, str) or isinstance(kernel_names, tuple)
is_tupled = isinstance(kernel_names, tuple)
prof_lines = profiler.key_averages().table(sort_by='cuda_time_total', max_name_column_width=100).split('\n')
kernel_names = (kernel_names, ) if isinstance(kernel_names, str) else kernel_names
assert all([isinstance(name, str) for name in kernel_names])
for name in kernel_names:
assert sum([name in line for line in prof_lines]) == 1, f'Kernel {name} must appear exactly once in the profiling table'
# Save chrome traces
if trace_path is not None:
profiler.export_chrome_trace(trace_path)
# Return average kernel times
units = {'ms': 1e3, 'us': 1e6}
kernel_times = []
for name in kernel_names:
for line in prof_lines:
if name in line:
time_str = line.split()[-2]
for unit, scale in units.items():
if unit in time_str:
kernel_times.append(float(time_str.replace(unit, '')) / scale)
break
break
return tuple(kernel_times) if is_tupled else kernel_times[0]
def calc_diff(x, y):
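# Symmetrised relative-difference metric: returns 0 for identical tensors and grows as x and y diverge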
x, y = x.double(), y.double()
denominator = (x * x + y * y).sum()
sim = 2 * (x * y).sum() / denominator
return 1 - sim
def count_bytes(tensors):
total = 0
for t in tensors:
if isinstance(t, tuple):
total += count_bytes(t)
else:
total += t.numel() * t.element_size()
return total


@@ -0,0 +1,444 @@
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunknown-attributes"
#pragma once
#include <cutlass/arch/barrier.h>
#include <cutlass/arch/reg_reconfig.h>
#include <cute/arch/cluster_sm90.hpp>
#include <cute/arch/copy_sm90_desc.hpp>
#include <cute/arch/copy_sm90_tma.hpp>
#include "mma_utils.cuh"
#include "scheduler.cuh"
#include "tma_utils.cuh"
#include "utils.cuh"
namespace deep_gemm {
enum class Layout {
RowMajor,
ColMajor
};
template <uint32_t kNumTMAThreads, uint32_t kNumMathThreadsPerGroup>
__device__ __host__ constexpr int get_num_threads_per_sm(int block_m) {
DG_STATIC_ASSERT(kNumMathThreadsPerGroup == 128, "Only support 128 threads per math group");
return (block_m == 64 ? 1 : 2) * kNumMathThreadsPerGroup + kNumTMAThreads;
}
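// For reference: with kNumTMAThreads = 128 and kNumMathThreadsPerGroup = 128 (the values used by `Gemm::run` below),
// BLOCK_M == 64 yields 256 threads (1 math warp-group + 1 TMA warp-group) and BLOCK_M == 128 yields 384 threads.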
template <uint32_t SHAPE_N, uint32_t SHAPE_K,
uint32_t BLOCK_M, uint32_t BLOCK_N, uint32_t BLOCK_K,
uint32_t kNumGroups, uint32_t kNumStages,
uint32_t kNumTMAThreads, uint32_t kNumMathThreadsPerGroup,
uint32_t kNumTMAMulticast,
GemmType kGemmType>
__global__ void __launch_bounds__(get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M), 1)
fp8_gemm_kernel(__nv_bfloat16* gmem_d, float* scales_b, int* grouped_layout,
uint32_t shape_m,
const __grid_constant__ CUtensorMap tensor_map_a,
const __grid_constant__ CUtensorMap tensor_map_b,
const __grid_constant__ CUtensorMap tensor_map_scales_a,
const __grid_constant__ CUtensorMap tensor_map_d) {
#if (defined(__CUDA_ARCH__) and (__CUDA_ARCH__ >= 900)) or defined(__CLION_IDE__)
// Scaling checks
DG_STATIC_ASSERT(BLOCK_K == 128, "Only support per-128-channel FP8 scaling");
DG_STATIC_ASSERT(cell_div(BLOCK_N, BLOCK_K) == 1, "Too many B scales in a single block");
// Types
using WGMMA = typename FP8MMASelector<BLOCK_N>::type;
using Barrier = cutlass::arch::ClusterTransactionBarrier;
// Shared memory
static constexpr uint32_t SMEM_D_SIZE = BLOCK_M * BLOCK_N * sizeof(__nv_bfloat16);
static constexpr uint32_t SMEM_A_SIZE_PER_STAGE = BLOCK_M * BLOCK_K * sizeof(__nv_fp8_e4m3);
static constexpr uint32_t SMEM_B_SIZE_PER_STAGE = BLOCK_N * BLOCK_K * sizeof(__nv_fp8_e4m3);
static constexpr uint32_t SMEM_SCALES_A_SIZE_PER_STAGE = BLOCK_M * sizeof(float);
static constexpr uint32_t SHAPE_K_SCALES = cell_div(SHAPE_K, BLOCK_K);
static constexpr int kMustUseUniformedScaleB = (BLOCK_K % BLOCK_N == 0);
// Configs
constexpr uint32_t kFullKOfAllStages = kNumStages * BLOCK_K;
constexpr uint32_t kNumThreads = get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M);
constexpr uint32_t kNumMathThreads = kNumThreads - kNumTMAThreads;
constexpr uint32_t kNumIterations = cell_div(SHAPE_K, kFullKOfAllStages);
const uint32_t warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0);
const uint32_t lane_idx = get_lane_id();
// Prefetch TMA descriptors at very beginning
if (threadIdx.x == kNumMathThreads) {
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_a));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_b));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_scales_a));
cute::prefetch_tma_descriptor(reinterpret_cast<cute::TmaDescriptor const*>(&tensor_map_d));
}
__syncwarp();
// Align to 1024 bytes for swizzle-128B
extern __shared__ __align__(1024) uint8_t smem_buffer[];
DG_STATIC_ASSERT(SMEM_D_SIZE % 1024 == 0, "Shared memory of A/B must be aligned to 1024 bytes");
// Data on shared memory
auto smem_d = reinterpret_cast<__nv_bfloat16*>(smem_buffer);
__nv_fp8_e4m3* smem_a[kNumStages];
__nv_fp8_e4m3* smem_b[kNumStages];
float* smem_scales_a[kNumStages];
float* smem_scales_b;
// TMA Barrier for both divisible and non-divisible cases
Barrier* full_barriers[kNumStages];
Barrier* empty_barriers[kNumStages];
// Fill shared memory pointers
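// Layout of dynamic shared memory (the same components summed by `get_smem_size` on the Python side):
//   [D tile] [A tiles x kNumStages] [B tiles x kNumStages] [A scales x kNumStages] [B scales] [barriers]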
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
smem_a[i] = reinterpret_cast<__nv_fp8_e4m3*>(smem_buffer + SMEM_D_SIZE + i * SMEM_A_SIZE_PER_STAGE);
smem_b[i] = reinterpret_cast<__nv_fp8_e4m3*>(smem_buffer + SMEM_D_SIZE + kNumStages * SMEM_A_SIZE_PER_STAGE + i * SMEM_B_SIZE_PER_STAGE);
smem_scales_a[i] = reinterpret_cast<float*>(smem_buffer + SMEM_D_SIZE + kNumStages * (SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE) + i * SMEM_SCALES_A_SIZE_PER_STAGE);
}
smem_scales_b = reinterpret_cast<float*>(smem_buffer + SMEM_D_SIZE + kNumStages * (SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE + SMEM_SCALES_A_SIZE_PER_STAGE));
// Fill barriers
DG_STATIC_ASSERT(sizeof(Barrier) % sizeof(float) == 0, "Misaligned barriers");
DG_STATIC_ASSERT(not kMustUseUniformedScaleB or SHAPE_K_SCALES % (sizeof(Barrier) / sizeof(float)) == 0, "Misaligned barriers");
auto barrier_start_ptr = reinterpret_cast<Barrier*>(smem_scales_b + SHAPE_K_SCALES * (kMustUseUniformedScaleB ? 1 : 2));
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
full_barriers[i] = barrier_start_ptr + i;
empty_barriers[i] = barrier_start_ptr + kNumStages + i;
}
// Initialize barriers
DG_STATIC_ASSERT(kNumTMAMulticast <= 32, "Too many TMA multicasts");
if (threadIdx.x == kNumMathThreads) {
#pragma unroll
for (int i = 0; i < kNumStages; ++ i) {
full_barriers[i]->init(1);
empty_barriers[i]->init(kNumTMAMulticast * kNumMathThreads / 32);
}
// Make initialized barrier visible in async proxy
cutlass::arch::fence_view_async_shared();
(kNumTMAMulticast > 1) ? cutlass::arch::fence_barrier_init() : void();
}
// Synchronize all threads to make barrier visible in normal memory model
(kNumTMAMulticast > 1) ? cute::cluster_sync() : __syncthreads();
// For pipeline unrolling
struct DivisibleK {};
struct NotDivisibleK {};
auto launch_k_iterations = [](const auto& func) {
if constexpr (SHAPE_K % kFullKOfAllStages == 0) {
for (int k_iter = 0; k_iter < kNumIterations; ++ k_iter)
func(k_iter, DivisibleK{});
} else {
for (int k_iter = 0; k_iter < kNumIterations - 1; ++ k_iter)
func(k_iter, DivisibleK{});
func(kNumIterations - 1, NotDivisibleK{});
}
};
// Register reconfigurations
constexpr int kNumTMARegisters = 40;
constexpr int kNumMathRegisters = 232;
// Block scheduler
uint32_t m_block_idx, n_block_idx;
auto scheduler = Scheduler<kGemmType, SHAPE_N, BLOCK_M, BLOCK_N, kNumGroups, kNumTMAMulticast>(shape_m, grouped_layout);
if (threadIdx.x >= kNumMathThreads) {
// TMA warp-group for loading data
cutlass::arch::warpgroup_reg_dealloc<kNumTMARegisters>();
// NOTES: only one thread (or warp) will be used
if (threadIdx.x == kNumMathThreads) {
// Persistently schedule over blocks
while (scheduler.get_next_block(m_block_idx, n_block_idx)) {
launch_k_iterations([&](int k_iter, auto type) {
constexpr bool kHasDivisibleStages = std::is_same_v<decltype(type), DivisibleK>;
constexpr int kNumInnerStages = kHasDivisibleStages ? kNumStages : (SHAPE_K % kFullKOfAllStages) / BLOCK_K;
DG_STATIC_ASSERT(kNumInnerStages != 0, "Invalid number of inner stages");
#pragma unroll
for (uint32_t s = 0; s < kNumInnerStages; ++ s) {
// Wait consumer release
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter + 1) & 1);
// Issue TMA A with broadcasting
auto& full_barrier = *full_barriers[s];
int k_idx = k_iter * kFullKOfAllStages + s * BLOCK_K;
tma_copy<kNumTMAMulticast>(&tensor_map_a, reinterpret_cast<uint64_t*>(&full_barrier),
smem_a[s], k_idx, scheduler.get_global_idx(shape_m, BLOCK_M, m_block_idx));
tma_copy<kNumTMAMulticast>(&tensor_map_scales_a, reinterpret_cast<uint64_t*>(&full_barrier),
smem_scales_a[s], m_block_idx * BLOCK_M,
scheduler.get_global_idx(SHAPE_K_SCALES, 1, k_idx / BLOCK_K));
// Issue TMA B without broadcasting
tma_copy(&tensor_map_b, reinterpret_cast<uint64_t*>(&full_barrier),
smem_b[s], k_idx, scheduler.get_global_idx<false>(SHAPE_N, BLOCK_N, n_block_idx, m_block_idx));
full_barrier.arrive_and_expect_tx(SMEM_A_SIZE_PER_STAGE + SMEM_B_SIZE_PER_STAGE + SMEM_SCALES_A_SIZE_PER_STAGE);
}
// Wait unaligned cases
#pragma unroll
for (uint32_t s = kNumInnerStages; s < kNumStages; ++ s) {
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter + 1) & 1);
full_barriers[s]->arrive();
}
});
}
// To safely deconstruct distributed shared barriers, we need another round of empty waits
if constexpr (kNumTMAMulticast > 1) {
#pragma unroll
for (uint32_t s = 0; s < kNumStages; ++ s)
empty_barriers[s]->wait((scheduler.current_iter * kNumIterations + 1) & 1);
}
}
} else {
// Math warp-groups for WGMMA
cutlass::arch::warpgroup_reg_alloc<kNumMathRegisters>();
// NOTES: use `__shfl_sync` to encourage NVCC to use unified registers
const auto math_wg_idx = __shfl_sync(0xffffffff, threadIdx.x / kNumMathThreadsPerGroup, 0);
const auto r_0 = warp_idx * 16 + lane_idx / 4, r_1 = r_0 + 8;
// Persistently schedule over blocks
while (scheduler.get_next_block(m_block_idx, n_block_idx)) {
// Decide the number of scales B to load
DG_STATIC_ASSERT(SHAPE_N % 8 == 0, "Invalid shape N");
uint32_t num_former_iters = BLOCK_N / 8, num_full_iters = num_former_iters;
if constexpr (not kMustUseUniformedScaleB) {
num_former_iters = min(BLOCK_N, BLOCK_K - n_block_idx * BLOCK_N % BLOCK_K) / 8;
num_full_iters = min(SHAPE_N - n_block_idx * BLOCK_N, BLOCK_N) / 8;
}
uint32_t num_scales_b = SHAPE_K_SCALES * (num_former_iters >= num_full_iters ? 1 : 2);
// Load B scales with math warp-groups
// NOTES: except for the first warp, we want to overlap loading B scales with the TMA stores between tasks
if (threadIdx.x >= 32) {
auto num_previous_lines = scheduler.get_global_idx<false>(cell_div(SHAPE_N, BLOCK_K), 0, 0, m_block_idx);
auto local_scales_b = scales_b + (num_previous_lines + ((n_block_idx * BLOCK_N) / BLOCK_K)) * SHAPE_K_SCALES;
#pragma unroll
for (uint32_t i = threadIdx.x - 32; i < num_scales_b; i += kNumMathThreads - 32)
st_shared(smem_scales_b + i, __ldg(local_scales_b + i));
}
cutlass::arch::NamedBarrier(kNumMathThreads).sync();
// Accumulation for WGMMA or CUDA promotion
float accum[WGMMA::kNumAccum], final_accum[WGMMA::kNumAccum] = {0};
// Empty barrier arrival
auto empty_barrier_arrive = [&](int s) {
if constexpr (kNumTMAMulticast == 1) {
lane_idx == 0 ? empty_barriers[s]->arrive() : void();
} else {
lane_idx < kNumTMAMulticast ? empty_barriers[s]->arrive(lane_idx) : void();
}
};
// Launch MMAs
launch_k_iterations([&](int k_iter, auto type) {
constexpr bool kHasDivisibleStages = std::is_same_v<decltype(type), DivisibleK>;
constexpr int kNumInnerStages = kHasDivisibleStages ? kNumStages : (SHAPE_K % kFullKOfAllStages) / BLOCK_K;
DG_STATIC_ASSERT(kNumInnerStages != 0, "Invalid number of inner stages");
#pragma unroll
for (int s = 0; s < kNumInnerStages; ++ s) {
// Read B scales
float scale_b_0 = ld_shared(smem_scales_b + k_iter * kNumStages + s), scale_b_1;
// NOTES: even if some blocks do not need to read the second row, we still load one to align with other blocks
if constexpr (not kMustUseUniformedScaleB)
scale_b_1 = ld_shared(smem_scales_b + k_iter * kNumStages + s + SHAPE_K_SCALES);
// Wait TMA arrivals
full_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter) & 1);
// Read A scales
// NOTES: all shared memory reads must happen before `warpgroup_arrive` to avoid the next scheduled block polluting the results
auto scale_a_0 = ld_shared(smem_scales_a[s] + r_0), scale_a_1 = ld_shared(smem_scales_a[s] + r_1);
// Commit WGMMA instructions
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum; ++ i)
warpgroup_fence_operand(accum[i]);
warpgroup_arrive();
#pragma unroll
for (int k = 0; k < BLOCK_K / WGMMA::K; ++ k) {
auto desc_a = make_smem_desc(smem_a[s] + math_wg_idx * WGMMA::M * BLOCK_K + k * WGMMA::K, 1);
auto desc_b = make_smem_desc(smem_b[s] + k * WGMMA::K, 1);
WGMMA::wgmma(desc_a, desc_b, accum, k);
}
warpgroup_commit_batch();
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum; ++ i)
warpgroup_fence_operand(accum[i]);
warpgroup_wait<0>();
// Notify barrier arrival
empty_barrier_arrive(s);
// Promote with scales
float scale_0_0 = scale_a_0 * scale_b_0, scale_1_0 = scale_a_1 * scale_b_0;
float scale_0_1, scale_1_1;
if constexpr (not kMustUseUniformedScaleB)
scale_0_1 = scale_a_0 * scale_b_1, scale_1_1 = scale_a_1 * scale_b_1;
#pragma unroll
for (int i = 0; i < WGMMA::kNumAccum / 4; ++ i) {
bool predicate = kMustUseUniformedScaleB or i < num_former_iters;
final_accum[i * 4 + 0] += (predicate ? scale_0_0 : scale_0_1) * accum[i * 4 + 0];
final_accum[i * 4 + 1] += (predicate ? scale_0_0 : scale_0_1) * accum[i * 4 + 1];
final_accum[i * 4 + 2] += (predicate ? scale_1_0 : scale_1_1) * accum[i * 4 + 2];
final_accum[i * 4 + 3] += (predicate ? scale_1_0 : scale_1_1) * accum[i * 4 + 3];
}
}
// Wait unaligned cases
#pragma unroll
for (uint32_t s = kNumInnerStages; s < kNumStages; ++ s) {
full_barriers[s]->wait((scheduler.current_iter * kNumIterations + k_iter) & 1);
empty_barrier_arrive(s);
}
});
// Write back to shared memory using STSM
DG_STATIC_ASSERT(WGMMA::kNumAccum % 4 == 0, "Invalid STSM x2 vectorization");
#pragma unroll
for (auto i = 0; i < WGMMA::kNumAccum / 8; ++ i) {
SM90_U32x4_STSM_N<nv_bfloat162>::copy(
__float22bfloat162_rn({final_accum[i * 8 + 0], final_accum[i * 8 + 1]}),
__float22bfloat162_rn({final_accum[i * 8 + 2], final_accum[i * 8 + 3]}),
__float22bfloat162_rn({final_accum[i * 8 + 4], final_accum[i * 8 + 5]}),
__float22bfloat162_rn({final_accum[i * 8 + 6], final_accum[i * 8 + 7]}),
smem_d + (warp_idx * 16 + lane_idx % 16) * BLOCK_N + i * 16 + 8 * (lane_idx / 16)
);
}
if constexpr (WGMMA::kNumAccum % 8 != 0) {
SM90_U32x2_STSM_N<nv_bfloat162>::copy(
__float22bfloat162_rn({final_accum[WGMMA::kNumAccum / 8 * 8 + 0], final_accum[WGMMA::kNumAccum / 8 * 8 + 1]}),
__float22bfloat162_rn({final_accum[WGMMA::kNumAccum / 8 * 8 + 2], final_accum[WGMMA::kNumAccum / 8 * 8 + 3]}),
smem_d + (warp_idx * 16 + lane_idx % 16) * BLOCK_N + WGMMA::kNumAccum / 8 * 16
);
}
cute::tma_store_fence();
cutlass::arch::NamedBarrier(kNumMathThreads).sync();
// Use TMA store to write back to global memory
if (threadIdx.x == 0) {
cute::SM90_TMA_STORE_2D::copy(&tensor_map_d, smem_d, n_block_idx * BLOCK_N,
scheduler.get_global_idx(shape_m, BLOCK_M, m_block_idx));
cute::tma_store_arrive();
cute::tma_store_wait<0>();
}
__syncwarp();
}
}
#else
if (blockIdx.x == 0 and threadIdx.x == 0)
DG_DEVICE_ASSERT(false and "This kernel only supports sm_90a");
#endif
}
template <uint32_t SHAPE_N, uint32_t SHAPE_K,
uint32_t BLOCK_M, uint32_t BLOCK_N, uint32_t BLOCK_K,
uint32_t kNumGroups, uint32_t kNumStages,
uint32_t kNumTMAMulticast,
GemmType kGemmType>
class Gemm {
private:
using Barrier = cuda::barrier<cuda::thread_scope_block>;
public:
Gemm() = default;
static void run(__nv_bfloat16* gmem_d, float* scales_b, int* grouped_layout,
uint32_t shape_m,
const CUtensorMap& tma_a_desc,
const CUtensorMap& tma_b_desc,
const CUtensorMap& tma_scales_a_desc,
const CUtensorMap& tma_d_desc,
cudaStream_t stream,
int num_sms, uint32_t smem_size) {
// NOTES: we must use 4 warps to do TMA, because `setmaxnreg.aligned` requires 4 warps
constexpr uint32_t kNumTMAThreads = 128;
constexpr uint32_t kNumMathThreadsPerGroup = 128;
auto kernel = fp8_gemm_kernel<SHAPE_N, SHAPE_K, BLOCK_M, BLOCK_N, BLOCK_K,
kNumGroups, kNumStages, kNumTMAThreads, kNumMathThreadsPerGroup,
kNumTMAMulticast, kGemmType>;
DG_HOST_ASSERT(cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess);
// Cluster launch
cudaLaunchConfig_t config;
config.gridDim = num_sms;
config.blockDim = get_num_threads_per_sm<kNumTMAThreads, kNumMathThreadsPerGroup>(BLOCK_M);
config.dynamicSmemBytes = smem_size;
config.stream = stream;
// Clusters for TMA multicast
// NOTES: `>= 4` cluster size will cause performance degradation
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim = {kNumTMAMulticast, 1, 1};
config.attrs = &attr;
config.numAttrs = 1;
// Launch
auto status = cudaLaunchKernelEx(&config, kernel,
gmem_d, scales_b, grouped_layout,
shape_m,
tma_a_desc, tma_b_desc, tma_scales_a_desc, tma_d_desc);
DG_HOST_ASSERT(status == cudaSuccess);
}
template <typename T>
static CUtensorMap make_2d_tma_a_desc(T* global_address, uint32_t shape_m) {
return make_2d_tma_desc(global_address, Layout::RowMajor,
shape_m * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), SHAPE_K, BLOCK_M, BLOCK_K);
}
template <typename T>
static CUtensorMap make_2d_tma_b_desc(T* global_address) {
return make_2d_tma_desc(global_address, Layout::ColMajor,
SHAPE_K, SHAPE_N * (kGemmType != GemmType::Normal ? kNumGroups : 1), BLOCK_K, BLOCK_N);
}
template <typename T>
static CUtensorMap make_2d_tma_d_desc(T* global_address, uint32_t shape_m) {
return make_2d_tma_desc(global_address, Layout::RowMajor,
shape_m * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), SHAPE_N, BLOCK_M, BLOCK_N,
CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE);
}
template <typename T>
static CUtensorMap make_2d_tma_scales_a_desc(T* global_address, uint32_t shape_m) {
// Make TMA aligned to 16 bytes
constexpr uint32_t kAlignment = 16 / sizeof(T);
shape_m = cell_div(shape_m, kAlignment) * kAlignment;
return make_2d_tma_desc(global_address, Layout::ColMajor,
shape_m, cell_div(SHAPE_K, BLOCK_K) * (kGemmType == GemmType::GroupedMasked ? kNumGroups : 1), BLOCK_M, 1,
CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE);
}
template <typename T>
static CUtensorMap make_2d_tma_desc(
T* global_address, Layout layout,
uint32_t gmem_rows, uint32_t gmem_cols,
uint32_t smem_rows, uint32_t smem_cols,
CUtensorMapSwizzle swizzle_type = CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_128B) {
if (layout == Layout::RowMajor) {
uint64_t gmem_dim[2] = {gmem_cols, gmem_rows};
uint32_t smem_dim[2] = {smem_cols, smem_rows};
return make_2d_tma_copy_desc(global_address, gmem_dim, gmem_cols * sizeof(T), smem_dim, swizzle_type);
} else {
uint64_t gmem_dim[2] = {gmem_rows, gmem_cols};
uint32_t smem_dim[2] = {smem_rows, smem_cols};
return make_2d_tma_copy_desc(global_address, gmem_dim, gmem_rows * sizeof(T), smem_dim, swizzle_type);
}
}
};
}; // namespace deep_gemm
#pragma clang diagnostic pop


@@ -0,0 +1,885 @@
#pragma once
#include <cuda.h>
#include "utils.cuh"
namespace deep_gemm {
struct SM90_64x16x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %10, 0;\n"
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7},"
" %8,"
" %9,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 16;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x24x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %14, 0;\n"
"wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11},"
" %12,"
" %13,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 24;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x32x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16,"
" %17,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 32;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x40x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %22, 0;\n"
"wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19},"
" %20,"
" %21,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 40;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x48x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %26, 0;\n"
"wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23},"
" %24,"
" %25,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 48;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x56x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %30, 0;\n"
"wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27}, "
" %28,"
" %29,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 56;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x64x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %34, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31}, "
" %32,"
" %33,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 64;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x72x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %38, 0;\n"
"wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35}, "
" %36,"
" %37,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 72;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x80x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %42, 0;\n"
"wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39}, "
" %40,"
" %41,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 80;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x88x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %46, 0;\n"
"wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43}, "
" %44,"
" %45,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 88;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x96x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %50, 0;\n"
"wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47}, "
" %48,"
" %49,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 96;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x104x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %54, 0;\n"
"wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51}, "
" %52,"
" %53,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 104;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x112x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %58, 0;\n"
"wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55}, "
" %56,"
" %57,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 112;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x120x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %62, 0;\n"
"wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59}, "
" %60,"
" %61,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 120;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x128x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59, float& d60, float& d61, float& d62, float& d63,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %66, 0;\n"
"wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59, %60, %61, %62, %63}, "
" %64,"
" %65,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59], d[60], d[61], d[62], d[63],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 128;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x192x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43, float& d44, float& d45, float& d46, float& d47,
float& d48, float& d49, float& d50, float& d51, float& d52, float& d53, float& d54, float& d55,
float& d56, float& d57, float& d58, float& d59, float& d60, float& d61, float& d62, float& d63,
float& d64, float& d65, float& d66, float& d67, float& d68, float& d69, float& d70, float& d71,
float& d72, float& d73, float& d74, float& d75, float& d76, float& d77, float& d78, float& d79,
float& d80, float& d81, float& d82, float& d83, float& d84, float& d85, float& d86, float& d87,
float& d88, float& d89, float& d90, float& d91, float& d92, float& d93, float& d94, float& d95,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %98, 0;\n"
"wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43, %44, %45, %46, %47, "
" %48, %49, %50, %51, %52, %53, %54, %55, "
" %56, %57, %58, %59, %60, %61, %62, %63, "
" %64, %65, %66, %67, %68, %69, %70, %71, "
" %72, %73, %74, %75, %76, %77, %78, %79, "
" %80, %81, %82, %83, %84, %85, %86, %87, "
" %88, %89, %90, %91, %92, %93, %94, %95}, "
" %96,"
" %97,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
"+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
"+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
"+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
"+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
"+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
d[56], d[57], d[58], d[59], d[60], d[61], d[62], d[63],
d[64], d[65], d[66], d[67], d[68], d[69], d[70], d[71],
d[72], d[73], d[74], d[75], d[76], d[77], d[78], d[79],
d[80], d[81], d[82], d[83], d[84], d[85], d[86], d[87],
d[88], d[89], d[90], d[91], d[92], d[93], d[94], d[95],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 192;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
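// Note on sizing: every m64nNk32 FP8 WGMMA above distributes its f32 D operand
// across the 128 threads of a warpgroup, so each thread holds M * N / 128
// accumulator registers (kNumAccum), e.g. 96 per thread for the 64x192 tile.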
template <typename dtype_t>
struct SM90_U32x2_STSM_N {
__device__ __forceinline__ static void
copy(dtype_t src_0, dtype_t src_1, void* smem_dst) {
const uint32_t src[2] = {*reinterpret_cast<uint32_t*>(&src_0), *reinterpret_cast<uint32_t*>(&src_1)};
asm volatile("stmatrix.sync.aligned.x2.m8n8.shared.b16 [%0], {%1, %2};\n"
:: "l"(smem_dst), "r"(src[0]), "r"(src[1]));
}
};
template <typename dtype_t>
struct SM90_U32x4_STSM_N {
__device__ __forceinline__ static void
copy(dtype_t src_0, dtype_t src_1, dtype_t src_2, dtype_t src_3, void* smem_dst) {
const uint32_t src[4] = {*reinterpret_cast<uint32_t*>(&src_0), *reinterpret_cast<uint32_t*>(&src_1),
*reinterpret_cast<uint32_t*>(&src_2), *reinterpret_cast<uint32_t*>(&src_3)};
asm volatile("stmatrix.sync.aligned.x4.m8n8.shared.b16 [%0], {%1, %2, %3, %4};\n"
:: "l"(smem_dst), "r"(src[0]), "r"(src[1]), "r"(src[2]), "r"(src[3]));
}
};
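// Illustrative use of the stmatrix helpers (a hedged sketch; the register values
// and shared-memory address are hypothetical). Each call stores two or four 8x8
// matrices of 16-bit elements from per-thread 32-bit registers:
//   __nv_bfloat162 r0 = ..., r1 = ...;   // packed bf16 pairs from converted accumulators
//   SM90_U32x2_STSM_N<__nv_bfloat162>::copy(r0, r1, smem_dst);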
__device__ void warpgroup_arrive() {
asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");
}
__device__ void warpgroup_commit_batch() {
asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory");
}
__device__ void warpgroup_fence_operand(float& reg) {
asm volatile("" : "+f"(reg) :: "memory");
}
__forceinline__ __device__ uint32_t get_lane_id() {
uint32_t lane_id;
asm("mov.u32 %0, %laneid;" : "=r"(lane_id));
return lane_id;
}
__device__ __forceinline__ uint32_t ld_shared(const uint32_t* __restrict__ ptr) {
uint32_t ret;
asm volatile("ld.shared.u32 %0, [%1];" : "=r"(ret) : "l"(ptr));
return ret;
}
__device__ __forceinline__ int4 ld_shared(const int4* __restrict__ ptr) {
int4 ret;
asm volatile("ld.shared.v4.s32 {%0, %1, %2, %3}, [%4];" : "=r"(ret.x), "=r"(ret.y), "=r"(ret.z), "=r"(ret.w) : "l"(ptr));
return ret;
}
__device__ __forceinline__ float ld_shared(const float* __restrict__ ptr) {
float ret;
asm volatile("ld.shared.f32 %0, [%1];" : "=f"(ret) : "l"(ptr));
return ret;
}
__device__ __forceinline__ void st_shared(const float* ptr, float val) {
asm volatile("st.shared.f32 [%0], %1;" :: "l"(ptr), "f"(val));
}
__device__ __forceinline__ void st_shared(const uint32_t* ptr, uint32_t val) {
asm volatile("st.shared.u32 [%0], %1;" :: "l"(ptr), "r"(val));
}
template <int N>
__device__ void warpgroup_wait() {
DG_STATIC_ASSERT(N >= 0 and N <= 7, "WGMMA wait: N must be in range [0, 7]");
asm volatile("wgmma.wait_group.sync.aligned %0;\n" :: "n"(N) : "memory");
}
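// Typical issue sequence for the SS WGMMA structs above (a hedged sketch; the
// surrounding software pipeline is omitted, and `WGMMA`, `desc_a`, `desc_b`,
// `accum` and `scale_d` are placeholders):
//   warpgroup_arrive();                            // wgmma.fence: make operands visible
//   WGMMA::wgmma(desc_a, desc_b, accum, scale_d);
//   warpgroup_commit_batch();                      // close the current wgmma group
//   warpgroup_wait<0>();                           // block until all committed groups finish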
union GmmaDescriptor {
__host__ __device__ constexpr GmmaDescriptor() noexcept: desc_(0) {}
__host__ __device__ constexpr GmmaDescriptor(uint64_t desc) noexcept: desc_(desc) {}
__host__ __device__ constexpr GmmaDescriptor(GmmaDescriptor const &t) noexcept: desc_(t.desc_) {}
__host__ __device__ constexpr GmmaDescriptor(GmmaDescriptor &&t) noexcept: desc_(t.desc_) {}
__host__ __device__ constexpr GmmaDescriptor &operator=(GmmaDescriptor const &t) noexcept {
desc_ = t.desc_;
return *this;
}
__host__ __device__ constexpr GmmaDescriptor &operator=(GmmaDescriptor &&t) noexcept {
desc_ = t.desc_;
return *this;
}
uint64_t desc_;
uint32_t reg32_[2];
uint16_t reg16_[4];
struct {
uint16_t start_address_: 14, : 2;
uint16_t leading_byte_offset_: 14, : 2;
uint16_t stride_byte_offset_: 14, : 2;
uint8_t : 1, base_offset_: 3, : 4;
uint8_t : 6, layout_type_: 2;
} bitfield;
// Decay to a `uint64_t`
__host__ __device__ constexpr operator uint64_t() const noexcept { return desc_; }
};
template <class PointerType>
__device__ GmmaDescriptor make_smem_desc(PointerType smem_ptr, int layout_type,
int leading_byte_offset = 0,
int stride_byte_offset = 1024) {
GmmaDescriptor desc;
auto uint_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
desc.bitfield.start_address_ = uint_ptr >> 4;
desc.bitfield.layout_type_ = layout_type;
desc.bitfield.leading_byte_offset_ = leading_byte_offset >> 4;
desc.bitfield.stride_byte_offset_ = stride_byte_offset >> 4;
desc.bitfield.base_offset_ = 0;
return desc;
}
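// Illustrative descriptor construction (hedged; `smem_a` and the layout code are
// assumptions that must match how the tile was written to shared memory; in the
// GMMA encoding a layout_type of 1 commonly denotes the 128B swizzle):
//   GmmaDescriptor desc_a = make_smem_desc(smem_a, /*layout_type=*/1);
// The >> 4 shifts above store addresses and offsets in 16-byte granularity, as
// required by the descriptor format.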
template <int N>
struct FP8MMASelector {
static constexpr auto select_type() {
if constexpr (N == 16) return SM90_64x16x32_F32E4M3E4M3_SS();
if constexpr (N == 24) return SM90_64x24x32_F32E4M3E4M3_SS();
if constexpr (N == 32) return SM90_64x32x32_F32E4M3E4M3_SS();
if constexpr (N == 40) return SM90_64x40x32_F32E4M3E4M3_SS();
if constexpr (N == 48) return SM90_64x48x32_F32E4M3E4M3_SS();
if constexpr (N == 56) return SM90_64x56x32_F32E4M3E4M3_SS();
if constexpr (N == 64) return SM90_64x64x32_F32E4M3E4M3_SS();
if constexpr (N == 72) return SM90_64x72x32_F32E4M3E4M3_SS();
if constexpr (N == 80) return SM90_64x80x32_F32E4M3E4M3_SS();
if constexpr (N == 88) return SM90_64x88x32_F32E4M3E4M3_SS();
if constexpr (N == 96) return SM90_64x96x32_F32E4M3E4M3_SS();
if constexpr (N == 104) return SM90_64x104x32_F32E4M3E4M3_SS();
if constexpr (N == 112) return SM90_64x112x32_F32E4M3E4M3_SS();
if constexpr (N == 120) return SM90_64x120x32_F32E4M3E4M3_SS();
if constexpr (N == 128) return SM90_64x128x32_F32E4M3E4M3_SS();
if constexpr (N == 192) return SM90_64x192x32_F32E4M3E4M3_SS();
}
using type = decltype(select_type());
};
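// Illustrative selection (a hedged sketch): the kernel's N tile size picks the
// matching instruction shape, and kNumAccum sizes the per-thread accumulator array:
//   using MMA = typename FP8MMASelector<128>::type;  // SM90_64x128x32_F32E4M3E4M3_SS
//   // float accum[MMA::kNumAccum];                  // 64 floats per thread for N = 128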
} // namespace deep_gemm


@@ -0,0 +1,103 @@
#include "utils.cuh"
namespace deep_gemm {
enum class GemmType {
Normal,
GroupedContiguous,
GroupedMasked
};
#pragma clang diagnostic push
#pragma ide diagnostic ignored "cppcoreguidelines-pro-type-member-init"
template <GemmType kGemmType,
uint32_t SHAPE_N, uint32_t BLOCK_M, uint32_t BLOCK_N,
uint32_t kNumGroups, uint32_t kNumTMAMulticast,
uint32_t kNumNBlocks = cell_div(SHAPE_N, BLOCK_N),
uint32_t kNumNBlocksPerGroup = 16>
struct Scheduler {
int current_iter = -1;
uint32_t num_aligned_m_blocks;
// For normal GEMM
// May be unused in the masked grouped GEMM
uint32_t num_blocks;
// For grouped GEMM
int* grouped_layout;
// Only used for masked layout
uint32_t curr_group_idx, curr_cumsum;
__device__ __forceinline__ explicit Scheduler(const uint32_t shape_m,
int* grouped_layout = nullptr) {
num_aligned_m_blocks = cell_div(shape_m, BLOCK_M);
if constexpr (kGemmType == GemmType::Normal) {
num_blocks = num_aligned_m_blocks * kNumNBlocks;
} else if (kGemmType == GemmType::GroupedContiguous) {
num_blocks = num_aligned_m_blocks * kNumNBlocks;
this->grouped_layout = grouped_layout;
} else if (kGemmType == GemmType::GroupedMasked) {
curr_group_idx = curr_cumsum = 0;
this->grouped_layout = grouped_layout;
}
}
__device__ __forceinline__ void get_swizzled_block_idx(const uint32_t num_m_blocks, int block_idx, uint32_t& m_block_idx, uint32_t& n_block_idx) {
DG_STATIC_ASSERT(kNumNBlocksPerGroup % kNumTMAMulticast == 0, "Invalid group size");
// Swizzle for better L2 usage
auto num_blocks_per_group = num_m_blocks * kNumNBlocksPerGroup;
auto group_idx = block_idx / num_blocks_per_group;
auto first_n_block_idx = group_idx * kNumNBlocksPerGroup;
auto num_n_blocks_in_group = min(kNumNBlocksPerGroup, kNumNBlocks - first_n_block_idx);
auto in_group_idx = block_idx % num_blocks_per_group;
m_block_idx = in_group_idx / num_n_blocks_in_group;
n_block_idx = first_n_block_idx + in_group_idx % num_n_blocks_in_group;
}
template <bool kIgnoreGroupedForGroupedContiguous=true>
__device__ __forceinline__ uint32_t get_global_idx(const uint32_t shape_dim, const uint32_t block_size,
const uint32_t& block_idx, const uint32_t& m_block_idx=0) {
if constexpr (kGemmType == GemmType::Normal) {
return block_idx * block_size;
} else if (kGemmType == GemmType::GroupedContiguous) {
auto offset = kIgnoreGroupedForGroupedContiguous ? 0 : __ldg(grouped_layout + m_block_idx * BLOCK_M);
return offset * shape_dim + block_idx * block_size;
} else if (kGemmType == GemmType::GroupedMasked) {
return curr_group_idx * shape_dim + block_idx * block_size;
}
}
__device__ __forceinline__ bool get_next_block(uint32_t& m_block_idx, uint32_t& n_block_idx) {
const auto next_block_idx = (++ current_iter) * gridDim.x + blockIdx.x;
if constexpr (kGemmType == GemmType::GroupedMasked) {
uint32_t num_m_blocks;
while (true) {
// End of the task
if (curr_group_idx == kNumGroups)
return false;
// Within current group
num_m_blocks = cell_div(static_cast<uint32_t>(__ldg(grouped_layout + curr_group_idx)), BLOCK_M);
auto current_m_block_cumsum = curr_cumsum + num_m_blocks;
if (next_block_idx < current_m_block_cumsum * kNumNBlocks)
break;
// Move to check the next group
curr_group_idx ++, curr_cumsum = current_m_block_cumsum;
}
get_swizzled_block_idx(num_m_blocks, next_block_idx - curr_cumsum * kNumNBlocks, m_block_idx, n_block_idx);
} else {
if (next_block_idx >= num_blocks)
return false;
get_swizzled_block_idx(num_aligned_m_blocks, next_block_idx, m_block_idx, n_block_idx);
}
return true;
}
};
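// Illustrative persistent-kernel loop driven by the scheduler (a hedged sketch;
// the template arguments and the tile body are placeholders):
//   auto scheduler = Scheduler<kGemmType, SHAPE_N, BLOCK_M, BLOCK_N,
//                              kNumGroups, kNumTMAMulticast>(shape_m, grouped_layout);
//   uint32_t m_block_idx, n_block_idx;
//   while (scheduler.get_next_block(m_block_idx, n_block_idx)) {
//     // issue TMA loads and WGMMAs for tile (m_block_idx, n_block_idx), then store D
//   }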
#pragma clang diagnostic pop
} // namespace deep_gemm


@@ -0,0 +1,96 @@
#pragma once
#include <cassert>
#include <cuda.h>
#include <cudaTypedefs.h>
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cuda/barrier>
#include "utils.cuh"
namespace deep_gemm {
template <class T>
constexpr CUtensorMapDataType get_CUtensorMapDataType() {
if constexpr (std::is_same<T, uint8_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, __nv_fp8_e4m3>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, __nv_fp8_e5m2>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT8;
} else if constexpr (std::is_same<T, uint16_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT16;
} else if constexpr (std::is_same<T, uint32_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT32;
} else if constexpr (std::is_same<T, uint64_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_UINT64;
} else if constexpr (std::is_same<T, int32_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_INT32;
} else if constexpr (std::is_same<T, int64_t>::value) {
return CU_TENSOR_MAP_DATA_TYPE_INT64;
} else if constexpr (std::is_same<T, __half>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT16;
} else if constexpr (std::is_same<T, float>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT32;
} else if constexpr (std::is_same<T, __nv_bfloat16>::value) {
return CU_TENSOR_MAP_DATA_TYPE_BFLOAT16;
} else if constexpr (std::is_same<T, double>::value) {
return CU_TENSOR_MAP_DATA_TYPE_FLOAT64;
}
}
PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled() {
// Get pointer to `cuTensorMapEncodeTiled`
cudaDriverEntryPointQueryResult driver_status;
void* cuTensorMapEncodeTiled_ptr = nullptr;
#if CUDA_VERSION >= 12050
cudaGetDriverEntryPointByVersion("cuTensorMapEncodeTiled", &cuTensorMapEncodeTiled_ptr, 12000,
cudaEnableDefault, &driver_status);
#else
cudaGetDriverEntryPoint("cuTensorMapEncodeTiled", &cuTensorMapEncodeTiled_ptr,
cudaEnableDefault, &driver_status);
#endif
if (driver_status != cudaDriverEntryPointSuccess)
throw std::runtime_error("driver_status != cudaDriverEntryPointSuccess");
return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(cuTensorMapEncodeTiled_ptr);
}
template <typename T>
CUtensorMap make_2d_tma_copy_desc(T* global_address, uint64_t gmem_dim[2],
uint64_t stride_in_bytes, uint32_t smem_dim[2],
CUtensorMapSwizzle swizzle_type,
PFN_cuTensorMapEncodeTiled encode_func = nullptr) {
CUtensorMap tensor_map{};
constexpr uint32_t rank = 2;
uint64_t global_stride[rank - 1] = {stride_in_bytes};
uint32_t elem_strides[rank] = {1, 1};
if (encode_func == nullptr)
encode_func = get_cuTensorMapEncodeTiled();
auto result = encode_func(
&tensor_map, get_CUtensorMapDataType<typename std::remove_cv<T>::type>(), rank,
global_address, gmem_dim, global_stride, smem_dim, elem_strides,
CUtensorMapInterleave::CU_TENSOR_MAP_INTERLEAVE_NONE, swizzle_type,
CUtensorMapL2promotion::CU_TENSOR_MAP_L2_PROMOTION_L2_256B,
CUtensorMapFloatOOBfill::CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
DG_HOST_ASSERT(result == CUDA_SUCCESS);
return tensor_map;
}
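// Illustrative host-side setup (hedged; the dimension ordering, tile sizes and
// swizzle choice below are assumptions for a 2D FP8 operand):
//   uint64_t gmem_dim[2] = {shape_k, shape_m};   // fastest-varying dimension first
//   uint32_t smem_dim[2] = {BLOCK_K, BLOCK_M};
//   auto tma_a = make_2d_tma_copy_desc(gmem_a, gmem_dim, shape_k * sizeof(__nv_fp8_e4m3),
//                                      smem_dim, CU_TENSOR_MAP_SWIZZLE_128B);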
template <uint32_t kNumTMAMulticast = 1>
__device__ __forceinline__ void
tma_copy(void const* desc_ptr, uint64_t* barrier_ptr, void* smem_ptr,
int32_t const& crd_0, int32_t const& crd_1) {
constexpr auto cache_hint = static_cast<uint64_t>(cute::TMA::CacheHintSm90::EVICT_NORMAL);
if constexpr (kNumTMAMulticast == 1) {
cute::SM90_TMA_LOAD_2D::copy(desc_ptr, barrier_ptr, cache_hint, smem_ptr, crd_0, crd_1);
} else if (cute::block_rank_in_cluster() == 0) {
cute::SM90_TMA_LOAD_MULTICAST_2D::copy(desc_ptr, barrier_ptr, (1 << kNumTMAMulticast) - 1, cache_hint, smem_ptr, crd_0, crd_1);
}
}
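// Illustrative producer-side issue (hedged; mbarrier setup and phase tracking are
// omitted, and the coordinates are placeholders):
//   tma_copy<kNumTMAMulticast>(&tensor_map_a, full_barrier, smem_a,
//                              k_block_idx * BLOCK_K, m_block_idx * BLOCK_M);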
} // namespace deep_gemm


@@ -0,0 +1,48 @@
#pragma once
#include <cstdio>
#include <exception>
#include <string>
#ifdef __CLION_IDE__
__host__ __device__ __forceinline__ void host_device_printf(const char* format, ...) { asm volatile("trap;"); }
#define printf host_device_printf
#endif
class AssertionException : public std::exception {
private:
std::string message{};
public:
explicit AssertionException(const std::string& message) : message(message) {}
const char *what() const noexcept override { return message.c_str(); }
};
#ifndef DG_HOST_ASSERT
#define DG_HOST_ASSERT(cond) \
do { \
if (not (cond)) { \
printf("Assertion failed: %s:%d, condition: %s\n", \
__FILE__, __LINE__, #cond); \
throw AssertionException("Assertion failed: " #cond); \
} \
} while (0)
#endif
#ifndef DG_DEVICE_ASSERT
#define DG_DEVICE_ASSERT(cond) \
do { \
if (not (cond)) { \
printf("Assertion failed: %s:%d, condition: %s\n", __FILE__, __LINE__, #cond); \
asm("trap;"); \
} \
} while (0)
#endif
#ifndef DG_STATIC_ASSERT
#define DG_STATIC_ASSERT(cond, reason) static_assert(cond, reason)
#endif
template <typename T>
__device__ __host__ constexpr T cell_div(T a, T b) {
return (a + b - 1) / b;
}
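// Quick sanity checks of the rounding-up behaviour (illustrative only):
static_assert(cell_div(1000, 128) == 8, "cell_div rounds up to the next multiple");
static_assert(cell_div(1024, 128) == 8, "exact multiples are left unchanged");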


@@ -0,0 +1,511 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Reference implementation for GETT in host-side code.
*/
#pragma once
/////////////////////////////////////////////////////////////////////////////////////////////////
#include "cutlass/gemm/gemm.h"
#include "cutlass/complex.h"
#include "cutlass/numeric_conversion.h"
#include "cutlass/epilogue/thread/activation.h"
#include "cutlass/relatively_equal.h"
#include <iostream>
#include "cute/tensor.hpp"
/////////////////////////////////////////////////////////////////////////////////////////////////
namespace cutlass::reference::host {
template<class T, class = void>
struct ElementTraits {
using type = T;
};
template<class T>
struct ElementTraits<T, std::enable_if_t<!std::is_same_v<decltype(std::declval<T>().get()), void> > > {
using type = decltype(std::declval<T>().get());
};
/////////////////////////////////////////////////////////////////////////////////////////////////
template<
class ElementAccumulator_,
class TensorA_, // (M, K, L)
class TensorB_, // (N, K, L)
class TensorScaleA_, // (m, k, L)
class TensorScaleB_, // (n, k, L)
class TileShape_
>
struct GettMainloopParams {
using ElementAccumulator = ElementAccumulator_;
using TensorA = TensorA_;
using TensorB = TensorB_;
using EngineA = typename TensorA::engine_type;
using LayoutA = typename TensorA::layout_type;
using EngineB = typename TensorB::engine_type;
using LayoutB = typename TensorB::layout_type;
using TensorScaleA = TensorScaleA_;
using TensorScaleB = TensorScaleB_;
using TileShape = TileShape_;
using EngineScaleA = typename TensorScaleA::engine_type;
using EngineScaleB = typename TensorScaleB::engine_type;
TensorA A{};
TensorB B{};
TensorScaleA ScaleA{};
TensorScaleB ScaleB{};
};
/////////////////////////////////////////////////////////////////////////////////////////////////
template<
class ElementScalar_,
class ElementScalingFactor_,
class ElementAccumulator_,
class ElementCompute_,
class TensorC_, // (M, N, L)
class TensorD_, // (M, N, L)
class VectorBias_ = TensorD_, // (M, 1)
class TensorAux_ = TensorD_, // (M, N, L)
class VectorAlpha_ = TensorD_, // (M, 1)
class VectorBeta_ = VectorAlpha_, // (M, 1)
class ActivationFunctor_ = cutlass::epilogue::thread::Identity<ElementCompute_>,
class BiasBinaryOp_ = cutlass::plus<ElementCompute_>,
bool PerColumnBias_ = false
>
struct GettEpilogueParams {
using ElementScalar = ElementScalar_;
using ElementScalingFactor = ElementScalingFactor_;
using ElementAccumulator = ElementAccumulator_;
using ElementCompute = ElementCompute_;
using TensorC = TensorC_;
using TensorD = TensorD_;
using TensorAux = TensorAux_;
using VectorBias = VectorBias_;
using VectorAlpha = VectorAlpha_;
using VectorBeta = VectorBeta_;
using ActivationFunctor = ActivationFunctor_;
using BiasBinaryOp = BiasBinaryOp_;
using EngineC = typename TensorC::engine_type;
using LayoutC = typename TensorC::layout_type;
using EngineD = typename TensorD::engine_type;
using LayoutD = typename TensorD::layout_type;
static constexpr bool PerColumnBias = PerColumnBias_;
ElementScalar alpha = ElementScalar(1);
ElementScalar beta = ElementScalar(0);
TensorC C{};
TensorD D{};
VectorBias Bias{};
TensorAux Aux{};
VectorAlpha Valpha{};
VectorBeta Vbeta{};
ElementCompute st = ElementCompute(1);
ElementAccumulator* abs_max_D = nullptr;
ElementAccumulator* abs_max_Aux = nullptr;
ElementScalingFactor scale_a = ElementScalingFactor(1);
ElementScalingFactor scale_b = ElementScalingFactor(1);
ElementScalingFactor scale_c = ElementScalingFactor(1);
ElementScalingFactor scale_d = ElementScalingFactor(1);
ElementScalingFactor scale_aux = ElementScalingFactor(1);
bool beta_per_channel_scaling = false;
};
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GETT - General Tensor-Tensor contraction reference kernel with Groupwise scaling
template <
class MainloopParams,
class EpilogueParams
>
void Gett(
MainloopParams const& mainloop_params,
EpilogueParams const& epilogue_params)
{
static int constexpr kBlockM = cute::get<0>(typename MainloopParams::TileShape{});
static int constexpr kBlockN = cute::get<1>(typename MainloopParams::TileShape{});
// printf("mainloop_params.ScaleA.layout()"); cute::print(mainloop_params.ScaleA.layout()); printf("\n");
// printf("mainloop_params.ScaleB.layout()"); cute::print(mainloop_params.ScaleB.layout()); printf("\n");
#if defined(_OPENMP)
#pragma omp parallel for collapse(3)
#endif
for (int64_t l = 0; l < cute::size<2>(mainloop_params.A.layout()); ++l) {
for (int64_t m = 0; m < cute::size<0>(mainloop_params.A.layout()); m += kBlockM) {
for (int64_t n = 0; n < cute::size<0>(mainloop_params.B.layout()); n += kBlockN) {
typename MainloopParams::ElementAccumulator acc[kBlockM][kBlockN];
gett_mainloop(mainloop_params, m, n, l, acc);
gett_epilogue(epilogue_params, m, n, l, acc);
}
}
}
}
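// Note: the reference walks the output in (batch, kBlockM, kBlockN) tiles so that
// the groupwise-scaled accumulation below mirrors the device kernel's CTA tiling;
// each accumulator block stays private to one OpenMP thread.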
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GETT - Mainloop
template <class MainloopParams, class ElementAccumulator, int kBlockM, int kBlockN>
void gett_mainloop(
MainloopParams const& mainloop_params,
int64_t m,
int64_t n,
int64_t l,
ElementAccumulator (&acc)[kBlockM][kBlockN])
{
static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "M, K, B");
static_assert(cute::rank(typename MainloopParams::LayoutB{}) == 3, "N, K, B");
using cute::raw_pointer_cast;
using ElementA = typename ElementTraits<typename MainloopParams::EngineA::value_type>::type;
using ElementB = typename ElementTraits<typename MainloopParams::EngineB::value_type>::type;
using ElementBlockScaleA = typename ElementTraits<typename MainloopParams::EngineScaleA::value_type>::type;
using ElementBlockScaleB = typename ElementTraits<typename MainloopParams::EngineScaleB::value_type>::type;
using RingOp = multiply_add<ElementAccumulator, ElementAccumulator, ElementAccumulator>;
RingOp fma_op;
multiplies<ElementAccumulator> scale_op;
static int constexpr kBlockK = cute::get<2>(typename MainloopParams::TileShape{});
// Temporary accumulators to separate blockwise accumulation
typename MainloopParams::ElementAccumulator acc_temp[kBlockM][kBlockN];
// Zero out accumulators
for (int m_b = 0; m_b < kBlockM; ++m_b) {
for (int n_b = 0; n_b < kBlockN; ++n_b) {
acc[m_b][n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity
acc_temp[m_b][n_b] = ElementAccumulator(0);
}
}
int64_t block_m = m / kBlockM;
int64_t block_n = n / kBlockN;
cute::Tensor blockscale_A = mainloop_params.ScaleA(block_m, _, _, l);
cute::Tensor blockscale_B = mainloop_params.ScaleB(block_n, _, _, l);
const int ScaleGranularityM = cute::size<0>(typename MainloopParams::TileShape{}) / cute::size<1>(mainloop_params.ScaleA.shape());
const int ScaleGranularityN = cute::size<1>(typename MainloopParams::TileShape{}) / cute::size<1>(mainloop_params.ScaleB.shape());
assert(cute::size<0>(typename MainloopParams::TileShape{}) == ScaleGranularityM * cute::size<1>(mainloop_params.ScaleA.shape()));
assert(cute::size<1>(typename MainloopParams::TileShape{}) == ScaleGranularityN * cute::size<1>(mainloop_params.ScaleB.shape()));
// Compute on this k-block
for (int64_t k = 0; k < cute::size<1>(mainloop_params.A.layout()); ++k) {
// Load Blockwise scaling factor from blockscale Tensors for B
int64_t block_k = k / kBlockK;
cute::Tensor scale_a = blockscale_A(_, block_k);
cute::Tensor scale_b = blockscale_B(_, block_k);
// Load A
ElementAccumulator a_frag[kBlockM];
for (int m_b = 0; m_b < kBlockM; ++m_b) {
if (m + m_b < cute::size<0>(mainloop_params.A.layout())) {
// Perform reference GEMM calculations at the accumulator's precision. Cast A value to accumulator type.
a_frag[m_b] = static_cast<ElementAccumulator>(ElementA(mainloop_params.A(m + m_b, k, l)));
} else {
a_frag[m_b] = ElementAccumulator(0); // RingOp::AdditionIdentity
}
}
// Load B
ElementAccumulator b_frag[kBlockN];
for (int n_b = 0; n_b < kBlockN; ++n_b) {
if (n + n_b < cute::size<0>(mainloop_params.B.layout())) {
// Perform reference GEMM calculations at the accumulator's precision. Cast B value to accumulator type.
b_frag[n_b] = static_cast<ElementAccumulator>(ElementB(mainloop_params.B(n + n_b, k, l)));
} else {
b_frag[n_b] = ElementAccumulator(0); // RingOp::AdditionIdentity
}
}
// do compute
for (int m_b = 0; m_b < kBlockM; ++m_b) {
for (int n_b = 0; n_b < kBlockN; ++n_b) {
acc_temp[m_b][n_b] = fma_op(a_frag[m_b], b_frag[n_b], acc_temp[m_b][n_b]);
}
}
// Apply Groupwise-scaling at kBlockK boundary
// (a) Apply the group and block scaling factors to the partial accumulated results (acc_temp) at the kBlockK boundary
// (b) Zero out the partial temporary accumulators (acc_temp)
// (c) Update the permanent accumulators (acc)
if ((k+1) % kBlockK == 0) {
for (int m_b = 0; m_b < kBlockM; ++m_b) {
auto scale_a_m_b = scale_a[m_b / ScaleGranularityM];
for (int n_b = 0; n_b < kBlockN; ++n_b) {
auto scale_b_n_b = scale_b[n_b / ScaleGranularityN];
ElementAccumulator blockwise_scaled_accum = acc_temp[m_b][n_b] * scale_a_m_b * scale_b_n_b;
acc[m_b][n_b] = blockwise_scaled_accum + acc[m_b][n_b];
acc_temp[m_b][n_b] = ElementAccumulator(0);
}
}
}
}
}
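// In effect, for each element (m, n) of a (kBlockM, kBlockN) tile the mainloop computes
//   acc(m, n) = sum over k-blocks kb of
//               scale_A(m / ScaleGranularityM, kb) * scale_B(n / ScaleGranularityN, kb)
//               * sum_{k in kb} A(m, k) * B(n, k),
// i.e. products are accumulated unscaled inside each k-block and the groupwise
// scale factors are applied once per kBlockK boundary.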
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GETT - Epilogue
template <class EpilogueParams, class ElementAccumulator, int kBlockM, int kBlockN>
void gett_epilogue(
EpilogueParams const& epilogue_params,
int64_t m,
int64_t n,
int64_t l,
ElementAccumulator (&acc)[kBlockM][kBlockN])
{
static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == 3, "M, N, B");
static_assert(cute::rank(typename EpilogueParams::LayoutD{}) == 3, "M, N, B");
using cute::raw_pointer_cast;
using ElementCompute = typename EpilogueParams::ElementCompute;
using ElementC = typename EpilogueParams::TensorC::value_type;
using ElementD = typename EpilogueParams::TensorD::value_type;
using ElementAux = typename EpilogueParams::TensorAux::value_type;
using ElementBias = typename EpilogueParams::VectorBias::value_type;
using ElementScalar = typename EpilogueParams::ElementScalar;
using ElementScalingFactor = typename EpilogueParams::ElementScalingFactor;
using ActivationFunctor = typename EpilogueParams::ActivationFunctor;
using BiasBinaryOp = typename EpilogueParams::BiasBinaryOp;
constexpr bool PerColBias = EpilogueParams::PerColumnBias;
constexpr bool IsScalingAndAmaxOutputNeeded =
cute::is_same_v<ElementD, cutlass::float_e4m3_t> or
cute::is_same_v<ElementD, cutlass::float_e5m2_t>;
constexpr bool IsScalingAndAmaxAuxOutputNeeded =
cute::is_same_v<ElementAux, cutlass::float_e4m3_t> or
cute::is_same_v<ElementAux, cutlass::float_e5m2_t>;
constexpr bool IsReLUAuxNeeded =
(cute::is_same_v<ActivationFunctor, cutlass::epilogue::thread::ReLu<ElementCompute>> or
cute::is_same_v<ActivationFunctor, cutlass::epilogue::thread::Clamp<ElementCompute>>) and
cute::is_same_v<ElementAux, cutlass::uint1b_t>;
constexpr bool IsClamp =
cute::is_same_v<ActivationFunctor, cutlass::epilogue::thread::Clamp<ElementCompute>>;
constexpr bool IsBackpropFusion =
cute::is_same_v<ActivationFunctor, cutlass::epilogue::thread::dGELU<ElementCompute>> or
cute::is_same_v<ActivationFunctor, cutlass::epilogue::thread::dReLU<ElementCompute>>;
// Input related converter
NumericConverter<ElementCompute, ElementAccumulator> accumulator_converter;
NumericConverter<ElementCompute, ElementC> source_converter;
NumericConverter<ElementCompute, ElementBias> bias_converter;
[[maybe_unused]] NumericConverter<ElementCompute, ElementAux> aux_source_converter;
// Scale related converter
NumericConverter<ElementCompute, ElementScalar> scale_converter;
NumericConverter<ElementCompute, ElementScalingFactor> scaling_factor_converter;
// Abs max converter
[[maybe_unused]] NumericConverter<ElementAccumulator, ElementCompute> abs_max_output_converter;
// Output related converter
NumericConverter<ElementD, ElementCompute> destination_converter;
[[maybe_unused]] NumericConverter<ElementAux, ElementCompute> aux_destination_converter;
NumericConverter<ElementBias, ElementCompute> dBias_converter;
// Epilogue operations
multiply_add<ElementCompute, ElementCompute, ElementCompute> epilogue_fma;
multiplies<ElementCompute> mul;
plus<ElementCompute> add;
// Activation operation
ActivationFunctor activation;
// Bias binary operation
BiasBinaryOp bias_op;
// Do conversion
ElementCompute converted_alpha = scale_converter(epilogue_params.alpha);
ElementCompute converted_beta = scale_converter(epilogue_params.beta);
ElementCompute converted_scale_a = scaling_factor_converter(epilogue_params.scale_a);
ElementCompute converted_scale_b = scaling_factor_converter(epilogue_params.scale_b);
ElementCompute converted_scale_c = scaling_factor_converter(epilogue_params.scale_c);
ElementCompute converted_scale_d = scaling_factor_converter(epilogue_params.scale_d);
ElementCompute converted_scale_aux = scaling_factor_converter(epilogue_params.scale_aux);
// Init local var
[[maybe_unused]] ElementCompute local_abs_max_output = ElementCompute(0);
[[maybe_unused]] ElementCompute local_abs_max_aux_output = ElementCompute(0);
converted_alpha = mul(converted_alpha, mul(converted_scale_a, converted_scale_b));
converted_beta = mul(converted_beta, converted_scale_c);
ElementCompute inter_accum[kBlockM][kBlockN];
for (int m_b = 0; m_b < kBlockM; ++m_b) {
ElementCompute local_dBias = ElementCompute(0);
for (int n_b = 0; n_b < kBlockN; ++n_b) {
if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) {
// Convert every type to ElementCompute first, do compute, convert to output type, write it out
ElementCompute converted_acc = accumulator_converter(acc[m_b][n_b]);
// per-row alpha
if (raw_pointer_cast(epilogue_params.Valpha.data())) {
converted_alpha = scale_converter(epilogue_params.Valpha(m + m_b));
}
ElementCompute output = mul(converted_alpha, converted_acc);
if (raw_pointer_cast(epilogue_params.Bias.data()) && not IsBackpropFusion) {
ElementCompute converted_bias = bias_converter(epilogue_params.Bias(PerColBias ? n + n_b : m + m_b));
output = bias_op(output, converted_bias);
}
if (raw_pointer_cast(epilogue_params.C.data())) {
ElementCompute converted_src = source_converter(epilogue_params.C(m + m_b, n + n_b, l));
// per-row beta
if (epilogue_params.Vbeta.data()) {
converted_beta = scale_converter(epilogue_params.Vbeta(m + m_b));
}
output = epilogue_fma(converted_beta, converted_src, output);
}
if constexpr (IsBackpropFusion) {
ElementAux aux_input = ElementAux(0);
if (raw_pointer_cast(epilogue_params.Aux.data())) {
aux_input = epilogue_params.Aux(m + m_b, n + n_b, l);
}
output = activation(output, aux_source_converter(aux_input));
local_dBias = add(local_dBias, output);
}
else {
if (raw_pointer_cast(epilogue_params.Aux.data())) {
auto aux_output = output;
if constexpr (IsScalingAndAmaxAuxOutputNeeded) {
maximum_absolute_value_reduction<ElementCompute, true> amax_op;
local_abs_max_aux_output = amax_op(local_abs_max_aux_output, aux_output);
aux_output = epilogue_fma(converted_scale_aux, aux_output, ElementCompute(0));
}
if constexpr (IsReLUAuxNeeded) {
epilogue_params.Aux(m + m_b, n + n_b, l) = not (aux_output < 0) ? uint1b_t(1) : uint1b_t(0);
} else {
epilogue_params.Aux(m + m_b, n + n_b, l) = aux_destination_converter(aux_output);
}
}
if constexpr (IsClamp) { // Treat Clamp as ReLU
output = activation(output, {0, std::numeric_limits<ElementCompute>::max()});
}
else {
output = activation(output);
}
}
if constexpr (IsScalingAndAmaxOutputNeeded) {
maximum_absolute_value_reduction<ElementCompute, true> amax_op;
local_abs_max_output = amax_op(local_abs_max_output, output);
output = epilogue_fma(converted_scale_d, output, ElementCompute(0));
}
inter_accum[m_b][n_b] = ElementCompute(output);
}
} // n_b
if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n < cute::size<1>(epilogue_params.D.layout())) {
if (raw_pointer_cast(epilogue_params.Bias.data()) && IsBackpropFusion) {
ElementCompute converted_dBias = bias_converter(epilogue_params.Bias(m + m_b));
local_dBias = add(local_dBias, converted_dBias);
epilogue_params.Bias(m + m_b) = dBias_converter(local_dBias);
}
}
} // m_b
for (int m_b = 0; m_b < kBlockM; ++m_b) {
for (int n_b = 0; n_b < kBlockN; ++n_b) {
if (m + m_b < cute::size<0>(epilogue_params.D.layout()) && n + n_b < cute::size<1>(epilogue_params.D.layout())) {
epilogue_params.D(m + m_b, n + n_b, l) = destination_converter(inter_accum[m_b][n_b]);
}
}
}
#if defined(_OPENMP)
#pragma omp critical(Abs_Max_Data_Update)
#endif
{
if constexpr (IsScalingAndAmaxOutputNeeded) {
if (epilogue_params.abs_max_D) {
*epilogue_params.abs_max_D = maximum_with_nan_propogation<ElementAccumulator>{}(
*epilogue_params.abs_max_D, abs_max_output_converter(local_abs_max_output));
}
}
if constexpr (IsScalingAndAmaxAuxOutputNeeded) {
if (epilogue_params.abs_max_Aux) {
*epilogue_params.abs_max_Aux = maximum_with_nan_propogation<ElementAccumulator>{}(
*epilogue_params.abs_max_Aux, abs_max_output_converter(local_abs_max_aux_output));
}
}
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM - General Matrix-Matrix contraction without conjugation options
template <
class MainloopParams,
class EpilogueParams
>
void Gemm3x(
MainloopParams const& mainloop_params,
EpilogueParams const& epilogue_params)
{
using namespace cute;
static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename MainloopParams::LayoutB{}));
static_assert(cute::rank(typename EpilogueParams::LayoutC{}) == cute::rank(typename EpilogueParams::LayoutD{}));
static_assert(cute::rank(typename MainloopParams::LayoutA{}) == cute::rank(typename EpilogueParams::LayoutC{}));
static_assert(cute::rank(typename MainloopParams::LayoutA{}) == 3, "Only Rank3 Tensors (M, K, Batch_Count) "
"with Batchmode are supported");
// Lower the Matrix-Multiplication with Groupwise scaling (Gemm3x) to a Tensor Contraction (Gett).
Gett(mainloop_params, epilogue_params);
}
/////////////////////////////////////////////////////////////////////////////////////////////////
} // cutlass::reference::host
/////////////////////////////////////////////////////////////////////////////////////////////////
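The reference epilogue above keeps a thread-local running absolute maximum, applies the converted output scale, and only then merges the per-thread maxima into the global `abs_max_D` / `abs_max_Aux` inside an OpenMP critical section using a NaN-propagating maximum. As a sketch, with o the epilogue value entering this step and s_D the converted output scale:
```
\[
\mathrm{amax}_D \;=\; \max_{m,n} \lvert o_{mn} \rvert,
\qquad
D_{mn} \;=\; s_D \, o_{mn},
\]
```
The same pattern, applied to the pre-activation value, produces `abs_max_Aux` and the scaled auxiliary output.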


@@ -0,0 +1,69 @@
import os
import setuptools
import shutil
import subprocess
from setuptools.command.develop import develop
from setuptools.command.install import install
current_dir = os.path.dirname(os.path.realpath(__file__))
jit_include_dirs = ('deep_gemm/include/deep_gemm', )
cutlass_dirs = '../../include'
third_party_include_dirs = (os.path.join(cutlass_dirs, 'cute'), os.path.join(cutlass_dirs, 'cutlass'))
print(third_party_include_dirs)
class PostDevelopCommand(develop):
def run(self):
develop.run(self)
self.make_jit_include_symlinks()
@staticmethod
def make_jit_include_symlinks():
# Make symbolic links of third-party include directories
for d in third_party_include_dirs:
dirname = d.split('/')[-1]
src_dir = f'{current_dir}/{d}'
dst_dir = f'{current_dir}/deep_gemm/include/{dirname}'
if not os.path.exists(src_dir):
os.makedirs(src_dir, exist_ok=True)
assert os.path.exists(src_dir)
if os.path.exists(dst_dir):
assert os.path.islink(dst_dir)
os.unlink(dst_dir)
os.symlink(src_dir, dst_dir, target_is_directory=True)
class PostInstallCommand(install):
def run(self):
install.run(self)
self.copy_jit_includes()
def copy_jit_includes(self):
# Copy include directories needed by JIT
shutil.rmtree(f'{self.build_lib}/deep_gemm/include', ignore_errors=True)
os.makedirs(f'{self.build_lib}/deep_gemm/include', exist_ok=False)
for d in jit_include_dirs + third_party_include_dirs:
src_dir = f'{current_dir}/{d}'
dst_dir = f'{self.build_lib}/deep_gemm/include/{d.split("/")[-1]}'
assert os.path.exists(src_dir)
shutil.copytree(src_dir, dst_dir)
if __name__ == '__main__':
# noinspection PyBroadException
try:
cmd = ['git', 'rev-parse', '--short', 'HEAD']
revision = '+' + subprocess.check_output(cmd).decode('ascii').rstrip()
except:
revision = ''
# noinspection PyTypeChecker
setuptools.setup(
name='deep_gemm',
version='1.0.0' + revision,
packages=['deep_gemm', 'deep_gemm/jit', 'deep_gemm/jit_kernels'],
cmdclass={
'develop': PostDevelopCommand,
'install': PostInstallCommand
}
)


@@ -0,0 +1,158 @@
import random
import torch
from typing import Tuple
import deep_gemm
from deep_gemm import bench_kineto, calc_diff, cell_div, get_col_major_tma_aligned_tensor
def per_token_cast_to_fp8(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
assert x.dim() == 2 and x.size(1) % 128 == 0
m, n = x.shape
x_view = x.view(m, -1, 128)
x_amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(1e-4)
return (x_view * (448.0 / x_amax.unsqueeze(2))).to(torch.float8_e4m3fn).view(m, n), (x_amax / 448.0).view(m, -1)
def per_block_cast_to_fp8(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
assert x.dim() == 2
m, n = x.shape
x_padded = torch.zeros((cell_div(m, 128) * 128, cell_div(n, 128) * 128), dtype=x.dtype, device=x.device)
x_padded[:m, :n] = x
x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)
x_scaled = (x_view * (448.0 / x_amax)).to(torch.float8_e4m3fn)
return x_scaled.view_as(x_padded)[:m, :n].contiguous(), (x_amax / 448.0).view(x_view.size(0), x_view.size(2))
def construct(m: int, k: int, n: int) -> \
Tuple[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor], torch.Tensor, torch.Tensor]:
x = torch.randn((m, k), device='cuda', dtype=torch.bfloat16)
y = torch.randn((n, k), device='cuda', dtype=torch.bfloat16)
out = torch.empty((m, n), device='cuda', dtype=torch.bfloat16)
ref_out = x @ y.t()
x_fp8, y_fp8 = per_token_cast_to_fp8(x), per_block_cast_to_fp8(y)
# Transpose earlier so that the testing will not trigger transposing kernels
x_fp8 = (x_fp8[0], get_col_major_tma_aligned_tensor(x_fp8[1]))
return x_fp8, y_fp8, out, ref_out
def construct_grouped(num_groups: int, m: int, k: int, n: int, is_masked: bool) -> \
Tuple[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor], torch.Tensor, torch.Tensor]:
x = torch.randn((num_groups, m, k), device='cuda', dtype=torch.bfloat16)
y = torch.randn((num_groups, n, k), device='cuda', dtype=torch.bfloat16)
out = torch.empty((num_groups, m, n), device='cuda', dtype=torch.bfloat16)
ref_out = torch.einsum('gmk,gnk->gmn', x, y)
assert m % 4 == 0, f'TMA alignment error: {m}'
x_fp8 = (torch.empty_like(x, dtype=torch.float8_e4m3fn), torch.empty((num_groups, m, k // 128), device='cuda', dtype=torch.float))
y_fp8 = (torch.empty_like(y, dtype=torch.float8_e4m3fn), torch.empty((num_groups, (n + 127) // 128, k // 128), device='cuda', dtype=torch.float))
for i in range(num_groups):
x_fp8[0][i], x_fp8[1][i] = per_token_cast_to_fp8(x[i])
y_fp8[0][i], y_fp8[1][i] = per_block_cast_to_fp8(y[i])
# For non-masked input, we must merge the group and M dims
if not is_masked:
x_fp8 = (x_fp8[0].view(-1, k), per_token_cast_to_fp8(x.view(-1, k))[1])
out, ref_out = out.view(-1, n), ref_out.view(-1, n)
# Transpose earlier so that the testing will not trigger transposing kernels
x_fp8 = (x_fp8[0], get_col_major_tma_aligned_tensor(x_fp8[1]))
return x_fp8, y_fp8, out, ref_out
def test_gemm() -> None:
print('Testing GEMM:')
for m in (64, 128, 4096):
for k, n in [(7168, 2112), (1536, 24576), (512, 32768), (16384, 7168), (7168, 4096), (2048, 7168)]:
x_fp8, y_fp8, out, ref_out = construct(m, k, n)
deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)
diff = calc_diff(out, ref_out)
assert diff < 0.001, f'{m=}, {k=}, {n=}, {diff:.5f}'
# noinspection PyShadowingNames
def test_func():
# Construct new tensors every time to avoid L2 cache acceleration
x_fp8, y_fp8, out, ref_out = construct(m, k, n)
deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)
t = bench_kineto(test_func, 'fp8_gemm', suppress_kineto_output=True)
print(f' > Performance (m={m:5}, n={n:5}, k={k:5}): {t * 1e6:4.0f} us | '
f'throughput: {2 * m * n * k / t / 1e12:4.0f} TFLOPS, '
f'{(m * k + k * n + m * n * 2) / 1e9 / t:4.0f} GB/s')
print()
def test_m_grouped_gemm_contiguous() -> None:
print('Testing grouped contiguous GEMM:')
for num_groups, m, k, n in ((4, 8192, 7168, 4096), (4, 8192, 2048, 7168), (8, 4096, 7168, 4096), (8, 4096, 2048, 7168)):
# TODO: make a stronger test
x_fp8, y_fp8, out, ref_out = construct_grouped(num_groups, m, k, n, is_masked=False)
m_indices = torch.arange(0, num_groups, device='cuda', dtype=torch.int)
m_indices = m_indices.unsqueeze(-1).expand(num_groups, m).contiguous().view(-1)
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(x_fp8, y_fp8, out, m_indices)
diff = calc_diff(out, ref_out)
assert diff < 0.001, f'm={m * num_groups}, {k=}, {n=}, {diff:.5f}'
# noinspection PyShadowingNames
def test_func():
# Construct new tensors every time to avoid L2 cache acceleration
x_fp8, y_fp8, out, ref_out = construct_grouped(num_groups, m, k, n, is_masked=False)
m_indices = torch.arange(0, num_groups, device='cuda', dtype=torch.int)
m_indices = m_indices.unsqueeze(-1).expand(num_groups, m).contiguous().view(-1)
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(x_fp8, y_fp8, out, m_indices)
t = bench_kineto(test_func, 'fp8_gemm', suppress_kineto_output=True)
print(f' > Performance ({num_groups=}, m_per_group={m:4}, n={n:4}, k={k:4}): {t * 1e6:4.0f} us | '
f'throughput: {2 * num_groups * m * n * k / t / 1e12:4.0f} TFLOPS, '
f'{(num_groups * (m * k + k * n + m * n * 2)) / 1e9 / t:4.0f} GB/s')
print()
def test_m_grouped_gemm_masked() -> None:
print('Testing grouped masked GEMM:')
for num_groups, m in ((1, 1024), (2, 512), (4, 256)):
for k, n in ((7168, 4096), (2048, 7168), ):
# Test correctness
masked_m_candidates = list(filter(lambda candidate: candidate <= m, (64, 128, 192, 256, 320, 384)))
for i in range(10):
x_fp8, y_fp8, out, ref_out = construct_grouped(num_groups, m, k, n, is_masked=True)
masked_m = torch.empty((num_groups, ), device='cuda', dtype=torch.int)
for j in range(num_groups):
masked_m[j] = random.choice(masked_m_candidates)
expected_m = int(masked_m.float().mean()) + 1
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(x_fp8, y_fp8, out, masked_m, expected_m)
for j in range(num_groups):
diff = calc_diff(out[j, :masked_m[j].item()], ref_out[j, :masked_m[j].item()])
assert diff < 0.001, f'{m=}, {k=}, {n=}, {j=}, masked_m={masked_m[j]}, {num_groups=}, {diff:.5f}'
# noinspection PyShadowingNames
def test_func():
# Construct new tensors every time to avoid L2 cache acceleration
x_fp8, y_fp8, out, ref_out = construct_grouped(num_groups, m, k, n, is_masked=True)
masked_m = torch.ones((num_groups, ), device='cuda', dtype=torch.int) * m
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(x_fp8, y_fp8, out, masked_m, m)
# Test performance with fixed shapes
t = bench_kineto(test_func, 'fp8_gemm', suppress_kineto_output=True)
print(f' > Performance ({num_groups=}, m_per_group={m:4}, n={n:4}, k={k:4}): {t * 1e6:4.0f} us | '
f'throughput: {2 * num_groups * m * n * k / t / 1e12:4.0f} TFLOPS, '
f'{(num_groups * (m * k + k * n + m * n * 2)) / 1e9 / t:4.0f} GB/s')
print()
if __name__ == '__main__':
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.manual_seed(0)
random.seed(0)
print('Library path:')
print(f' > {deep_gemm.__path__}\n')
test_gemm()
test_m_grouped_gemm_contiguous()
test_m_grouped_gemm_masked()
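The two cast helpers above quantize activations per 1×128 tile and weights per 128×128 block, choosing each scale so the tile's absolute maximum maps to the e4m3 limit of 448. A minimal round-trip sketch of the per-token variant (assumes a CUDA device and a PyTorch build with `torch.float8_e4m3fn`; shapes are illustrative only):
```
import torch

# Per-token FP8 cast round trip, mirroring per_token_cast_to_fp8 above.
x = torch.randn(4, 256, device='cuda', dtype=torch.bfloat16)
x_view = x.view(4, -1, 128)                                   # (rows, tiles, 128)
amax = x_view.abs().float().amax(dim=2, keepdim=True).clamp(1e-4)
x_fp8 = (x_view * (448.0 / amax)).to(torch.float8_e4m3fn)     # quantize
scale = amax / 448.0                                          # per-tile scales
x_deq = x_fp8.to(torch.float32) * scale                       # dequantize
rel_err = (x_deq.view(4, 256) - x.float()).abs().mean() / x.float().abs().mean()
print(f'mean relative error: {rel_err.item():.4f}')           # typically a few percent for e4m3
```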


@@ -0,0 +1,64 @@
import os
import torch
from typing import Any
from deep_gemm import jit
class Capture:
def __init__(self) -> None:
self.read_fd = None
self.write_fd = None
self.saved_stdout = None
self.captured = None
def __enter__(self) -> Any:
self.read_fd, self.write_fd = os.pipe()
self.saved_stdout = os.dup(1)
os.dup2(self.write_fd, 1)
return self
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
os.dup2(self.saved_stdout, 1)
os.close(self.write_fd)
with os.fdopen(self.read_fd, 'r') as f:
self.captured = f.read()
def capture(self) -> str:
return self.captured
if __name__ == '__main__':
# Runtime
print(f'NVCC compiler: {jit.get_nvcc_compiler()}\n')
# Templates
print('Generated code:')
args = (('lhs', torch.float8_e4m3fn), ('rhs', torch.float8_e4m3fn), ('scale', torch.float), ('out', torch.bfloat16),
('enable_double_streams', bool), ('stream', torch.cuda.Stream))
body = "\n"
body += 'std::cout << reinterpret_cast<uint64_t>(lhs) << std::endl;\n'
body += 'std::cout << reinterpret_cast<uint64_t>(rhs) << std::endl;\n'
body += 'std::cout << reinterpret_cast<uint64_t>(scale) << std::endl;\n'
body += 'std::cout << reinterpret_cast<uint64_t>(out) << std::endl;\n'
body += 'std::cout << enable_double_streams << std::endl;\n'
body += 'std::cout << reinterpret_cast<uint64_t>(stream) << std::endl;\n'
code = jit.generate((), args, body)
print(code)
# Build
print('Building ...')
func = jit.build('test_func', args, code)
# Test correctness
print('Running ...')
fp8_tensor = torch.empty((1, ), dtype=torch.float8_e4m3fn, device='cuda')
fp32_tensor = torch.empty((1, ), dtype=torch.float, device='cuda')
bf16_tensor = torch.empty((1, ), dtype=torch.bfloat16, device='cuda')
with Capture() as capture:
assert func(fp8_tensor, fp8_tensor, fp32_tensor, bf16_tensor, True, torch.cuda.current_stream()) == 0
output = capture.capture()
ref_output = f'{fp8_tensor.data_ptr()}\n{fp8_tensor.data_ptr()}\n{fp32_tensor.data_ptr()}\n{bf16_tensor.data_ptr()}\n1\n{torch.cuda.current_stream().cuda_stream}\n'
assert output == ref_output, f'{output=}, {ref_output=}'
print('JIT test passed')


@@ -0,0 +1,12 @@
# Hopper FlashMLA - Examples
The code in this example is migrated from [FlashMLA](https://github.com/deepseek-ai/FlashMLA/tree/main); it implements an efficient MLA decoding kernel for Hopper GPUs.
# Run the example
### Install
```
python setup.py install
```
### Run the test
```
python tests/test_flash_mla.py
```
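### Usage sketch
After installation, the kernel is driven through the two bindings exposed by the extension (`get_mla_metadata` and `fwd_kvcache_mla`). The sketch below is illustrative only: the module name `flash_mla_cuda` and the shapes are assumptions, not part of this example.
```
import torch
import flash_mla_cuda  # hypothetical name of the extension built by setup.py

b, s_q, h_q, h_kv, d, d_v = 2, 1, 128, 1, 576, 512              # MLA decode shapes
page_block_size, blocks_per_seq = 64, 32

q = torch.randn(b, s_q, h_q, d, device='cuda', dtype=torch.bfloat16)
kcache = torch.randn(b * blocks_per_seq, page_block_size, h_kv, d,
                     device='cuda', dtype=torch.bfloat16)        # shared K/V cache
seqlens_k = torch.full((b,), 1024, device='cuda', dtype=torch.int32)
block_table = torch.arange(b * blocks_per_seq, device='cuda',
                           dtype=torch.int32).view(b, blocks_per_seq)

# Tile-scheduler metadata is computed once per (seqlens_k, head configuration).
metadata, num_splits = flash_mla_cuda.get_mla_metadata(seqlens_k, s_q * h_q // h_kv, h_kv)
out, lse = flash_mla_cuda.fwd_kvcache_mla(
    q, kcache, None, d_v, seqlens_k, block_table,
    d ** -0.5, True, metadata, num_splits)
print(out.shape, lse.shape)  # (b, s_q, h_q, d_v), (b, h_q, s_q)
```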


@@ -0,0 +1,213 @@
// Adapted from https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/flash_api.cpp
#include <torch/python.h>
#include <torch/nn/functional.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <cutlass/fast_math.h>
#include "flash_mla.h"
#include "static_switch.h"
#define CHECK_DEVICE(x) TORCH_CHECK(x.is_cuda(), #x " must be on CUDA")
#define CHECK_SHAPE(x, ...) TORCH_CHECK(x.sizes() == torch::IntArrayRef({__VA_ARGS__}), #x " must have shape (" #__VA_ARGS__ ")")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
std::vector<at::Tensor>
get_mla_metadata(
at::Tensor &seqlens_k,
const int num_heads_per_head_k,
const int num_heads_k
) {
// This should match the logic in the MLA kernel.
static constexpr int block_size_m = 64;
static constexpr int block_size_n = 64;
static constexpr int fixed_overhead_num_blocks = 5;
CHECK_DEVICE(seqlens_k);
TORCH_CHECK(seqlens_k.is_contiguous());
TORCH_CHECK(seqlens_k.dtype() == torch::kInt32);
int batch_size = seqlens_k.size(0);
int *seqlens_k_ptr = seqlens_k.data_ptr<int>();
auto options = seqlens_k.options();
auto dprops = at::cuda::getCurrentDeviceProperties();
int sm_count = dprops->multiProcessorCount;
int num_sm_parts = sm_count / num_heads_k / cutlass::ceil_div(num_heads_per_head_k, block_size_m);
auto tile_scheduler_metadata = torch::empty({num_sm_parts, TileSchedulerMetaDataSize}, options);
auto num_splits = torch::empty({batch_size + 1}, options);
int *tile_scheduler_metadata_ptr = tile_scheduler_metadata.data_ptr<int>();
int *num_splits_ptr = num_splits.data_ptr<int>();
at::cuda::CUDAGuard device_guard{(char)seqlens_k.get_device()};
auto stream = at::cuda::getCurrentCUDAStream().stream();
Mla_metadata_params params = {};
params.seqlens_k_ptr = seqlens_k_ptr;
params.tile_scheduler_metadata_ptr = tile_scheduler_metadata_ptr;
params.num_splits_ptr = num_splits_ptr;
params.batch_size = batch_size;
params.block_size_n = block_size_n;
params.fixed_overhead_num_blocks = fixed_overhead_num_blocks;
params.num_sm_parts = num_sm_parts;
get_mla_metadata_func(params, stream);
return {tile_scheduler_metadata, num_splits};
}
std::vector<at::Tensor>
mha_fwd_kvcache_mla(
at::Tensor &q, // batch_size x seqlen_q x num_heads x head_size
const at::Tensor &kcache, // num_blocks x page_block_size x num_heads_k x head_size
std::optional<const at::Tensor> &vcache_, // num_blocks x page_block_size x num_heads_k x head_size_v
const int head_size_v,
const at::Tensor &seqlens_k, // batch_size
const at::Tensor &block_table, // batch_size x max_num_blocks_per_seq
const float softmax_scale,
bool is_causal,
const at::Tensor &tile_scheduler_metadata, // num_sm_parts x TileSchedulerMetaDataSize
const at::Tensor &num_splits // batch_size + 1
) {
auto dprops = at::cuda::getCurrentDeviceProperties();
bool is_sm90 = dprops->major == 9 && dprops->minor == 0;
TORCH_CHECK(is_sm90);
at::Tensor vcache = vcache_.has_value() ? vcache_.value() : kcache;
auto q_dtype = q.dtype();
TORCH_CHECK(kcache.dtype() == q_dtype, "query and key must have the same dtype");
CHECK_DEVICE(q); CHECK_DEVICE(kcache); CHECK_DEVICE(vcache);
TORCH_CHECK(q.stride(-1) == 1, "Input tensor must have contiguous last dimension");
TORCH_CHECK(kcache.stride(-1) == 1, "Input tensor must have contiguous last dimension");
TORCH_CHECK(vcache.stride(-1) == 1, "Input tensor must have contiguous last dimension");
CHECK_DEVICE(block_table);
TORCH_CHECK(block_table.dtype() == torch::kInt32, "block_table must have dtype torch.int32");
TORCH_CHECK(block_table.stride(-1) == 1, "block_table must have contiguous last dimension");
const auto sizes = q.sizes();
const int batch_size = sizes[0];
const int seqlen_q_ori = sizes[1];
const int num_heads_ori = sizes[2];
const int head_size = sizes[3];
TORCH_CHECK(head_size % 8 == 0, "head_size should be a multiple of 8");
TORCH_CHECK(head_size_v % 32 == 0, "head_size_v should be a multiple of 32");
const int max_num_blocks_per_seq = block_table.size(1);
const int num_blocks = kcache.size(0);
const int page_block_size = kcache.size(1);
const int num_heads_k = kcache.size(2);
TORCH_CHECK(batch_size > 0, "batch size must be positive");
TORCH_CHECK(num_heads_ori % num_heads_k == 0, "Number of heads in key/value must divide number of heads in query");
if (seqlen_q_ori == 1) { is_causal = false; }
const int ngroups = num_heads_ori / num_heads_k;
const int seqlen_q = seqlen_q_ori * ngroups;
const int num_heads = num_heads_k;
q = q.view({batch_size, seqlen_q_ori, num_heads_k, ngroups, head_size}).transpose(2, 3)
.reshape({batch_size, seqlen_q, num_heads, head_size});
int head_size_k = head_size;
CHECK_SHAPE(q, batch_size, seqlen_q, num_heads, head_size);
CHECK_SHAPE(kcache, num_blocks, page_block_size, num_heads_k, head_size_k);
if (vcache_.has_value()) { CHECK_SHAPE(vcache, num_blocks, page_block_size, num_heads_k, head_size_v); }
CHECK_SHAPE(block_table, batch_size, max_num_blocks_per_seq);
TORCH_CHECK(seqlens_k.dtype() == torch::kInt32, "seqlens_k must have dtype int32");
CHECK_DEVICE(seqlens_k);
CHECK_CONTIGUOUS(seqlens_k);
CHECK_SHAPE(seqlens_k, batch_size);
at::cuda::CUDAGuard device_guard{(char)q.get_device()};
auto opts = q.options();
at::Tensor out = torch::empty({batch_size, seqlen_q, num_heads, head_size_v}, opts);
at::Tensor softmax_lse = torch::empty({batch_size, num_heads, seqlen_q}, opts.dtype(at::kFloat));
Flash_fwd_mla_params params = {};
// Set the sizes.
params.b = batch_size;
params.seqlen_q = seqlen_q;
params.cu_seqlens_k = seqlens_k.data_ptr<int>();
params.h = num_heads;
params.h_h_k_ratio = num_heads / num_heads_k;
params.ngroups = ngroups;
params.is_causal = is_causal;
params.d = head_size;
params.d_v = head_size_v;
params.scale_softmax = softmax_scale;
params.scale_softmax_log2 = float(softmax_scale * M_LOG2E);
// Set the pointers and strides.
params.q_ptr = q.data_ptr();
params.k_ptr = kcache.data_ptr();
params.v_ptr = vcache.data_ptr();
params.o_ptr = out.data_ptr();
params.softmax_lse_ptr = softmax_lse.data_ptr();
// All strides are in elements, not bytes.
params.q_batch_stride = q.stride(0);
params.k_batch_stride = kcache.stride(0);
params.v_batch_stride = vcache.stride(0);
params.o_batch_stride = out.stride(0);
params.q_row_stride = q.stride(-3);
params.k_row_stride = kcache.stride(-3);
params.v_row_stride = vcache.stride(-3);
params.o_row_stride = out.stride(-3);
params.q_head_stride = q.stride(-2);
params.k_head_stride = kcache.stride(-2);
params.v_head_stride = vcache.stride(-2);
params.o_head_stride = out.stride(-2);
params.block_table = block_table.data_ptr<int>();
params.block_table_batch_stride = block_table.stride(0);
params.page_block_size = page_block_size;
TORCH_CHECK(tile_scheduler_metadata.dtype() == torch::kInt32, "tile_scheduler_metadata must have dtype int32");
TORCH_CHECK(tile_scheduler_metadata.size(1) == TileSchedulerMetaDataSize);
CHECK_DEVICE(tile_scheduler_metadata);
CHECK_CONTIGUOUS(tile_scheduler_metadata);
params.tile_scheduler_metadata_ptr = tile_scheduler_metadata.data_ptr<int>();
params.num_sm_parts = tile_scheduler_metadata.size(0);
TORCH_CHECK(num_splits.dtype() == torch::kInt32, "num_splits must have dtype int32");
CHECK_DEVICE(num_splits);
CHECK_CONTIGUOUS(num_splits);
params.num_splits_ptr = num_splits.data_ptr<int>();
at::Tensor softmax_lse_accum = torch::empty({batch_size + params.num_sm_parts, num_heads, seqlen_q}, opts.dtype(at::kFloat));
at::Tensor out_accum = torch::empty({batch_size + params.num_sm_parts, num_heads, seqlen_q, head_size_v}, opts.dtype(at::kFloat));
params.softmax_lseaccum_ptr = softmax_lse_accum.data_ptr();
params.oaccum_ptr = out_accum.data_ptr();
auto stream = at::cuda::getCurrentCUDAStream().stream();
TORCH_CHECK(head_size == 576);
if (q_dtype == torch::kBFloat16) {
run_mha_fwd_splitkv_mla<cutlass::bfloat16_t, 576>(params, stream);
}
#ifndef FLASH_MLA_DISABLE_FP16
else if (q_dtype == torch::kHalf) {
run_mha_fwd_splitkv_mla<cutlass::half_t, 576>(params, stream);
}
#endif
else {
TORCH_CHECK(false, "Unsupported tensor dtype for query");
}
out = out.view({batch_size, seqlen_q_ori, ngroups, num_heads_k, head_size_v}).transpose(2, 3)
.reshape({batch_size, seqlen_q_ori, num_heads_ori, head_size_v});
softmax_lse = softmax_lse.view({batch_size, num_heads_k, seqlen_q_ori, ngroups}).transpose(2, 3)
.reshape({batch_size, num_heads_ori, seqlen_q_ori});
return {out, softmax_lse};
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.doc() = "FlashMLA";
m.def("get_mla_metadata", &get_mla_metadata);
m.def("fwd_kvcache_mla", &mha_fwd_kvcache_mla);
}
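Note that `mha_fwd_kvcache_mla` folds the query head groups into the sequence dimension before launching, so the kernel always runs with `num_heads_k` heads, and unfolds the output and LSE afterwards. Schematically, with g = num_heads / num_heads_k:
```
\[
Q:\ (B,\ S_q,\ g H_k,\ D) \;\longrightarrow\; Q':\ (B,\ S_q\,g,\ H_k,\ D),
\qquad
O':\ (B,\ S_q\,g,\ H_k,\ D_v) \;\longrightarrow\; O:\ (B,\ S_q,\ g H_k,\ D_v).
\]
```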


@@ -0,0 +1,3 @@
#include "flash_fwd_mla_kernel.h"
template void run_mha_fwd_splitkv_mla<cutlass::bfloat16_t, 576>(Flash_fwd_mla_params &params, cudaStream_t stream);


@@ -0,0 +1,3 @@
#include "flash_fwd_mla_kernel.h"
template void run_mha_fwd_splitkv_mla<cutlass::half_t, 576>(Flash_fwd_mla_params &params, cudaStream_t stream);


@@ -0,0 +1,603 @@
#pragma once
#include <cute/tensor.hpp>
#include <cutlass/cutlass.h>
#include <cutlass/array.h>
#include <cutlass/numeric_types.h>
using namespace cute;
#include "named_barrier.h"
#include "utils.h"
#include "softmax.h"
#include "static_switch.h"
#include "flash_mla.h"
template<typename PrecType, int DIM, int DIM2 = DIM>
constexpr auto getSmemLayoutK() {
constexpr int headSizeBytes = sizeof(PrecType) * DIM;
constexpr int headSizeBytes2 = sizeof(PrecType) * DIM2;
if constexpr (headSizeBytes % 128 == 0 && headSizeBytes2 % 128 == 0) {
return GMMA::Layout_K_SW128_Atom<PrecType>{};
} else if constexpr (headSizeBytes % 64 == 0 && headSizeBytes2 % 64 == 0) {
return GMMA::Layout_K_SW64_Atom<PrecType>{};
} else {
return GMMA::Layout_K_SW32_Atom<PrecType>{};
}
}
template<int kHeadDim_, int kBlockM_, int kBlockN_, int kNWarps_, typename elem_type=cutlass::bfloat16_t, int kHeadDimV_ = 0>
struct Flash_fwd_kernel_traits_mla {
using Element = elem_type;
using ElementAccum = float;
using index_t = int64_t;
static constexpr int kNWarps = kNWarps_;
static constexpr int kNThreads = kNWarps * 32;
static constexpr int kNWarpsS = 4;
static constexpr int kNThreadsS = kNWarpsS * 32;
static constexpr int kBlockM = kBlockM_;
static constexpr int kBlockN = kBlockN_;
static constexpr int kHeadDim = kHeadDim_;
static_assert(kHeadDim % 32 == 0);
static constexpr int kHeadDimV = kHeadDimV_ != 0 ? kHeadDimV_ : kHeadDim;
static_assert(kHeadDimV % 32 == 0);
static_assert(kHeadDimV <= kHeadDim);
static constexpr int kBlockKSmem = kHeadDim % 64 == 0 ? 64 : 32;
static constexpr int kSwizzle = kBlockKSmem == 32 ? 2 : 3;
using TiledMma = decltype(make_tiled_mma(
cute::GMMA::ss_op_selector<Element, Element, ElementAccum, Shape<Int<kBlockM>, Int<kBlockN>, Int<kHeadDim>>,
GMMA::Major::K, GMMA::Major::K>(),
Layout<Shape<Int<kNWarpsS / 4>, _1, _1>>{}));
static constexpr int AtomLayoutNO = kNThreads / kNThreadsS;
using TiledMmaO = decltype(make_tiled_mma(
cute::GMMA::rs_op_selector<Element, Element, ElementAccum, Shape<Int<kBlockM>, Int<kHeadDimV / AtomLayoutNO>, Int<kBlockN>>,
GMMA::Major::K, GMMA::Major::MN>(),
Layout<Shape<Int<kNWarpsS / 4>, Int<AtomLayoutNO>, _1>>{}));
using SmemLayoutQ = decltype(tile_to_shape(
getSmemLayoutK<Element, kHeadDim>(),
Shape<Int<kBlockM>, Int<kHeadDim>>{}));
using SmemLayoutK = decltype(tile_to_shape(
getSmemLayoutK<Element, kHeadDim, kHeadDimV>(),
Shape<Int<kBlockN>, Int<kHeadDim>>{}));
using SmemLayoutV = decltype(tile_to_shape(
getSmemLayoutK<Element, kHeadDim, kHeadDimV>(),
Shape<Int<kBlockN>, Int<kHeadDimV>>{}));
using SmemLayoutVtransposed = decltype(composition(SmemLayoutV{}, make_layout(Shape<Int<kHeadDimV>, Int<kBlockN>>{}, GenRowMajor{})));
using SmemLayoutP = Layout<Shape<Shape<_2, _2>, Int<kNThreadsS>, _1, Int<kBlockN / 8>>>;
using SmemLayoutRow = Layout<Shape<_2, Int<kNThreadsS>>, Stride<_1, _2>>;
using SmemLayoutAtomO = decltype(composition(
Swizzle<kSwizzle, 3, 3>{},
Layout<Shape<Int<8>, Int<kBlockKSmem>>, Stride<Int<kBlockKSmem>, _1>>{}));
using SmemLayoutO = decltype(tile_to_shape(
SmemLayoutAtomO{},
Shape<Int<kBlockM>, Int<kHeadDimV>>{}));
using SmemCopyAtomO = Copy_Atom<SM90_U32x4_STSM_N, Element>;
using SmemCopyAtomOaccum = Copy_Atom<AutoVectorizingCopyWithAssumedAlignment<128>, ElementAccum>;
static constexpr int kGmemElemsPerLoad = sizeof(cute::uint128_t) / sizeof(Element);
static_assert(kHeadDim % kGmemElemsPerLoad == 0, "kHeadDim must be a multiple of kGmemElemsPerLoad");
static constexpr int kGmemThreadsPerRow = kBlockKSmem / kGmemElemsPerLoad;
using Gmem_copy_struct = SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>;
static constexpr int kNThreadsLoad = kNThreads - kNThreadsS;
static_assert(kNThreadsLoad % kGmemThreadsPerRow == 0, "kNThreads must be a multiple of kGmemThreadsPerRow");
using GmemLayoutAtom = Layout<
Shape<Int<kNThreadsLoad / kGmemThreadsPerRow>, Int<kGmemThreadsPerRow>>,
Stride<Int<kGmemThreadsPerRow>, _1>>;
using GmemTiledCopy = decltype(make_tiled_copy(
Copy_Atom<Gmem_copy_struct, Element>{},
GmemLayoutAtom{},
Layout<Shape<_1, _8>>{})); // Val layout, 8 vals per read
using GmemLayoutAtomO = Layout<
Shape<Int<kNThreadsS / kGmemThreadsPerRow>, Int<kGmemThreadsPerRow>>,
Stride<Int<kGmemThreadsPerRow>, _1>>;
using GmemTiledCopyO = decltype(make_tiled_copy(
Copy_Atom<AutoVectorizingCopyWithAssumedAlignment<128>, Element>{},
GmemLayoutAtomO{},
Layout<Shape<_1, _8>>{})); // Val layout, 8 vals per store
static constexpr int kGmemElemsPerLoadAccum = sizeof(cute::uint128_t) / sizeof(ElementAccum);
static constexpr int kGmemThreadsPerRowAccum = kBlockKSmem / kGmemElemsPerLoadAccum;
using GmemLayoutAtomOaccum = Layout<
Shape<Int<kNThreadsS / kGmemThreadsPerRowAccum>, Int<kGmemThreadsPerRowAccum>>,
Stride<Int<kGmemThreadsPerRowAccum>, _1>>;
using GmemTiledCopyOaccum = decltype(make_tiled_copy(
Copy_Atom<AutoVectorizingCopyWithAssumedAlignment<128>, ElementAccum>{},
GmemLayoutAtomOaccum{},
Layout<Shape<_1, _4>>{})); // Val layout, 4 vals per store
};
namespace flash {
using namespace cute;
template<typename Kernel_traits>
struct SharedStorageMLA {
union {
struct {
cute::array_aligned<typename Kernel_traits::Element, cute::cosize_v<typename Kernel_traits::SmemLayoutQ>> smem_q;
cute::array_aligned<typename Kernel_traits::Element, cute::cosize_v<typename Kernel_traits::SmemLayoutK> * 2> smem_k; // Double buffer
cute::array_aligned<typename Kernel_traits::Element, cute::cosize_v<typename Kernel_traits::SmemLayoutP>> smem_p;
cute::array_aligned<typename Kernel_traits::ElementAccum, cute::cosize_v<typename Kernel_traits::SmemLayoutRow>> smem_scale;
};
struct {
cute::array_aligned<typename Kernel_traits::ElementAccum, cute::cosize_v<typename Kernel_traits::SmemLayoutRow>> smem_max;
cute::array_aligned<typename Kernel_traits::ElementAccum, cute::cosize_v<typename Kernel_traits::SmemLayoutRow>> smem_sum;
cute::array_aligned<typename Kernel_traits::ElementAccum, cute::cosize_v<typename Kernel_traits::SmemLayoutO>> smem_o;
};
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename Kernel_traits, bool Split, typename SharedStorage, typename AccO, typename Softmax>
__forceinline__ __device__ void store(const Flash_fwd_mla_params &params, const int bidb, const int bidh, const int m_block, const int n_split_idx,
SharedStorage &shared_storage, AccO tOrO, Softmax softmax) {
constexpr int kBlockM = Kernel_traits::kBlockM;
constexpr int kHeadDimV = Kernel_traits::kHeadDimV;
constexpr int kNThreadsS = Kernel_traits::kNThreadsS;
using Element = typename Kernel_traits::Element;
using ElementAccum = typename Kernel_traits::ElementAccum;
using index_t = typename Kernel_traits::index_t;
const int tidx = threadIdx.x;
typename Kernel_traits::TiledMmaO tiled_mma_o;
auto thr_mma_o = tiled_mma_o.get_thread_slice(tidx);
// Epilogue
const int split_offset = __ldg(params.num_splits_ptr + bidb);
Tensor lse = softmax.template normalize_softmax_lse</*Is_dropout=*/false, Split>(tOrO, params.scale_softmax);
using ElementO = std::conditional_t<!Split, Element, ElementAccum>;
Tensor sOaccum = make_tensor(make_smem_ptr(reinterpret_cast<ElementO *>(shared_storage.smem_o.data())), typename Kernel_traits::SmemLayoutO{}); // (SMEM_M,SMEM_N)
// Partition sO to match the accumulator partitioning
using SmemTiledCopyO = std::conditional_t<
!Split,
typename Kernel_traits::SmemCopyAtomO,
typename Kernel_traits::SmemCopyAtomOaccum
>;
auto smem_tiled_copy_Oaccum = make_tiled_copy_C(SmemTiledCopyO{}, tiled_mma_o);
auto smem_thr_copy_Oaccum = smem_tiled_copy_Oaccum.get_thread_slice(tidx);
Tensor rO = flash::convert_type<ElementO>(tOrO);
Tensor taccOrOaccum = smem_thr_copy_Oaccum.retile_S(rO); // ((Atom,AtomNum), MMA_M, MMA_N)
Tensor taccOsOaccum = smem_thr_copy_Oaccum.partition_D(sOaccum); // ((Atom,AtomNum),PIPE_M,PIPE_N)
__syncthreads();
cute::copy(smem_tiled_copy_Oaccum, taccOrOaccum, taccOsOaccum);
const index_t row_offset_o = bidb * params.o_batch_stride + m_block * kBlockM * params.o_row_stride + bidh * params.o_head_stride;
const index_t row_offset_oaccum = (((split_offset + n_split_idx) * params.h + bidh) * params.seqlen_q + m_block * kBlockM) * params.d_v;
const index_t row_offset_lse = (bidb * params.h + bidh) * params.seqlen_q + m_block * kBlockM;
const index_t row_offset_lseaccum = ((split_offset + n_split_idx) * params.h + bidh) * params.seqlen_q + m_block * kBlockM;
Tensor gOaccum = make_tensor(make_gmem_ptr(reinterpret_cast<ElementO *>(Split ? params.oaccum_ptr : params.o_ptr) + (Split ? row_offset_oaccum : row_offset_o)),
Shape<Int<kBlockM>, Int<kHeadDimV>>{},
make_stride(Split ? kHeadDimV : params.o_row_stride, _1{}));
Tensor gLSEaccum = make_tensor(make_gmem_ptr(reinterpret_cast<ElementAccum *>(Split ? params.softmax_lseaccum_ptr : params.softmax_lse_ptr) + (Split ? row_offset_lseaccum : row_offset_lse)),
Shape<Int<kBlockM>>{}, Stride<_1>{});
using GmemTiledCopyO = std::conditional_t<!Split, typename Kernel_traits::GmemTiledCopyO, typename Kernel_traits::GmemTiledCopyOaccum>;
GmemTiledCopyO gmem_tiled_copy_Oaccum;
auto gmem_thr_copy_Oaccum = gmem_tiled_copy_Oaccum.get_thread_slice(tidx);
Tensor tOsOaccum = gmem_thr_copy_Oaccum.partition_S(sOaccum); // ((Atom,AtomNum),ATOM_M,ATOM_N)
Tensor tOgOaccum = gmem_thr_copy_Oaccum.partition_D(gOaccum);
__syncthreads();
if (tidx >= kNThreadsS) { return; }
Tensor tOrOaccum = make_tensor<ElementO>(shape(tOgOaccum));
cute::copy(gmem_tiled_copy_Oaccum, tOsOaccum, tOrOaccum);
Tensor caccO = make_identity_tensor(Shape<Int<kBlockM>, Int<kHeadDimV>>{}); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor taccOcO = thr_mma_o.partition_C(caccO); // ((MMA=4, X), MMA_M, MMA_K=1)
Tensor taccOcO_row = taccOcO(make_coord(0, _, 0), _, 0);
CUTE_STATIC_ASSERT_V(size(lse) == size(taccOcO_row)); // MMA_M
if (get<1>(taccOcO_row(0)) == 0) {
#pragma unroll
for (int mi = 0; mi < size(lse); ++mi) {
const int row = get<0>(taccOcO_row(mi));
if (row < params.seqlen_q - m_block * kBlockM) { gLSEaccum(row) = lse(mi); }
}
}
// Construct identity layout for sO
Tensor cO = make_identity_tensor(make_shape(size<0>(sOaccum), size<1>(sOaccum))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
// Repeat the partitioning with identity layouts
Tensor tOcO = gmem_thr_copy_Oaccum.partition_D(cO); // (ACPY,ACPY_M,ACPY_K) -> (blk_m,blk_k)
Tensor tOpO = make_tensor<bool>(make_shape(size<2>(tOgOaccum)));
// Clear_OOB_K must be false since we don't want to write zeros to gmem
flash::copy</*Is_even_MN=*/false, /*Is_even_K=*/true, /*Clear_OOB_MN=*/false, /*Clear_OOB_K=*/false>(
gmem_tiled_copy_Oaccum, tOrOaccum, tOgOaccum, tOcO, tOpO, params.seqlen_q - m_block * kBlockM
);
}
template<typename Kernel_traits, bool Is_causal, typename SharedStorage>
__forceinline__ __device__ void compute_attn_1rowblock_splitkv_mla(const Flash_fwd_mla_params &params,
const int bidb, const int bidh, const int m_block,
const int n_split_idx, const int seqlen_k,
const int n_block_min, const int n_block_max, const bool NoSplit,
SharedStorage &shared_storage) {
constexpr int kBlockM = Kernel_traits::kBlockM;
constexpr int kBlockN = Kernel_traits::kBlockN;
constexpr int kHeadDim = Kernel_traits::kHeadDim;
constexpr int kHeadDimV = Kernel_traits::kHeadDimV;
constexpr int kNThreads = Kernel_traits::kNThreads;
constexpr int kNThreadsS = Kernel_traits::kNThreadsS;
static_assert(kNThreads == 256 and kNThreadsS == 128);
using Element = typename Kernel_traits::Element;
using index_t = typename Kernel_traits::index_t;
const int tidx = threadIdx.x;
int n_block = n_block_max - 1;
Tensor sQ = make_tensor(make_smem_ptr(shared_storage.smem_q.data()), typename Kernel_traits::SmemLayoutQ{});
Tensor sK = make_tensor(make_smem_ptr(shared_storage.smem_k.data()), typename Kernel_traits::SmemLayoutK{});
Tensor sV = make_tensor(make_smem_ptr(shared_storage.smem_k.data()), typename Kernel_traits::SmemLayoutV{});
Tensor sVt = make_tensor(make_smem_ptr(shared_storage.smem_k.data()), typename Kernel_traits::SmemLayoutVtransposed{});
Tensor sP = make_tensor(make_smem_ptr(shared_storage.smem_p.data()), typename Kernel_traits::SmemLayoutP{});
Tensor tPsP = sP(_, tidx % kNThreadsS, _, _);
Tensor sScale_o = make_tensor(make_smem_ptr(shared_storage.smem_scale.data()), typename Kernel_traits::SmemLayoutRow{});
Tensor tScale_osScale_o = sScale_o(_, tidx % kNThreadsS);
Tensor sRow_max = make_tensor(make_smem_ptr(shared_storage.smem_max.data()), typename Kernel_traits::SmemLayoutRow{});
Tensor tRow_maxsRow_max = sRow_max(_, tidx % kNThreadsS);
Tensor sRow_sum = make_tensor(make_smem_ptr(shared_storage.smem_sum.data()), typename Kernel_traits::SmemLayoutRow{});
Tensor tRow_sumsRow_sum = sRow_sum(_, tidx % kNThreadsS);
typename Kernel_traits::TiledMmaO tiled_mma_o;
auto thr_mma_o = tiled_mma_o.get_thread_slice(tidx);
Tensor tOrVt = thr_mma_o.partition_fragment_B(sVt); // (MMA, MMA_K,MMA_N)
Tensor tOrO = partition_fragment_C(tiled_mma_o, Shape<Int<kBlockM>, Int<kHeadDimV>>{}); // ((MMA=4, X), MMA_M, MMA_N=1)
clear(tOrO);
flash::Softmax<2 * size<1>(tOrO)> softmax;
int warp_group_idx = cutlass::canonical_warp_group_idx();
if (warp_group_idx == 0) {
typename Kernel_traits::TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_thread_slice(tidx);
Tensor tSrQ = thr_mma.partition_fragment_A(sQ); // (MMA,MMA_M,MMA_K)
Tensor tSrK = thr_mma.partition_fragment_B(sK); // (MMA,MMA_N,MMA_K)
if (n_block % 2 == 1) {
// Double buffer for sK
constexpr int sK_offset = size(sK);
tSrK.data() = tSrK.data() + sK_offset / 8;
tOrVt.data() = tOrVt.data() + sK_offset / 8;
}
// We need masking on S for the very last block when the K and V length is not a multiple of kBlockN.
// We also need masking on S if it's causal, for the last ceil_div(kBlockM, kBlockN) blocks.
// We will have at least 1 "masking" iteration.
// If not even_N, then seqlen_k might end in the middle of a block. In that case we need to
// mask 2 blocks (e.g. when kBlockM == kBlockN), not just 1.
constexpr int n_masking_steps = !Is_causal ? 1 : cute::ceil_div(kBlockM, kBlockN) + 1;
#pragma unroll 1
for (int masking_step = n_masking_steps; n_block >= n_block_min; --masking_step, --n_block) {
__syncthreads();
Tensor tSrS = partition_fragment_C(tiled_mma, Shape<Int<kBlockM>, Int<kBlockN>>{}); // ((MMA=4, X), MMA_M, MMA_N=1)
flash::gemm</*zero_init=*/true, /*wg_wait=*/0>(tiled_mma, tSrQ, tSrK, tSrS);
const bool is_masking_step = masking_step > 0;
const bool is_first_masking_step = masking_step == n_masking_steps;
if (is_masking_step) {
Tensor cS = make_identity_tensor(Shape<Int<kBlockM>, Int<kBlockN>>{});
Tensor tScS = thr_mma.partition_C(cS);
#pragma unroll
for (int i = 0; i < size(tSrS); ++i) {
if constexpr (!Is_causal) { // Just masking based on col
if (int(get<1>(tScS(i))) >= int(seqlen_k - n_block * kBlockN)) tSrS(i) = -INFINITY;
} else {
// Ensure seqlen_k - 1 - (n_block * kBlockN + col) >= (seqlen_q - 1 - (m_block * kBlockM + row)) / ngroups
// col <= seqlen_k - 1 - n_block * kBlockN - (seqlen_q - 1 - (m_block * kBlockM + row)) / ngroups
int row = int(get<0>(tScS(i)));
int col_limit_right = seqlen_k - 1 - n_block * kBlockN - (params.seqlen_q - 1 - (m_block * kBlockM + row)) / params.ngroups;
if (int(get<1>(tScS(i))) > col_limit_right) tSrS(i) = -INFINITY;
}
}
}
// We have key_padding_mask so we'll need to Check_inf
Tensor scale_o = is_first_masking_step
? softmax.template softmax</*Is_first=*/true, /*Check_inf=*/Is_causal>(tSrS, params.scale_softmax_log2)
: is_masking_step ?
softmax.template softmax</*Is_first=*/false, /*Check_inf=*/Is_causal>(tSrS, params.scale_softmax_log2)
: softmax.template softmax</*Is_first=*/false, /*Check_inf=*//*Is_local=*/false>(tSrS, params.scale_softmax_log2);
Tensor rP = flash::convert_type<Element>(tSrS);
cute::copy(rP, tPsP);
cute::copy(scale_o, tScale_osScale_o);
cutlass::arch::NamedBarrier::arrive(kNThreads, static_cast<int>(NamedBarriers::SReady));
flash::rescale_o(tOrO, scale_o);
Tensor tOrP = make_tensor(rP.data(), flash::convert_layout_acc_Aregs<Kernel_traits::TiledMma>(rP.layout()));
flash::gemm</*zero_init=*/false, /*wg_wait=*/0>(tiled_mma_o, tOrP, tOrVt, tOrO);
// Double buffer for sK
const int sK_offset = n_block % 2 == 0 ? size(sK) : -size(sK);
tSrK.data() = tSrK.data() + sK_offset / 8;
tOrVt.data() = tOrVt.data() + sK_offset / 8;
}
cute::copy(softmax.row_max, tRow_maxsRow_max);
cute::copy(softmax.row_sum, tRow_sumsRow_sum);
cutlass::arch::NamedBarrier::arrive(kNThreads, static_cast<int>(NamedBarriers::SoftmaxReady));
} else {
const int *block_table = params.block_table + bidb * params.block_table_batch_stride;
int cur_block_table = __ldg(&block_table[n_block]);
const index_t row_offset_q = bidb * params.q_batch_stride + m_block * kBlockM * params.q_row_stride + bidh * params.q_head_stride;
Tensor gQ = make_tensor(make_gmem_ptr(reinterpret_cast<Element *>(params.q_ptr) + row_offset_q),
Shape<Int<kBlockM>, Int<kHeadDim>>{},
make_stride(params.q_row_stride, _1{}));
typename Kernel_traits::GmemTiledCopy gmem_tiled_copy_Q;
auto gmem_thr_copy_Q = gmem_tiled_copy_Q.get_thread_slice(tidx - kNThreadsS);
Tensor tQgQ = gmem_thr_copy_Q.partition_S(gQ);
Tensor tQsQ = gmem_thr_copy_Q.partition_D(sQ);
Tensor cQ = make_identity_tensor(make_shape(size<0>(sQ), size<1>(sQ))); // (BLK_M,BLK_K) -> (blk_m,blk_k)
Tensor tQcQ = gmem_thr_copy_Q.partition_S(cQ); // (ACPY,ACPY_M,ACPY_K) -> (blk_m,blk_k)
Tensor tQpQ = make_tensor<bool>(make_shape(size<2>(tQsQ)));
// We don't need to clear the sQ smem tiles since we'll only write out the valid outputs
flash::copy</*Is_even_MN=*/false, /*Is_even_K=*/true>(gmem_tiled_copy_Q, tQgQ, tQsQ, tQcQ, tQpQ,
params.seqlen_q - m_block * kBlockM);
const index_t row_offset_k = (bidh / params.h_h_k_ratio) * params.k_head_stride;
Tensor gK = make_tensor(make_gmem_ptr(reinterpret_cast<Element *>(params.k_ptr) + row_offset_k),
Shape<Int<kBlockN>, Int<kHeadDim>>{},
make_stride(params.k_row_stride, _1{}));
typename Kernel_traits::GmemTiledCopy gmem_tiled_copy_K;
auto gmem_thr_copy_K = gmem_tiled_copy_K.get_thread_slice(tidx - kNThreadsS);
Tensor tKgK = gmem_thr_copy_K.partition_S(gK);
Tensor tKsK = gmem_thr_copy_K.partition_D(sK);
Tensor cK = make_identity_tensor(make_shape(size<0>(sK), size<1>(sK))); // (BLK_N,BLK_K) -> (blk_n,blk_k)
Tensor tKcK = gmem_thr_copy_K.partition_S(cK); // (BCPY,BCPY_N,BCPY_K) -> (blk_n,blk_k)
Tensor tKpK = make_tensor<bool>(make_shape(size<2>(tKsK)));
if (n_block % 2 == 1) {
// Double buffer for sK
constexpr int sK_offset = size(sK);
tKsK.data() = tKsK.data() + sK_offset;
tOrVt.data() = tOrVt.data() + sK_offset / 8;
}
// We need to clear the sK smem tiles because K is V.
const index_t offset_k = cur_block_table * params.k_batch_stride;
tKgK.data() = tKgK.data() + offset_k;
flash::copy</*Is_even_MN=*/false, /*Is_even_K=*/true, /*Clear_OOB_MN=*/true>(gmem_tiled_copy_K, tKgK, tKsK, tKcK, tKpK,
seqlen_k - n_block * kBlockN);
tKgK.data() = tKgK.data() + -offset_k;
cute::cp_async_fence();
if (n_block - 1 >= n_block_min) {
cur_block_table = __ldg(&block_table[n_block - 1]);
}
#pragma unroll 1
for (; n_block >= n_block_min; --n_block) {
flash::cp_async_wait<0>();
__syncthreads();
if (n_block - 1 >= n_block_min) {
// Double buffer for sK
const int sK_offset = n_block % 2 == 0 ? size(sK) : -size(sK);
tKsK.data() = tKsK.data() + sK_offset;
const index_t offset_k = cur_block_table * params.k_batch_stride;
tKgK.data() = tKgK.data() + offset_k;
flash::copy</*Is_even_MN=*/true, /*Is_even_K=*/true>(gmem_tiled_copy_K, tKgK, tKsK, tKcK, tKpK);
tKgK.data() = tKgK.data() + -offset_k;
cute::cp_async_fence();
}
cutlass::arch::NamedBarrier::sync(kNThreads, static_cast<int>(NamedBarriers::SReady));
if (n_block - 2 >= n_block_min) {
cur_block_table = __ldg(&block_table[n_block - 2]);
}
typename Kernel_traits::TiledMma tiled_mma;
auto tSrS_layout = partition_fragment_C(tiled_mma, Shape<Int<kBlockM>, Int<kBlockN>>{}).layout();
Tensor rP = make_tensor<Element>(tSrS_layout);
Tensor scale_o = make_tensor<float>(Shape<_2>{});
cute::copy(tScale_osScale_o, scale_o);
cute::copy(tPsP, rP);
flash::rescale_o(tOrO, scale_o);
Tensor tOrP = make_tensor(rP.data(), flash::convert_layout_acc_Aregs<Kernel_traits::TiledMma>(rP.layout()));
flash::gemm</*zero_init=*/false, /*wg_wait=*/0>(tiled_mma_o, tOrP, tOrVt, tOrO);
// Double buffer for sK
const int sK_offset = n_block % 2 == 0 ? size(sK) : -size(sK);
tOrVt.data() = tOrVt.data() + sK_offset / 8;
}
cutlass::arch::NamedBarrier::sync(kNThreads, static_cast<int>(NamedBarriers::SoftmaxReady));
cute::copy(tRow_maxsRow_max, softmax.row_max);
cute::copy(tRow_sumsRow_sum, softmax.row_sum);
}
if (NoSplit)
store<Kernel_traits, false>(params, bidb, bidh, m_block, n_split_idx, shared_storage, tOrO, softmax);
else
store<Kernel_traits, true>(params, bidb, bidh, m_block, n_split_idx, shared_storage, tOrO, softmax);
}
template<typename Kernel_traits, bool Is_causal, typename SharedStorage>
__global__ void __launch_bounds__(Kernel_traits::kNThreads, 1, 1)
flash_fwd_splitkv_mla_kernel(__grid_constant__ const Flash_fwd_mla_params params) {
constexpr int kBlockN = Kernel_traits::kBlockN;
const int m_block = blockIdx.x;
const int bidh = blockIdx.y;
const int partition_idx = blockIdx.z;
extern __shared__ char shared_memory[];
auto &shared_storage = *reinterpret_cast<SharedStorage *>(shared_memory);
int *tile_scheduler_metadata_ptr = params.tile_scheduler_metadata_ptr + partition_idx * TileSchedulerMetaDataSize;
int4 tile_scheduler_metadata = __ldg(reinterpret_cast<int4 *>(tile_scheduler_metadata_ptr));
int begin_idx = tile_scheduler_metadata.x;
int begin_seqlen = tile_scheduler_metadata.y;
int end_idx = tile_scheduler_metadata.z;
int end_seqlen = tile_scheduler_metadata.w;
if (begin_idx >= params.b) return;
int begin_n_split_idx = __ldg(tile_scheduler_metadata_ptr + 4);
#pragma unroll 1
for (int batch_id = begin_idx; batch_id <= end_idx; ++batch_id) {
const int n_split_idx = batch_id == begin_idx ? begin_n_split_idx : 0;
const int seqlen_k = __ldg(params.cu_seqlens_k + batch_id);
const int n_block_min = batch_id == begin_idx ? begin_seqlen / kBlockN : 0;
const int n_block_max = batch_id == end_idx ? cute::ceil_div(end_seqlen, kBlockN) : cute::ceil_div(seqlen_k, kBlockN);
const bool NoSplit = n_block_min == 0 && n_block_max == cute::ceil_div(seqlen_k, kBlockN);
if (batch_id > begin_idx) {
__syncthreads(); // Barrier between two tiles.
}
flash::compute_attn_1rowblock_splitkv_mla<Kernel_traits, Is_causal>(params, batch_id, bidh, m_block, n_split_idx, seqlen_k, n_block_min, n_block_max, NoSplit, shared_storage);
}
}
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename Element, typename ElementAccum, typename index_t, int kHeadDimV, int kMaxSplits>
__global__ void __launch_bounds__(256, 1, 1)
flash_fwd_splitkv_mla_combine_kernel(__grid_constant__ const Flash_fwd_mla_params params) {
constexpr int kNThreads = 128;
const int tidx = threadIdx.x;
const int bidx = blockIdx.x;
const int hs = params.h * params.seqlen_q;
const int batch_idx = bidx / hs;
const int hs_idx = bidx % hs;
const int split_offset = __ldg(params.num_splits_ptr + batch_idx);
const int actual_num_splits = __ldg(params.num_splits_ptr + batch_idx + 1) - split_offset;
FLASH_DEVICE_ASSERT(actual_num_splits <= kMaxSplits);
if (actual_num_splits == 1) return;
__shared__ ElementAccum sLseScale[kMaxSplits];
const index_t row_offset_lseaccum = split_offset * hs + hs_idx;
const index_t row_offset_lse = bidx;
Tensor gLSEaccum = make_tensor(make_gmem_ptr(reinterpret_cast<ElementAccum *>(params.softmax_lseaccum_ptr) + row_offset_lseaccum),
Shape<Int<kMaxSplits>>{}, make_stride(hs));
Tensor gLSE = make_tensor(make_gmem_ptr(reinterpret_cast<ElementAccum *>(params.softmax_lse_ptr) + row_offset_lse),
Shape<_1>{}, Stride<_1>{});
int warp_idx = cutlass::canonical_warp_idx_sync();
if (warp_idx == 0) {
constexpr int kNLsePerThread = cute::ceil_div(kMaxSplits, 32);
float local_lse[kNLsePerThread];
for (int i = 0; i < kNLsePerThread; ++i) {
const int split = i * 32 + tidx;
local_lse[i] = split < actual_num_splits ? gLSEaccum(split) : -INFINITY;
}
float max_lse = -INFINITY;
for (int i = 0; i < kNLsePerThread; ++i) max_lse = max(max_lse, local_lse[i]);
for (int offset = 16; offset >= 1; offset /= 2) max_lse = max(max_lse, __shfl_xor_sync(uint32_t(-1), max_lse, offset));
max_lse = max_lse == -INFINITY ? 0.0f : max_lse; // In case all local LSEs are -inf
float sum_lse = 0;
for (int i = 0; i < kNLsePerThread; ++i) sum_lse = sum_lse + expf(local_lse[i] - max_lse);
for (int offset = 16; offset >= 1; offset /= 2) sum_lse = sum_lse + __shfl_xor_sync(uint32_t(-1), sum_lse, offset);
float global_lse = (sum_lse == 0.f || sum_lse != sum_lse) ? INFINITY : logf(sum_lse) + max_lse;
if (tidx == 0) gLSE(0) = global_lse;
for (int i = 0; i < kNLsePerThread; ++i) {
const int split = i * 32 + tidx;
if (split < actual_num_splits) sLseScale[split] = expf(local_lse[i] - global_lse);
}
}
__syncthreads();
static_assert(kHeadDimV % kNThreads == 0);
constexpr int Elements = kHeadDimV / kNThreads;
const index_t row_offset_oaccum = (split_offset * hs + hs_idx) * kHeadDimV;
Tensor gOaccum = make_tensor(make_gmem_ptr(reinterpret_cast<ElementAccum *>(params.oaccum_ptr) + row_offset_oaccum),
Shape<Int<kHeadDimV>>{}, Stride<_1>{});
using GmemTiledCopyOaccum = decltype(make_tiled_copy(
Copy_Atom<AutoVectorizingCopyWithAssumedAlignment<128>, ElementAccum>{},
Layout<Shape<Int<kNThreads>>>{},
Layout<Shape<Int<Elements>>>{}));
GmemTiledCopyOaccum gmem_tiled_copy_Oaccum;
auto gmem_thr_copy_Oaccum = gmem_tiled_copy_Oaccum.get_thread_slice(tidx);
Tensor tOgOaccum = gmem_thr_copy_Oaccum.partition_S(gOaccum);
Tensor tOrOaccum = make_tensor<ElementAccum>(shape(tOgOaccum));
Tensor tOrO = make_tensor<ElementAccum>(shape(tOgOaccum));
clear(tOrO);
for (int split = 0; split < actual_num_splits; ++split) {
cute::copy(tOgOaccum, tOrOaccum);
ElementAccum lse_scale = sLseScale[split];
for (int i = 0; i < size(tOrO); ++i) {
tOrO(i) += lse_scale * tOrOaccum(i);
}
tOgOaccum.data() = tOgOaccum.data() + hs * kHeadDimV;
}
Tensor rO = flash::convert_type<Element>(tOrO);
const int head_idx = (bidx - batch_idx * hs) / params.seqlen_q;
const int row = bidx - batch_idx * hs - head_idx * params.seqlen_q;
auto o_ptr = reinterpret_cast<Element *>(params.o_ptr) + batch_idx * params.o_batch_stride + head_idx * params.o_head_stride + row * params.o_row_stride;
Tensor gO = make_tensor(make_gmem_ptr(o_ptr + tidx * Elements), Shape<Int<decltype(size<0>(rO))::value>>{}, Stride<_1>{});
cute::copy(rO, gO);
}
} // namespace flash
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename Kernel_traits, typename SharedStorage>
void run_flash_splitkv_fwd_mla(Flash_fwd_mla_params &params, cudaStream_t stream) {
FLASH_ASSERT(params.page_block_size == Kernel_traits::kBlockN);
const int num_m_block = cute::ceil_div(params.seqlen_q, Kernel_traits::kBlockM);
BOOL_SWITCH(params.is_causal, Is_causal, [&] {
auto kernel = &flash::flash_fwd_splitkv_mla_kernel<Kernel_traits, Is_causal, SharedStorage>;
constexpr size_t smem_size = sizeof(SharedStorage);
CHECK_CUDA(cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size));
kernel<<<dim3(num_m_block, params.h, params.num_sm_parts), Kernel_traits::kNThreads, smem_size, stream>>>(params);
});
CHECK_CUDA_KERNEL_LAUNCH();
dim3 grid_combine(params.b * params.h * params.seqlen_q);
MLA_NUM_SPLITS_SWITCH(params.num_sm_parts, kMaxSplits, [&] {
auto combine_kernel = &flash::flash_fwd_splitkv_mla_combine_kernel<
typename Kernel_traits::Element, typename Kernel_traits::ElementAccum, typename Kernel_traits::index_t, Kernel_traits::kHeadDimV, kMaxSplits>;
combine_kernel<<<grid_combine, 128, 0, stream>>>(params);
});
CHECK_CUDA_KERNEL_LAUNCH();
}
template<typename T, int Headdim>
void run_mha_fwd_splitkv_mla(Flash_fwd_mla_params &params, cudaStream_t stream) {
static_assert(Headdim == 576);
FLASH_ASSERT(params.d_v == 512);
FLASH_ASSERT(params.k_ptr == params.v_ptr); // Shared_KV
using Kernel_traits = Flash_fwd_kernel_traits_mla<576, 64, 64, 8, T, 512>;
run_flash_splitkv_fwd_mla<Kernel_traits, flash::SharedStorageMLA<Kernel_traits>>(params, stream);
}
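When a sequence is covered by more than one split, `flash_fwd_splitkv_mla_combine_kernel` merges the per-split results with the standard log-sum-exp recombination. With ℓ_s and O_s the LSE and output of split s:
```
\[
m = \max_s \ell_s,
\qquad
\ell = m + \log \sum_s e^{\ell_s - m},
\qquad
O = \sum_s e^{\ell_s - \ell}\, O_s .
\]
```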


@@ -0,0 +1,77 @@
#include "flash_fwd_mla_kernel.h"
static constexpr int MaxBatchSize = 4096;
__global__ void __launch_bounds__(256, 1, 1)
get_mla_metadata_kernel(__grid_constant__ const Mla_metadata_params params) {
int *seqlens_k_ptr = params.seqlens_k_ptr;
int *tile_scheduler_metadata_ptr = params.tile_scheduler_metadata_ptr;
int *num_splits_ptr = params.num_splits_ptr;
int batch_size = params.batch_size;
int block_size_n = params.block_size_n;
int fixed_overhead_num_blocks = params.fixed_overhead_num_blocks;
int num_sm_parts = params.num_sm_parts;
__shared__ int num_blocks_shared[MaxBatchSize];
__shared__ int num_splits_shared[MaxBatchSize];
int total_num_blocks = 0;
for (int i = threadIdx.x; i < batch_size; i += 32) {
int num_blocks = cutlass::ceil_div(seqlens_k_ptr[i], block_size_n);
total_num_blocks += num_blocks + fixed_overhead_num_blocks;
num_blocks_shared[i] = num_blocks;
}
for (int offset = 16; offset >= 1; offset /= 2) {
total_num_blocks += __shfl_xor_sync(uint32_t(-1), total_num_blocks, offset);
}
__syncwarp();
if (threadIdx.x == 0) {
int payload = cutlass::ceil_div(total_num_blocks, num_sm_parts) + fixed_overhead_num_blocks;
int now_idx = 0, now_block = 0, now_n_split_idx = 0, cum_num_splits = 0;
num_splits_shared[0] = 0;
for (int i = 0; i < num_sm_parts; ++i) {
int tile_scheduler_metadata0[4], tile_scheduler_metadata1;
tile_scheduler_metadata0[0] = now_idx;
tile_scheduler_metadata0[1] = now_block * block_size_n;
tile_scheduler_metadata1 = now_n_split_idx;
int remain_payload = payload;
while (now_idx < batch_size) {
int num_blocks = num_blocks_shared[now_idx];
int now_remain_blocks = num_blocks - now_block;
if (remain_payload >= now_remain_blocks + fixed_overhead_num_blocks) {
cum_num_splits += now_n_split_idx + 1;
num_splits_shared[now_idx + 1] = cum_num_splits;
remain_payload -= now_remain_blocks + fixed_overhead_num_blocks;
++now_idx;
now_block = 0;
now_n_split_idx = 0;
} else {
if (remain_payload - fixed_overhead_num_blocks > 0) {
now_block += remain_payload - fixed_overhead_num_blocks;
++now_n_split_idx;
remain_payload = 0;
}
break;
}
}
tile_scheduler_metadata0[2] = now_block > 0 ? now_idx : now_idx - 1;
tile_scheduler_metadata0[3] = now_block > 0 ? now_block * block_size_n : seqlens_k_ptr[now_idx - 1];
*reinterpret_cast<int4 *>(tile_scheduler_metadata_ptr + i * TileSchedulerMetaDataSize) = *reinterpret_cast<int4 *>(tile_scheduler_metadata0);
tile_scheduler_metadata_ptr[i * TileSchedulerMetaDataSize + 4] = tile_scheduler_metadata1;
}
FLASH_DEVICE_ASSERT(now_idx == batch_size && now_block == 0 && now_n_split_idx == 0);
}
__syncwarp();
for (int i = threadIdx.x; i <= batch_size; i += 32) {
num_splits_ptr[i] = num_splits_shared[i];
}
}
void get_mla_metadata_func(Mla_metadata_params &params, cudaStream_t stream) {
FLASH_ASSERT(params.batch_size < MaxBatchSize);
get_mla_metadata_kernel<<<1, 32, 0, stream>>>(params);
CHECK_CUDA_KERNEL_LAUNCH();
}
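The metadata kernel above gives each of the `num_sm_parts` partitions an approximately equal block budget, charging a fixed overhead of `fixed_overhead_num_blocks` per sequence. With n = block_size_n, c = fixed_overhead_num_blocks and P = num_sm_parts, the per-partition budget is:
```
\[
\mathrm{payload}
= \left\lceil \frac{\sum_i \bigl(\lceil \mathrm{seqlens\_k}[i] / n \rceil + c\bigr)}{P} \right\rceil + c .
\]
```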


@@ -0,0 +1,63 @@
#pragma once
////////////////////////////////////////////////////////////////////////////////////////////////////
struct Flash_fwd_mla_params {
using index_t = int64_t;
int b, seqlen_q, d, d_v;
int h, h_h_k_ratio, ngroups;
bool is_causal;
float scale_softmax, scale_softmax_log2;
int *__restrict__ cu_seqlens_k;
void *__restrict__ q_ptr;
void *__restrict__ k_ptr;
void *__restrict__ v_ptr;
void *__restrict__ o_ptr;
void *__restrict__ softmax_lse_ptr;
index_t q_batch_stride;
index_t k_batch_stride;
index_t v_batch_stride;
index_t o_batch_stride;
index_t q_row_stride;
index_t k_row_stride;
index_t v_row_stride;
index_t o_row_stride;
index_t q_head_stride;
index_t k_head_stride;
index_t v_head_stride;
index_t o_head_stride;
int *__restrict__ block_table;
index_t block_table_batch_stride;
int page_block_size;
int *__restrict__ tile_scheduler_metadata_ptr;
int num_sm_parts;
int *__restrict__ num_splits_ptr;
void *__restrict__ softmax_lseaccum_ptr;
void *__restrict__ oaccum_ptr;
};
static constexpr int TileSchedulerMetaDataSize = 8;
// [begin_idx, begin_seqlen, end_idx, end_seqlen, begin_n_split_idx, _, _, _]
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename T, int Headdim>
void run_mha_fwd_splitkv_mla(Flash_fwd_mla_params &params, cudaStream_t stream);
struct Mla_metadata_params {
int *__restrict__ seqlens_k_ptr;
int *__restrict__ tile_scheduler_metadata_ptr;
int *__restrict__ num_splits_ptr;
int batch_size;
int block_size_n;
int fixed_overhead_num_blocks;
int num_sm_parts;
};
void get_mla_metadata_func(Mla_metadata_params &params, cudaStream_t stream);


@@ -0,0 +1,15 @@
#pragma once
#include "cutlass/barrier.h"
namespace flash {
////////////////////////////////////////////////////////////////////////////////////////////////////
// Enumerates the reserved named barriers to avoid potential conflicts
enum class NamedBarriers {
SReady = 1,
SoftmaxReady = 2,
};
} // flash


@@ -0,0 +1,197 @@
// Adapted from https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/softmax.h
#pragma once
#include <cmath>
#include <cute/tensor.hpp>
#include <cutlass/numeric_types.h>
#include "utils.h"
namespace flash {
using namespace cute;
////////////////////////////////////////////////////////////////////////////////////////////////////
template<bool zero_init=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1, typename Operator>
__device__ __forceinline__ void thread_reduce_(Tensor<Engine0, Layout0> const &tensor, Tensor<Engine1, Layout1> &summary, Operator &op) {
static_assert(Layout0::rank == 2, "Only support 2D Tensor");
static_assert(Layout1::rank == 1, "Only support 1D Tensor");
CUTE_STATIC_ASSERT_V(size<0>(summary) == size<0>(tensor));
#pragma unroll
for (int mi = 0; mi < size<0>(tensor); mi++) {
summary(mi) = zero_init ? tensor(mi, 0) : op(summary(mi), tensor(mi, 0));
#pragma unroll
for (int ni = 1; ni < size<1>(tensor); ni++) {
summary(mi) = op(summary(mi), tensor(mi, ni));
}
}
}
template<typename Engine0, typename Layout0, typename Engine1, typename Layout1, typename Operator>
__device__ __forceinline__ void quad_allreduce_(Tensor<Engine0, Layout0> &dst, Tensor<Engine1, Layout1> &src, Operator &op) {
CUTE_STATIC_ASSERT_V(size(dst) == size(src));
#pragma unroll
for (int i = 0; i < size(dst); i++){
dst(i) = Allreduce<4>::run(src(i), op);
}
}
template<bool zero_init=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1, typename Operator>
__device__ __forceinline__ void reduce_(Tensor<Engine0, Layout0> const& tensor, Tensor<Engine1, Layout1> &summary, Operator &op) {
thread_reduce_<zero_init>(tensor, summary, op);
quad_allreduce_(summary, summary, op);
}
template<bool zero_init=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1>
__device__ __forceinline__ void reduce_max(Tensor<Engine0, Layout0> const& tensor, Tensor<Engine1, Layout1> &max){
MaxOp<float> max_op;
reduce_<zero_init>(tensor, max, max_op);
}
template<bool zero_init=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1>
__device__ __forceinline__ void reduce_sum(Tensor<Engine0, Layout0> const& tensor, Tensor<Engine1, Layout1> &sum){
SumOp<float> sum_op;
thread_reduce_<zero_init>(tensor, sum, sum_op);
}
// Apply the exp to all the elements.
template <bool Scale_max=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1>
__forceinline__ __device__ auto scale_apply_exp2(Tensor<Engine0, Layout0> &tensor, Tensor<Engine1, Layout1> const &max, const float scale) {
static_assert(Layout0::rank == 2, "Only support 2D Tensor");
static_assert(Layout1::rank == 1, "Only support 1D Tensor");
CUTE_STATIC_ASSERT_V(size<0>(max) == size<0>(tensor));
#pragma unroll
for (int mi = 0; mi < size<0>(tensor); ++mi) {
// If max is -inf, then all elements must have been -inf (possibly due to masking).
// We don't want (-inf - (-inf)) since that would give NaN.
// If we don't have float around M_LOG2E the multiplication is done in fp64.
const float max_scaled = max(mi) == -INFINITY ? 0.f : max(mi) * (Scale_max ? scale : float(M_LOG2E));
#pragma unroll
for (int ni = 0; ni < size<1>(tensor); ++ni) {
// Instead of computing exp(x - max), we compute exp2(x * log_2(e) -
// max * log_2(e)) This allows the compiler to use the ffma
// instruction instead of fadd and fmul separately.
// The following macro will disable the use of fma.
// See: https://github.com/pytorch/pytorch/issues/121558 for more details
// This macro is set in PyTorch and not FlashAttention
#ifdef UNFUSE_FMA
tensor(mi, ni) = exp2f(__fmul_rn(tensor(mi, ni), scale) - max_scaled);
#else
tensor(mi, ni) = exp2f(tensor(mi, ni) * scale - max_scaled);
#endif
}
}
return tensor;
}
// Apply the exp to all the elements.
template <bool zero_init=true, typename Engine0, typename Layout0, typename Engine1, typename Layout1>
__forceinline__ __device__ void max_scale_exp2_sum(Tensor<Engine0, Layout0> &tensor, Tensor<Engine1, Layout1> &max, Tensor<Engine1, Layout1> &sum, const float scale) {
static_assert(Layout0::rank == 2, "Only support 2D Tensor");
static_assert(Layout1::rank == 1, "Only support 1D Tensor");
CUTE_STATIC_ASSERT_V(size<0>(max) == size<0>(tensor));
#pragma unroll
for (int mi = 0; mi < size<0>(tensor); ++mi) {
MaxOp<float> max_op;
max(mi) = zero_init ? tensor(mi, 0) : max_op(max(mi), tensor(mi, 0));
#pragma unroll
for (int ni = 1; ni < size<1>(tensor); ni++) {
max(mi) = max_op(max(mi), tensor(mi, ni));
}
max(mi) = Allreduce<4>::run(max(mi), max_op);
// If max is -inf, then all elements must have been -inf (possibly due to masking).
// We don't want (-inf - (-inf)) since that would give NaN.
const float max_scaled = max(mi) == -INFINITY ? 0.f : max(mi) * scale;
sum(mi) = 0;
#pragma unroll
for (int ni = 0; ni < size<1>(tensor); ++ni) {
// Instead of computing exp(x - max), we compute exp2(x * log_2(e) -
// max * log_2(e)) This allows the compiler to use the ffma
// instruction instead of fadd and fmul separately.
tensor(mi, ni) = exp2f(tensor(mi, ni) * scale - max_scaled);
sum(mi) += tensor(mi, ni);
}
SumOp<float> sum_op;
sum(mi) = Allreduce<4>::run(sum(mi), sum_op);
}
}
template<typename Tensor0, typename Tensor1>
__forceinline__ __device__ void rescale_o(Tensor0 &acc_o, Tensor1 &scale_o) {
// Reshape acc_o from ((2, 2, V), MMA_M, MMA_N) to (nrow=(2, MMA_M), ncol=(2, V, MMA_N))
Tensor acc_o_rowcol = make_tensor(acc_o.data(), flash::convert_layout_acc_rowcol(acc_o.layout()));
#pragma unroll
for (int mi = 0; mi < size(scale_o); ++mi) {
#pragma unroll
for (int ni = 0; ni < size<1>(acc_o_rowcol); ++ni) { acc_o_rowcol(mi, ni) *= scale_o(mi); }
}
}
////////////////////////////////////////////////////////////////////////////////////////////////////
template <int kNRows>
struct Softmax {
using TensorT = decltype(make_tensor<float>(Shape<Int<kNRows>>{}));
TensorT row_max, row_sum;
__forceinline__ __device__ Softmax() {};
template<bool Is_first, bool Check_inf=false, typename Tensor0>
__forceinline__ __device__ TensorT softmax(Tensor0 &acc_s, float softmax_scale_log2) {
// Reshape acc_s from ((2, 2, V), MMA_M, MMA_N) to (nrow=(2, MMA_M), ncol=(2, V, MMA_N))
Tensor scores = make_tensor(acc_s.data(), flash::convert_layout_acc_rowcol(acc_s.layout()));
static_assert(decltype(size<0>(scores))::value == kNRows);
TensorT scale_o;
clear(scale_o);
if (Is_first) {
flash::template reduce_max</*zero_init=*/true>(scores, row_max);
flash::scale_apply_exp2(scores, row_max, softmax_scale_log2);
flash::reduce_sum</*zero_init=*/true>(scores, row_sum);
} else {
Tensor scores_max_prev = make_fragment_like(row_max);
cute::copy(row_max, scores_max_prev);
flash::template reduce_max</*zero_init=*/false>(scores, row_max);
// Reshape acc_o from (MMA=4, MMA_M, MMA_K) to (nrow=(2, MMA_M), ncol=(2, MMA_K))
#pragma unroll
for (int mi = 0; mi < size(row_max); ++mi) {
float scores_max_cur = !Check_inf
? row_max(mi)
: (row_max(mi) == -INFINITY ? 0.0f : row_max(mi));
float scores_scale = exp2f((scores_max_prev(mi) - scores_max_cur) * softmax_scale_log2);
scale_o(mi) = scores_scale;
row_sum(mi) *= scores_scale;
}
flash::scale_apply_exp2(scores, row_max, softmax_scale_log2);
// We don't do the reduce across threads here since we don't need to use the row_sum.
// We do that reduce at the end when we need to normalize the softmax.
flash::reduce_sum</*zero_init=*/false>(scores, row_sum);
}
return scale_o;
};
template<bool Is_dropout=false, bool Split=false, typename Tensor0>
__forceinline__ __device__ TensorT normalize_softmax_lse(Tensor0 &acc_o, float softmax_scale, float rp_dropout=1.0) {
SumOp<float> sum_op;
quad_allreduce_(row_sum, row_sum, sum_op);
TensorT lse = make_fragment_like(row_sum);
// Reshape acc_o from ((2, 2, V), MMA_M, MMA_N) to (nrow=(2, MMA_M), ncol=(2, V, MMA_N))
Tensor acc_o_rowcol = make_tensor(acc_o.data(), flash::convert_layout_acc_rowcol(acc_o.layout()));
static_assert(decltype(size<0>(acc_o_rowcol))::value == kNRows);
#pragma unroll
for (int mi = 0; mi < size<0>(acc_o_rowcol); ++mi) {
float sum = row_sum(mi);
float inv_sum = (sum == 0.f || sum != sum) ? 1.f : 1.f / sum;
lse(mi) = (sum == 0.f || sum != sum) ? (Split ? -INFINITY : INFINITY) : row_max(mi) * softmax_scale + __logf(sum);
float scale = !Is_dropout ? inv_sum : inv_sum * rp_dropout;
#pragma unroll
for (int ni = 0; ni < size<1>(acc_o_rowcol); ++ni) { acc_o_rowcol(mi, ni) *= scale; }
}
return lse;
};
};
} // namespace flash
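The comments in this softmax header lean on two facts: exp(x - max) can be rewritten in base 2 so a single FFMA can feed exp2f, and when a later tile raises the running row max, the previously accumulated partial sums (and partial output) only need to be rescaled by exp2((max_old - max_new) * scale). Below is a minimal NumPy sketch, not part of the diff, that checks both identities against a plain softmax; all names in it are illustrative.

```python
# Minimal NumPy check (not part of the kernel) of two identities the comments
# above rely on:
#   1) exp(x - m) == exp2(x*log2(e) - m*log2(e)), so one FFMA can feed exp2f;
#   2) online softmax: when a new tile raises the running max, the previous
#      partial sums only need to be rescaled by exp2(m_old - m_new).
import numpy as np

LOG2E = 1.4426950408889634  # log2(e)

def softmax_ref(x):
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_online_exp2(x, tile=4):
    m = np.full(x.shape[:-1], -np.inf)          # running row max (of scaled values)
    s = np.zeros(x.shape[:-1])                  # running row sum of exp2 terms
    p_tiles = []
    for t0 in range(0, x.shape[-1], tile):
        xt = x[..., t0:t0 + tile] * LOG2E       # work in base 2, like the kernel
        m_new = np.maximum(m, xt.max(axis=-1))
        scale = np.where(np.isneginf(m), 0.0, np.exp2(m - m_new))  # -inf guard, like the kernel
        p = np.exp2(xt - m_new[..., None])
        s = s * scale + p.sum(axis=-1)
        p_tiles = [pt * scale[..., None] for pt in p_tiles] + [p]
        m = m_new
    return np.concatenate(p_tiles, axis=-1) / s[..., None]

x = np.random.randn(3, 16)
assert np.allclose(softmax_ref(x), softmax_online_exp2(x), atol=1e-6)
```

The kernel keeps the running max and sum in registers and defers the final division by the row sum to normalize_softmax_lse; the algebra is the same as above.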


@@ -0,0 +1,65 @@
#pragma once
#define CHECK_CUDA(call) \
do { \
cudaError_t status_ = call; \
if (status_ != cudaSuccess) { \
fprintf(stderr, "CUDA error (%s:%d): %s\n", __FILE__, __LINE__, cudaGetErrorString(status_)); \
exit(1); \
} \
} while(0)
#define CHECK_CUDA_KERNEL_LAUNCH() CHECK_CUDA(cudaGetLastError())
#define FLASH_ASSERT(cond) \
do { \
if (not (cond)) { \
fprintf(stderr, "Assertion failed (%s:%d): %s\n", __FILE__, __LINE__, #cond); \
exit(1); \
} \
} while(0)
#define FLASH_DEVICE_ASSERT(cond) \
do { \
if (not (cond)) { \
printf("Assertion failed (%s:%d): %s\n", __FILE__, __LINE__, #cond); \
asm("trap;"); \
} \
} while(0)
#define BOOL_SWITCH(COND, CONST_NAME, ...) \
[&] { \
if (COND) { \
constexpr static bool CONST_NAME = true; \
return __VA_ARGS__(); \
} else { \
constexpr static bool CONST_NAME = false; \
return __VA_ARGS__(); \
} \
}()
#define MLA_NUM_SPLITS_SWITCH(NUM_SPLITS, NAME, ...) \
[&] { \
if (NUM_SPLITS <= 32) { \
constexpr static int NAME = 32; \
return __VA_ARGS__(); \
} else if (NUM_SPLITS <= 64) { \
constexpr static int NAME = 64; \
return __VA_ARGS__(); \
} else if (NUM_SPLITS <= 96) { \
constexpr static int NAME = 96; \
return __VA_ARGS__(); \
} else if (NUM_SPLITS <= 128) { \
constexpr static int NAME = 128; \
return __VA_ARGS__(); \
} else if (NUM_SPLITS <= 160) { \
constexpr static int NAME = 160; \
return __VA_ARGS__(); \
} else { \
FLASH_ASSERT(false); \
} \
}()
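BOOL_SWITCH and MLA_NUM_SPLITS_SWITCH follow the usual dispatch pattern of turning a runtime value into a compile-time constant by branching into an immediately invoked lambda. MLA_NUM_SPLITS_SWITCH in particular rounds the runtime split count up to the next supported bucket. A tiny sketch of that bucketing rule (Python, purely illustrative):

```python
# Illustrative sketch of the MLA_NUM_SPLITS_SWITCH bucketing above: a runtime
# num_splits is mapped to the smallest supported compile-time bucket >= it,
# and anything above 160 is rejected (FLASH_ASSERT(false) in the macro).
BUCKETS = (32, 64, 96, 128, 160)

def num_splits_bucket(num_splits: int) -> int:
    for bucket in BUCKETS:
        if num_splits <= bucket:
            return bucket
    raise AssertionError("num_splits > 160 is not supported")

assert [num_splits_bucket(n) for n in (1, 32, 33, 96, 150)] == [32, 32, 64, 96, 160]
```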


@@ -0,0 +1,238 @@
// Adapted from https://github.com/Dao-AILab/flash-attention/blob/main/hopper/utils.h
#pragma once
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <cuda_bf16.h>
#include <cute/tensor.hpp>
#include <cutlass/array.h>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_conversion.h>
#include <cutlass/numeric_types.h>
////////////////////////////////////////////////////////////////////////////////////////////////////
namespace flash {
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename T>
struct MaxOp {
__device__ __forceinline__ T operator()(T const & x, T const & y) { return x > y ? x : y; }
};
template <>
struct MaxOp<float> {
// This is slightly faster
__device__ __forceinline__ float operator()(float const &x, float const &y) { return max(x, y); }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template<typename T>
struct SumOp {
__device__ __forceinline__ T operator()(T const & x, T const & y) { return x + y; }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template<int THREADS>
struct Allreduce {
static_assert(THREADS == 32 || THREADS == 16 || THREADS == 8 || THREADS == 4);
template<typename T, typename Operator>
static __device__ __forceinline__ T run(T x, Operator &op) {
constexpr int OFFSET = THREADS / 2;
x = op(x, __shfl_xor_sync(uint32_t(-1), x, OFFSET));
return Allreduce<OFFSET>::run(x, op);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template<>
struct Allreduce<2> {
template<typename T, typename Operator>
static __device__ __forceinline__ T run(T x, Operator &op) {
x = op(x, __shfl_xor_sync(uint32_t(-1), x, 1));
return x;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <bool zero_init=false, int wg_wait=0, bool arrive=true, bool commit=true, typename Tensor0, typename Tensor1, typename Tensor2, typename TiledMma>
__forceinline__ __device__ void gemm(TiledMma &tiled_mma, Tensor0 const &tCrA, Tensor1 const &tCrB, Tensor2 &tCrC) {
constexpr bool Is_RS = !cute::is_base_of<cute::GMMA::DescriptorIterator, typename TiledMma::FrgTypeA>::value;
// Need to cast away const on tCrA since warpgroup_fence_operand doesn't take const
if constexpr (Is_RS) { cute::warpgroup_fence_operand(const_cast<Tensor0 &>(tCrA)); }
warpgroup_fence_operand(tCrC);
if constexpr (arrive) {
warpgroup_arrive();
}
if constexpr (zero_init) {
tiled_mma.accumulate_ = GMMA::ScaleOut::Zero;
// Unroll the K mode manually to set scale D to 1
CUTLASS_PRAGMA_UNROLL
for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
cute::gemm(tiled_mma, tCrA(_,_,k_block), tCrB(_,_,k_block), tCrC);
tiled_mma.accumulate_ = GMMA::ScaleOut::One;
}
} else {
// cute::gemm(tiled_mma, tCrA, tCrB, tCrC);
// Unroll the K mode manually to set scale D to 1
CUTLASS_PRAGMA_UNROLL
for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
cute::gemm(tiled_mma, tCrA(_,_,k_block), tCrB(_,_,k_block), tCrC);
tiled_mma.accumulate_ = GMMA::ScaleOut::One;
}
}
if constexpr (commit) {
warpgroup_commit_batch();
}
if constexpr (wg_wait >= 0) { warpgroup_wait<wg_wait>(); }
warpgroup_fence_operand(tCrC);
if constexpr (Is_RS) { warpgroup_fence_operand(const_cast<Tensor0 &>(tCrA)); }
}
////////////////////////////////////////////////////////////////////////////////////////////////////
// For SM80, convert acc_layout from (MMA=4, MMA_M, MMA_N) to (nrow=(2, MMA_M), ncol=(2, MMA_N))
// For SM90, convert acc_layout from ((2, 2, V), MMA_M, MMA_N) to (nrow=(2, MMA_M), ncol=(2, V, MMA_N))
template<bool Transposed=false, typename Layout0>
__forceinline__ __device__ auto convert_layout_acc_rowcol(Layout0 acc_layout) {
if constexpr (decltype(rank<0>(acc_layout))::value == 3) { // SM90
static_assert(decltype(size<0, 0>(acc_layout))::value == 2);
static_assert(decltype(size<0, 1>(acc_layout))::value == 2);
static_assert(decltype(rank(acc_layout))::value == 3);
auto l = acc_layout;
if constexpr (!Transposed) {
return make_layout(make_layout(get<0, 1>(l), get<1>(l)), make_layout(get<0, 0>(l), get<0, 2>(l), get<2>(l)));
} else {
return make_layout(make_layout(get<0, 0>(l), get<0, 2>(l), get<2>(l)), make_layout(get<0, 1>(l), get<1>(l)));
}
} else { // SM80
static_assert(decltype(size<0>(acc_layout))::value == 4);
static_assert(decltype(rank(acc_layout))::value == 3);
auto l = logical_divide(acc_layout, Shape<_2>{}); // ((2, 2), MMA_M, MMA_N)
if constexpr (!Transposed) {
return make_layout(make_layout(get<0, 1>(l), get<1>(l)), make_layout(get<0, 0>(l), get<2>(l)));
} else {
return make_layout(make_layout(get<0, 0>(l), get<2>(l)), make_layout(get<0, 1>(l), get<1>(l)));
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
// For SM80, convert acc_layout from (MMA=4, MMA_M, MMA_N) to ((4, 2), MMA_M, MMA_N / 2)
// if using m16n8k16, or to (4, MMA_M, MMA_N) if using m16n8k8.
// For SM90, FP16/BF16, convert acc_layout from ((2, 2, N / 8), MMA_M, MMA_N) to ((2, 2, 2), MMA_M, (N / 16, MMA_N))
// For SM90, FP8, convert acc_layout from ((2, 2, N / 8), MMA_M, MMA_N) to ((4, 2, 2), MMA_M, (N / 32, MMA_N))
template<typename MMA_Traits, typename Layout0>
__forceinline__ __device__ auto convert_layout_acc_Aregs(Layout0 acc_layout) {
using X = Underscore;
if constexpr (decltype(rank<0>(acc_layout))::value == 3) { // SM90
static_assert(decltype(size<0, 0>(acc_layout))::value == 2);
static_assert(decltype(size<0, 1>(acc_layout))::value == 2);
static_assert(decltype(rank(acc_layout))::value == 3);
static_assert(decltype(rank(get<0>(acc_layout)))::value == 3);
if constexpr (sizeof(typename MMA_Traits::ValTypeA) == 2) {
auto l = logical_divide(get<0, 2>(acc_layout), Tile<_2>{}); // ((2, N / 16))
return make_layout(make_layout(get<0, 0>(acc_layout), get<0, 1>(acc_layout), get<0, 0>(l)), get<1>(acc_layout), coalesce(make_layout(get<0, 1>(l), get<2>(acc_layout))));
} else {
static_assert(sizeof(typename MMA_Traits::ValTypeA) == 1);
static_assert(decltype(stride<0, 0>(acc_layout))::value == 1);
static_assert(decltype(stride<0, 1>(acc_layout))::value == 2);
auto l = logical_divide(get<0, 2>(acc_layout), Tile<Layout<Shape<_2, _2>>>{}); // (((2, 2), N / 32))
// This combines the first two modes (<0, 0> and <0, 1>) into one mode.
// Will require register shuffling later to be correct.
return make_layout(make_layout(Layout<_4>{}, get<0, 0, 0>(l), get<0, 0, 1>(l)),
get<1>(acc_layout),
coalesce(make_layout(get<0, 1>(l), get<2>(acc_layout)))); // ((4, 2, 2), MMA_M, N / 32 * MMA_N)
// This combination is right but doesn't work with register shuffling.
// return make_layout(make_layout(coalesce(make_layout(get<0, 0>(acc_layout), get<0, 0, 0>(l))), get<0, 1>(acc_layout), get<0, 0, 1>(l)),
// get<1>(acc_layout),
// coalesce(make_layout(get<0, 1>(l), get<2>(acc_layout))));
}
} else { // SM80
static_assert(decltype(size<0>(acc_layout))::value == 4);
static_assert(decltype(rank(acc_layout))::value == 3);
constexpr int mma_shape_K = get<2>(typename MMA_Traits::Shape_MNK{});
static_assert(mma_shape_K == 8 || mma_shape_K == 16);
if constexpr (mma_shape_K == 8) {
return acc_layout;
} else {
auto l = logical_divide(acc_layout, Shape<X, X, _2>{}); // (4, MMA_M, (2, MMA_N / 2)))
return make_layout(make_layout(get<0>(l), get<2, 0>(l)), get<1>(l), get<2, 1>(l));
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename To_type, typename Engine, typename Layout>
__forceinline__ __device__ auto convert_type(Tensor<Engine, Layout> const &tensor) {
using From_type = typename Engine::value_type;
constexpr int numel = decltype(size(tensor))::value;
cutlass::NumericArrayConverter<To_type, From_type, numel> convert_op;
// HACK: this requires tensor to be "contiguous"
auto frag = convert_op(*reinterpret_cast<const cutlass::Array<From_type, numel> *>(tensor.data()));
return make_tensor(make_rmem_ptr<To_type>(&frag), tensor.layout());
}
////////////////////////////////////////////////////////////////////////////////////////////////////
// Blocks until all but N previous cp.async.commit_group operations have committed.
// This differs from cute::cp_async_wait in that when N = 0 we don't call cp.async.wait_all
// (which is equivalent to commit_group then wait_group 0).
// Instead we just call cp.async.wait_group 0, which is slightly faster.
// https://github.com/NVIDIA/cutlass/blob/master/include/cute/arch/copy_sm80.hpp#L113
template <int N>
CUTE_HOST_DEVICE
void cp_async_wait() {
#if defined(CUTE_ARCH_CP_ASYNC_SM80_ENABLED)
asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
#endif
}
////////////////////////////////////////////////////////////////////////////////////////////////////
template <bool Is_even_MN=true, bool Is_even_K=true, bool Clear_OOB_MN=false, bool Clear_OOB_K=true,
typename TiledCopy, typename Engine0, typename Layout0, typename Engine1, typename Layout1,
typename Engine2, typename Layout2, typename Engine3, typename Layout3>
__forceinline__ __device__ void copy(TiledCopy tiled_copy, Tensor<Engine0, Layout0> const &S,
Tensor<Engine1, Layout1> &D, Tensor<Engine2, Layout2> const &identity_MN,
Tensor<Engine3, Layout3> const &predicate_K, const int max_MN=0) {
CUTE_STATIC_ASSERT_V(rank(S) == Int<3>{});
CUTE_STATIC_ASSERT_V(rank(D) == Int<3>{});
CUTE_STATIC_ASSERT_V(size<0>(S) == size<0>(D)); // MMA
CUTE_STATIC_ASSERT_V(size<1>(S) == size<1>(D)); // MMA_M
CUTE_STATIC_ASSERT_V(size<2>(S) == size<2>(D)); // MMA_K
// There's no case where !Clear_OOB_K && Clear_OOB_MN
static_assert(!(Clear_OOB_MN && !Clear_OOB_K));
#pragma unroll
for (int m = 0; m < size<1>(S); ++m) {
if (Is_even_MN || get<0>(identity_MN(0, m, 0)) < max_MN) {
#pragma unroll
for (int k = 0; k < size<2>(S); ++k) {
if (Is_even_K || predicate_K(k)) {
cute::copy(tiled_copy, S(_, m, k), D(_, m, k));
} else if (Clear_OOB_K) {
cute::clear(D(_, m, k));
}
}
} else if (Clear_OOB_MN) {
cute::clear(D(_, m, _));
}
}
}
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace flash
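Allreduce<THREADS> above is a butterfly reduction over __shfl_xor_sync: each round combines a lane with the lane whose index differs by the current offset bit, and after log2(THREADS) rounds every lane in the group holds the reduced value. A plain-Python model of the exchange pattern (illustrative only, no CUDA semantics implied):

```python
# Plain-Python model (no CUDA semantics) of flash::Allreduce<THREADS>: an
# XOR-butterfly reduction in which every round pairs lane i with lane
# i ^ offset, halving offset until each lane of a THREADS-wide group holds
# the reduction of the whole group.
def butterfly_allreduce(values, threads, op=max):
    assert threads in (4, 8, 16, 32) and len(values) % threads == 0
    vals = list(values)
    offset = threads // 2
    while offset >= 1:                      # mirrors the recursive template
        vals = [op(v, vals[lane ^ offset]) for lane, v in enumerate(vals)]
        offset //= 2
    return vals

data = [3, 9, 1, 7, 4, 4, 8, 2]
print(butterfly_allreduce(data, threads=4))   # -> [9, 9, 9, 9, 8, 8, 8, 8]
```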


@@ -0,0 +1,6 @@
__version__ = "1.0.0"
from flash_mla.flash_mla_interface import (
get_mla_metadata,
flash_mla_with_kvcache,
)


@@ -0,0 +1,67 @@
from typing import Optional, Tuple
import torch
import flash_mla_cuda
def get_mla_metadata(
cache_seqlens: torch.Tensor,
num_heads_per_head_k: int,
num_heads_k: int,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Arguments:
cache_seqlens: (batch_size), dtype torch.int32.
num_heads_per_head_k: Equals seq_len_q * num_heads_q // num_heads_k.
num_heads_k: Number of key/value heads in the KV cache.
Returns:
tile_scheduler_metadata: (num_sm_parts, TileSchedulerMetaDataSize), dtype torch.int32.
num_splits: (batch_size + 1), dtype torch.int32.
"""
return flash_mla_cuda.get_mla_metadata(cache_seqlens, num_heads_per_head_k, num_heads_k)
def flash_mla_with_kvcache(
q: torch.Tensor,
k_cache: torch.Tensor,
block_table: torch.Tensor,
cache_seqlens: torch.Tensor,
head_dim_v: int,
tile_scheduler_metadata: torch.Tensor,
num_splits: torch.Tensor,
softmax_scale: Optional[float] = None,
causal: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Arguments:
q: (batch_size, seq_len_q, num_heads_q, head_dim).
k_cache: (num_blocks, page_block_size, num_heads_k, head_dim).
block_table: (batch_size, max_num_blocks_per_seq), torch.int32.
cache_seqlens: (batch_size), torch.int32.
head_dim_v: Head dimension of v.
tile_scheduler_metadata: (num_sm_parts, TileSchedulerMetaDataSize), torch.int32, returned by get_mla_metadata.
num_splits: (batch_size + 1), torch.int32, returned by get_mla_metadata.
softmax_scale: float. The scale of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).
causal: bool. Whether to apply causal attention mask.
Returns:
out: (batch_size, seq_len_q, num_heads_q, head_dim_v).
softmax_lse: (batch_size, num_heads_q, seq_len_q), torch.float32.
"""
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
out, softmax_lse = flash_mla_cuda.fwd_kvcache_mla(
q,
k_cache,
None,
head_dim_v,
cache_seqlens,
block_table,
softmax_scale,
causal,
tile_scheduler_metadata,
num_splits,
)
return out, softmax_lse
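Putting the two entry points together, a typical call sequence is: derive the scheduler metadata from the cache lengths once per decoding step, then run the paged-KV attention. The following is a hedged usage sketch, not part of the diff; the shapes are illustrative and it assumes the flash_mla_cuda extension from this example has been built and a Hopper GPU is available.

```python
# Hedged usage sketch for the interface above; shapes are illustrative and it
# assumes the flash_mla_cuda extension from this example is built and a
# Hopper GPU is present.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

b, s_q, h_q, h_kv, d, dv = 2, 1, 16, 1, 576, 512
block_size, max_blocks = 64, 8                       # paged KV-cache geometry

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
k_cache = torch.randn(b * max_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * max_blocks, dtype=torch.int32,
                           device="cuda").view(b, max_blocks)
cache_seqlens = torch.full((b,), 300, dtype=torch.int32, device="cuda")

# Metadata depends only on cache lengths and the head configuration, so it can
# be computed once per decoding step and reused.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, softmax_lse = flash_mla_with_kvcache(
    q, k_cache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
# out: (b, s_q, h_q, dv), softmax_lse: (b, h_q, s_q)
```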


@@ -0,0 +1,87 @@
import os
from pathlib import Path
from datetime import datetime
from setuptools import setup, find_packages
from torch.utils.cpp_extension import (
BuildExtension,
CUDAExtension,
IS_WINDOWS,
)
DISABLE_FP16 = os.getenv("FLASH_MLA_DISABLE_FP16", "FALSE") == "TRUE"
def append_nvcc_threads(nvcc_extra_args):
nvcc_threads = os.getenv("NVCC_THREADS") or "32"
return nvcc_extra_args + ["--threads", nvcc_threads]
def get_sources():
sources = [
"csrc/flash_api.cpp",
"csrc/flash_fwd_mla_bf16_sm90.cu",
"csrc/flash_fwd_mla_metadata.cu",
]
if not DISABLE_FP16:
sources.append("csrc/flash_fwd_mla_fp16_sm90.cu")
return sources
def get_features_args():
features_args = []
if DISABLE_FP16:
features_args.append("-DFLASH_MLA_DISABLE_FP16")
return features_args
cc_flag = []
cc_flag.append("-gencode")
cc_flag.append("arch=compute_90a,code=sm_90a")
this_dir = os.path.dirname(os.path.abspath(__file__))
if IS_WINDOWS:
cxx_args = ["/O2", "/std:c++17", "/DNDEBUG", "/W0"]
else:
cxx_args = ["-O3", "-std=c++17", "-DNDEBUG", "-Wno-deprecated-declarations"]
ext_modules = []
ext_modules.append(
CUDAExtension(
name="flash_mla_cuda",
sources=get_sources(),
extra_compile_args={
"cxx": cxx_args + get_features_args(),
"nvcc": append_nvcc_threads(
[
"-O3",
"-std=c++17",
"-DNDEBUG",
"-D_USE_MATH_DEFINES",
"-Wno-deprecated-declarations",
"-U__CUDA_NO_HALF_OPERATORS__",
"-U__CUDA_NO_HALF_CONVERSIONS__",
"-U__CUDA_NO_HALF2_OPERATORS__",
"-U__CUDA_NO_BFLOAT16_CONVERSIONS__",
"--expt-relaxed-constexpr",
"--expt-extended-lambda",
"--use_fast_math",
"--ptxas-options=-v,--register-usage-level=10"
]
+ cc_flag
) + get_features_args(),
},
include_dirs=[
Path(this_dir) / "csrc",
Path(this_dir) / ".." / ".." / "include",
],
)
)
setup(
name="flash_mla",
version="1.0.0",
packages=find_packages(include=['flash_mla']),
ext_modules=ext_modules,
cmdclass={"build_ext": BuildExtension},
)


@@ -0,0 +1,153 @@
import argparse
import math
import random
import torch
import triton
from flash_mla import flash_mla_with_kvcache, get_mla_metadata
def scaled_dot_product_attention(query, key, value, h_q, h_kv, is_causal=False):
query = query.float()
key = key.float()
value = value.float()
key = key.repeat_interleave(h_q // h_kv, dim=0)
value = value.repeat_interleave(h_q // h_kv, dim=0)
attn_weight = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
if is_causal:
s_q = query.shape[-2]
s_k = key.shape[-2]
attn_bias = torch.zeros(s_q, s_k, dtype=query.dtype)
temp_mask = torch.ones(s_q, s_k, dtype=torch.bool).tril(diagonal=s_k - s_q)
attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
attn_bias = attn_bias.to(query.dtype)
attn_weight += attn_bias
lse = attn_weight.logsumexp(dim=-1)
attn_weight = torch.softmax(attn_weight, dim=-1, dtype=torch.float32)
return attn_weight @ value, lse
def cal_diff(x: torch.Tensor, y: torch.Tensor, name: str) -> None:
x, y = x.double(), y.double()
RMSE = ((x - y) * (x - y)).mean().sqrt().item()
cos_diff = 1 - 2 * (x * y).sum().item() / max((x * x + y * y).sum().item(), 1e-12)
amax_diff = (x - y).abs().max().item()
# print(f"{name}: {cos_diff=}, {RMSE=}, {amax_diff=}")
assert cos_diff < 1e-5
@torch.inference_mode()
def test_flash_mla(b, s_q, mean_sk, h_q, h_kv, d, dv, causal, varlen):
print(
f"{b=}, {s_q=}, {mean_sk=}, {h_q=}, {h_kv=}, {d=}, {dv=}, {causal=}, {varlen=}"
)
cache_seqlens = torch.full((b,), mean_sk, dtype=torch.int32)
if varlen:
for i in range(b):
cache_seqlens[i] = max(random.normalvariate(mean_sk, mean_sk / 2), s_q)
total_seqlens = cache_seqlens.sum().item()
mean_seqlens = cache_seqlens.float().mean().int().item()
max_seqlen = cache_seqlens.max().item()
max_seqlen_pad = triton.cdiv(max_seqlen, 256) * 256
# print(f"{total_seqlens=}, {mean_seqlens=}, {max_seqlen=}")
q = torch.randn(b, s_q, h_q, d)
block_size = 64
block_table = torch.arange(
b * max_seqlen_pad // block_size, dtype=torch.int32
).view(b, max_seqlen_pad // block_size)
blocked_k = torch.randn(block_table.numel(), block_size, h_kv, d)
for i in range(b):
blocked_k.view(b, max_seqlen_pad, h_kv, d)[i, cache_seqlens[i].item():] = (
float("nan")
)
blocked_v = blocked_k[..., :dv]
tile_scheduler_metadata, num_splits = get_mla_metadata(
cache_seqlens, s_q * h_q // h_kv, h_kv
)
def flash_mla():
return flash_mla_with_kvcache(
q,
blocked_k,
block_table,
cache_seqlens,
dv,
tile_scheduler_metadata,
num_splits,
causal=causal,
)
def ref_mla():
out = torch.empty(b, s_q, h_q, dv, dtype=torch.float32)
lse = torch.empty(b, h_q, s_q, dtype=torch.float32)
for i in range(b):
begin = i * max_seqlen_pad
end = begin + cache_seqlens[i]
O, LSE = scaled_dot_product_attention(
q[i].transpose(0, 1),
blocked_k.view(-1, h_kv, d)[begin:end].transpose(0, 1),
blocked_v.view(-1, h_kv, dv)[begin:end].transpose(0, 1),
h_q=h_q,
h_kv=h_kv,
is_causal=causal,
)
out[i] = O.transpose(0, 1)
lse[i] = LSE
return out, lse
out_flash, lse_flash = flash_mla()
out_torch, lse_torch = ref_mla()
cal_diff(out_flash, out_torch, "out")
cal_diff(lse_flash, lse_torch, "lse")
t = triton.testing.do_bench(flash_mla)
FLOPS = s_q * total_seqlens * h_q * (d + dv) * 2
bytes = (total_seqlens * h_kv * d + b * s_q * h_q * d + b * s_q * h_q * dv) * (
torch.finfo(q.dtype).bits // 8
)
print(
f"{t:.3f} ms, {FLOPS / 10 ** 9 / t:.0f} TFLOPS, {bytes / 10 ** 6 / t:.0f} GB/s"
)
def main(torch_dtype):
device = torch.device("cuda:0")
torch.set_default_dtype(torch_dtype)
torch.set_default_device(device)
torch.cuda.set_device(device)
torch.manual_seed(0)
random.seed(0)
h_kv = 1
d, dv = 576, 512
causal = True
for b in [128]:
for s in [4096, 8192]:
for h_q in [16, 32, 64, 128]: # TP = 8, 4, 2, 1
for s_q in [1, 2]: # MTP = 1, 2
for varlen in [False, True]:
test_flash_mla(b, s_q, s, h_q, h_kv, d, dv, causal, varlen)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--dtype",
type=str,
choices=["bf16", "fp16"],
default="bf16",
help="Data type to use for testing (bf16 or fp16)",
)
args = parser.parse_args()
torch_dtype = torch.bfloat16
if args.dtype == "fp16":
torch_dtype = torch.float16
main(torch_dtype)
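For reference, the TFLOPS and GB/s numbers printed above come from analytic counts: the two GEMMs per query give 2 * s_q * total_seqlens * h_q * (d + dv) FLOPs, and traffic is dominated by streaming the KV cache plus reading Q and writing O; since do_bench reports milliseconds, dividing FLOPs by 1e9 and by t yields TFLOP/s. A worked sketch of that arithmetic for one configuration (illustrative numbers only):

```python
# Worked example (illustrative numbers) of the FLOPS / bytes accounting that
# test_flash_mla prints, for b=128, s_q=1, mean seqlen 4096, h_q=128, h_kv=1.
b, s_q, h_q, h_kv, d, dv = 128, 1, 128, 1, 576, 512
total_seqlens = 128 * 4096                      # sum of cache_seqlens
bytes_per_elt = 2                               # bf16

flops = s_q * total_seqlens * h_q * (d + dv) * 2            # Q@K^T and P@V
bytes_moved = (total_seqlens * h_kv * d                      # KV cache read
               + b * s_q * h_q * d                           # Q read
               + b * s_q * h_q * dv) * bytes_per_elt         # O write

t_ms = 1.0                                      # pretend do_bench returned 1 ms
print(f"{flops / 1e9 / t_ms:.0f} TFLOPS, {bytes_moved / 1e6 / t_ms:.0f} GB/s")
# -> "146 TFLOPS, 640 GB/s" for this hypothetical 1 ms run
```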


@@ -0,0 +1,830 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief
Hopper Mixed-input Grouped GEMM example using CUTLASS 3 APIs for NVIDIA Hopper architecture.
See 55_hopper_int4_bf16_gemm.cu for more details about W4A16 GEMMs with layout shuffling.
Limitations:
1) Only row-wise scaling is supported. Zero-points and block-wise scaling are currently not supported.
To run this example:
$ ./examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm --m=2048 --n=2048 --k=2048 --mode=1 --groups=10
The above command sizes all 10 groups at the given m, n, k values.
Omitting any of the problem dimensions randomizes that dimension across the groups.
The same applies to alpha and beta, which are randomized across the groups when not specified.
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <numeric>
#include <typeinfo>
#include <float.h>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "grouped_mixed_dtype_utils.hpp"
using namespace cute;
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>; // <M,N,K> per group
using MmaType = cutlass::bfloat16_t;
using QuantType = cutlass::int4b_t;
constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value;
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = MmaType;
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = QuantType; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// This example manually swaps and transposes the A and B operands, so keep the transposed input layouts
using LayoutA_Transpose = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
using LayoutB_Transpose = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
// Need to pass a pointer type to make the 3rd dimension of Stride be _0
using StrideA = cute::remove_pointer_t<cutlass::detail::TagToStrideA_t<LayoutA*>>;
using StrideB = cute::remove_pointer_t<cutlass::detail::TagToStrideB_t<LayoutB*>>;
// Define the CuTe layout for the reordered quantized tensor B
// LayoutAtomQuant places values that will be read by the same thread in contiguous locations in global memory.
// It specifies the reordering within a single warp's fragment
// using ValueShuffle = Layout<_1>; // no value reordering
using ValueShuffle = Layout<Shape<_2,_4>, Stride<_4,_1>>; // order [0,2,4,6,1,3,5,7]
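// (Illustrative note, not in the original source.) Shape<_2,_4>:Stride<_4,_1> maps linear index k
// to offset (k % 2) * 4 + (k / 2), i.e. the sequence [0,4,1,5,2,6,3,7]; its inverse permutation is
// [0,2,4,6,1,3,5,7], the value ordering quoted in the comment above.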
int constexpr NumShuffleAtoms = 1;
using MmaAtomShape = Layout<Shape<_1,Int<NumShuffleAtoms>>>;
using LayoutAtomQuant = decltype(cutlass::compute_memory_reordering_atom<MmaType, MmaAtomShape, ValueShuffle>());
using LayoutB_Reordered = decltype(cute::tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,Int<1>>, StrideB>{}));
using ElementZero = cutlass::bfloat16_t;
using ElementScale = cutlass::bfloat16_t;
using LayoutScale = cutlass::layout::RowMajor;
// C/D matrix configuration
using ElementC = cutlass::half_t; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// D matrix configuration
using ElementD = ElementC;
using LayoutD = LayoutC;
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
// Core kernel configurations
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using TileShape = Shape<_128,_16,cute::Int<TileShapeK>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; // Epilogue to launch
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, typename cutlass::layout::LayoutTranspose<LayoutC>::type *, AlignmentC,
ElementD, typename cutlass::layout::LayoutTranspose<LayoutD>::type *, AlignmentD,
EpilogueSchedule
>::CollectiveOp;
using CollectiveMainloopConvertOnly = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementB, LayoutB_Transpose *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelConvertOnly = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopConvertOnly,
CollectiveEpilogue
>;
using GemmConvertOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelConvertOnly>;
using CollectiveMainloopConvertOnlyShuffled = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementB, LayoutB_Reordered *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelConvertOnlyShuffled = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopConvertOnlyShuffled,
CollectiveEpilogue
>;
using GemmConvertOnlyShuffled = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelConvertOnlyShuffled>;
// =========================================================== MIXED INPUT WITH SCALES ===========================================================================
// The scale information must be paired with the operand that will be scaled. In this example, B is scaled, so we make a tuple of B's information and the scale information.
using CollectiveMainloopScaleOnly = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
cute::tuple<ElementB, ElementScale>, LayoutB_Transpose *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopScaleOnly,
CollectiveEpilogue
>;
using GemmScaleOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnly>;
using CollectiveMainloopScaleOnlyShuffled = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
cute::tuple<ElementB, ElementScale>, LayoutB_Reordered *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelScaleOnlyShuffled = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopScaleOnlyShuffled,
CollectiveEpilogue
>;
using GemmScaleOnlyShuffled = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnlyShuffled>;
using StrideC = typename GemmKernelConvertOnly::InternalStrideC;
using StrideD = typename GemmKernelConvertOnly::InternalStrideD;
using StrideC_ref = cutlass::detail::TagToStrideC_t<LayoutC>;
using StrideD_ref = cutlass::detail::TagToStrideC_t<LayoutD>;
using StrideS = typename CollectiveMainloopScaleOnly::StrideScale;
using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>;
// Host-side allocations
std::vector<int64_t> offset_A;
std::vector<int64_t> offset_B;
std::vector<int64_t> offset_B_dq;
std::vector<int64_t> offset_C;
std::vector<int64_t> offset_D;
std::vector<int64_t> offset_scale;
std::vector<int64_t> offset_zero;
std::vector<StrideA> stride_A_host;
std::vector<StrideB> stride_B_host;
std::vector<StrideC> stride_C_host;
std::vector<StrideD> stride_D_host;
std::vector<StrideC_ref> stride_C_host_ref;
std::vector<StrideD_ref> stride_D_host_ref;
std::vector<StrideS> stride_S_host;
std::vector<StrideS_ref> stride_S_host_ref;
std::vector<ElementAccumulator> alpha_host;
std::vector<ElementAccumulator> beta_host;
uint64_t seed = 2020;
// Device-side allocations
cutlass::DeviceAllocation<typename ProblemShape::UnderlyingProblemShape> problem_sizes;
cutlass::DeviceAllocation<MmaType> block_A;
cutlass::DeviceAllocation<QuantType> block_B;
cutlass::DeviceAllocation<MmaType> block_B_dq;
cutlass::DeviceAllocation<ElementScale> block_scale;
cutlass::DeviceAllocation<ElementZero> block_zero;
cutlass::DeviceAllocation<ElementC> block_C;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput> block_ref_D;
cutlass::DeviceAllocation<const MmaType *> ptr_A;
cutlass::DeviceAllocation<const QuantType *> ptr_B;
cutlass::DeviceAllocation<const MmaType *> ptr_B_dq;
cutlass::DeviceAllocation<const ElementScale *> ptr_scale;
cutlass::DeviceAllocation<const ElementZero *> ptr_zero;
cutlass::DeviceAllocation<const ElementC *> ptr_C;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput *> ptr_D;
cutlass::DeviceAllocation<StrideA> stride_A;
cutlass::DeviceAllocation<StrideB> stride_B;
cutlass::DeviceAllocation<LayoutB_Reordered> layout_B_reordered;
cutlass::DeviceAllocation<StrideC> stride_C;
cutlass::DeviceAllocation<StrideD> stride_D;
cutlass::DeviceAllocation<StrideC_ref> stride_C_ref;
cutlass::DeviceAllocation<StrideD_ref> stride_D_ref;
cutlass::DeviceAllocation<StrideS_ref> stride_S_ref;
cutlass::DeviceAllocation<StrideS> stride_S;
// Note: these are arrays of pointers to per-group alpha and beta scaling values
cutlass::DeviceAllocation<ElementAccumulator*> alpha_device;
cutlass::DeviceAllocation<ElementAccumulator*> beta_device;
cutlass::DeviceAllocation<ElementAccumulator> block_alpha;
cutlass::DeviceAllocation<ElementAccumulator> block_beta;
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options : GroupedMixedDtypeOptions<QuantType> {
using Base = GroupedMixedDtypeOptions<QuantType>;
bool shuffle = true;
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
cmd.get_cmd_line_argument("shuffle", shuffle);
this->Base::parse(argc, args);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "69_hopper_int4_bf16_grouped_gemm\n\n"
<< " Hopper Mixed Dtype Grouped GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM for all groups\n"
<< " --n=<int> Sets the N extent of the GEMM for all groups\n"
<< " --k=<int> Sets the K extent of the GEMM for all groups\n"
<< " --groups=<int> Sets the number of individual GEMM problems for Grouped GEMM\n"
<< " --mode=<int> The mode to run the gemm. 0 does (A @ B), 1 means A @ (scale * B), 2 means A @ (scale * B + zero-point).\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform\n\n"
<< " --warmup=<int> Number of warmup iterations to perform\n\n"
<< " --shuffle=<boolean> Enable the offline layout swizzling.\n\n"
<< " --benchmark=<str> Executes a benchmark problem size.\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "69_hopper_int4_bf16_grouped_gemm" << " --m=1024 --n=512 --k=1024 --groups=10 --alpha=1 --beta=0 \n\n";
return out;
}
};
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Allocates device-side data
void allocate(Options const& options) {
int64_t total_elements_A = 0;
int64_t total_elements_B = 0;
int64_t total_elements_B_dq = 0;
int64_t total_elements_C = 0;
int64_t total_elements_D = 0;
int64_t total_elements_scale = 0;
int64_t total_elements_zero = 0;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
const int scale_k = 1;
offset_A.push_back(total_elements_A);
offset_B.push_back(total_elements_B * cutlass::sizeof_bits<QuantType>::value / 8);
offset_B_dq.push_back(total_elements_B_dq);
offset_C.push_back(total_elements_C);
offset_D.push_back(total_elements_D);
offset_scale.push_back(total_elements_scale);
offset_zero.push_back(total_elements_zero);
int64_t elements_A = M * K;
int64_t elements_B = K * N;
int64_t elements_B_dq = K * N;
int64_t elements_C = M * N;
int64_t elements_D = M * N;
int64_t elements_scale = scale_k * N;
int64_t elements_zero = scale_k * N;
total_elements_A += elements_A;
total_elements_B += elements_B;
total_elements_B_dq += elements_B_dq;
total_elements_C += elements_C;
total_elements_D += elements_D;
total_elements_scale += elements_scale;
total_elements_zero += elements_zero;
stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}));
stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {N, M, 1}));
stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {N, M, 1}));
stride_C_host_ref.push_back(cutlass::make_cute_packed_stride(StrideC_ref{}, {M, N, 1}));
stride_D_host_ref.push_back(cutlass::make_cute_packed_stride(StrideD_ref{}, {M, N, 1}));
stride_S_host_ref.push_back(cutlass::make_cute_packed_stride(StrideS_ref{}, {N, scale_k, 1}));
stride_S_host.push_back(cutlass::make_cute_packed_stride(StrideS{}, {N, scale_k, 1}));
}
block_A.reset(total_elements_A);
block_B.reset(total_elements_B);
block_B_dq.reset(total_elements_B_dq);
block_C.reset(total_elements_C);
block_D.reset(total_elements_D);
block_ref_D.reset(total_elements_D);
block_scale.reset(total_elements_scale);
block_zero.reset(total_elements_zero);
block_alpha.reset(options.groups);
block_beta.reset(options.groups);
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(Options &options) {
uint64_t seed = 2020;
problem_sizes.reset(options.groups);
problem_sizes.copy_from_host(options.problem_sizes_host.data());
//
// Assign pointers
//
std::vector<MmaType *> ptr_A_host(options.groups);
std::vector<QuantType *> ptr_B_host(options.groups);
std::vector<MmaType *> ptr_B_dq_host(options.groups);
std::vector<ElementC *> ptr_C_host(options.groups);
std::vector<ElementC *> ptr_D_host(options.groups);
std::vector<ElementScale *> ptr_scale_host(options.groups);
std::vector<ElementZero *> ptr_zero_host(options.groups);
std::vector<ElementAccumulator *> ptr_alpha_host(options.groups);
std::vector<ElementAccumulator *> ptr_beta_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
ptr_A_host.at(i) = block_A.get() + offset_A.at(i);
ptr_B_host.at(i) = block_B.get() + offset_B.at(i);
ptr_B_dq_host.at(i) = block_B_dq.get() + offset_B_dq.at(i);
ptr_C_host.at(i) = block_C.get() + offset_C.at(i);
ptr_D_host.at(i) = block_D.get() + offset_D.at(i);
ptr_scale_host.at(i) = block_scale.get() + offset_scale.at(i);
ptr_zero_host.at(i) = block_zero.get() + offset_zero.at(i);
alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast<ElementAccumulator>((rand() % 5) + 1) : options.alpha);
beta_host.push_back((options.beta == FLT_MAX) ? static_cast<ElementAccumulator>(rand() % 5) : options.beta);
ptr_alpha_host.at(i) = block_alpha.get() + i;
ptr_beta_host.at(i) = block_beta.get() + i;
}
ptr_A.reset(options.groups);
ptr_A.copy_from_host(ptr_A_host.data());
ptr_B.reset(options.groups);
ptr_B.copy_from_host(ptr_B_host.data());
ptr_B_dq.reset(options.groups);
ptr_B_dq.copy_from_host(ptr_B_dq_host.data());
ptr_C.reset(options.groups);
ptr_C.copy_from_host(ptr_C_host.data());
ptr_D.reset(options.groups);
ptr_D.copy_from_host(ptr_D_host.data());
ptr_scale.reset(options.groups);
ptr_scale.copy_from_host(ptr_scale_host.data());
ptr_zero.reset(options.groups);
ptr_zero.copy_from_host(ptr_zero_host.data());
stride_A.reset(options.groups);
stride_A.copy_from_host(stride_A_host.data());
stride_B.reset(options.groups);
stride_B.copy_from_host(stride_B_host.data());
stride_C.reset(options.groups);
stride_C.copy_from_host(stride_C_host.data());
stride_D.reset(options.groups);
stride_D.copy_from_host(stride_D_host.data());
stride_C_ref.reset(options.groups);
stride_C_ref.copy_from_host(stride_C_host_ref.data());
stride_D_ref.reset(options.groups);
stride_D_ref.copy_from_host(stride_D_host_ref.data());
stride_S_ref.reset(options.groups);
stride_S_ref.copy_from_host(stride_S_host_ref.data());
stride_S.reset(options.groups);
stride_S.copy_from_host(stride_S_host.data());
alpha_device.reset(options.groups);
alpha_device.copy_from_host(ptr_alpha_host.data());
beta_device.reset(options.groups);
beta_device.copy_from_host(ptr_beta_host.data());
initialize_tensor(block_A, seed + 2023);
initialize_quant_tensor(block_B, seed + 2022);
initialize_tensor(block_C, seed + 2021);
initialize_scale(block_scale, options);
initialize_zero(block_zero, options);
block_alpha.copy_from_host(alpha_host.data());
block_beta.copy_from_host(beta_host.data());
for (int32_t i = 0; i < options.groups; ++i) {
const int scale_k = 1;
auto shape_B = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), cute::get<2>(options.problem_sizes_host[i]), Int<1>{});
auto shape_scale = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), scale_k, Int<1>{});
auto layout_B = make_layout(shape_B, stride_B_host.at(i));
auto layout_scale = make_layout(shape_scale, stride_S_host_ref.at(i));
cudaStream_t stream = cudaStreamDefault;
cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale, options.k, stream);
}
problem_sizes.reset(options.groups);
if (options.shuffle) {
std::vector<LayoutB_Reordered> layout_B_reordered_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
auto shape_B = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), cute::get<2>(options.problem_sizes_host[i]), Int<1>{});
auto layout_B = make_layout(shape_B, stride_B_host.at(i));
// Repeat the reorder layout atom to tile the whole tensor shape
layout_B_reordered_host[i] = tile_to_shape(LayoutAtomQuant{}, shape_B);
cutlass::reorder_tensor(block_B.get() + offset_B.at(i), layout_B, layout_B_reordered_host[i]);
if (i == 0) {
print("Quantized tensor layout: ");
print(layout_B_reordered_host[0]);
print("\n");
}
}
layout_B_reordered.reset(options.groups);
layout_B_reordered.copy_from_host(layout_B_reordered_host.data());
}
// Reverse MN -> NM for SwapAB
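// (Illustrative note, not in the original source.) Because the A and B operands are swapped above,
// the kernel effectively computes D^T = B^T * A^T, so each group's problem shape must present N as
// the first (M-like) extent and M as the second before being copied to the device.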
for (int32_t i = 0; i < options.groups; ++i) {
auto [M, N, K] = options.problem_sizes_host[i];
options.problem_sizes_host[i] = make_tuple(N, M, K);
}
problem_sizes.copy_from_host(options.problem_sizes_host.data());
}
/// Populates a Gemm::Arguments structure from the given commandline options
template <typename Gemm>
typename Gemm::Arguments args_from_options(Options const& options, bool host_problem_shapes_available = true)
{
using Args = typename Gemm::Arguments;
auto&& dB = [&]() {
// NOTE: add GemmScaleWithZeroPointShuffled
if constexpr (cute::is_same_v<Gemm, GemmConvertOnlyShuffled> ||
cute::is_same_v<Gemm, GemmScaleOnlyShuffled>) {
// offline swizzling is enabled.
return layout_B_reordered.get();
}
else {
return stride_B.get();
}
}();
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
Args arguments;
decltype(arguments.epilogue.thread) fusion_args;
if (options.alpha != FLT_MAX && options.beta != FLT_MAX) {
// If both alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches.
fusion_args.alpha = options.alpha;
fusion_args.beta = options.beta;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = nullptr;
fusion_args.beta_ptr_array = nullptr;
// Single alpha and beta for all groups
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 0};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 0};
}
else {
// If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups.
fusion_args.alpha = 0;
fusion_args.beta = 0;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = alpha_device.get();
fusion_args.beta_ptr_array = beta_device.get();
// One alpha and beta per each group
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 1};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 1};
}
if constexpr (Gemm::CollectiveMainloop::KernelConversionMode == Gemm::CollectiveMainloop::ConversionMode::DirectConvert) {
arguments = Args {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_B.get(), dB, ptr_A.get(), stride_A.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info
};
}
else if constexpr (Gemm::CollectiveMainloop::KernelConversionMode == Gemm::CollectiveMainloop::ConversionMode::ConvertAndScale) {
arguments = Args {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.k},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info
};
}
else {
std::cerr << "Invalid mode " << options.mode << ". Must be 0, 1 or 2." << std::endl;
exit(-1);
}
return arguments;
}
bool verify(Options const& options) {
bool passed = true;
constexpr bool IsFP8Input = cute::is_same_v<MmaType, cutlass::float_e4m3_t> || cute::is_same_v<MmaType, cutlass::float_e5m2_t>;
using FP8Sched = cute::conditional_t<size<0>(TileShape{}) == 64, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>;
using ScheduleRef = cute::conditional_t<IsFP8Input, FP8Sched, cutlass::gemm::collective::KernelScheduleAuto>;
using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaType, LayoutA, AlignmentA,
MmaType, LayoutB, AlignmentB,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAuto,
ScheduleRef
>::CollectiveOp;
using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD,
cutlass::epilogue::NoSmemWarpSpecialized
>::CollectiveOp;
using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int>, // Indicates ProblemShape
CollectiveMainloopRef,
CollectiveEpilogueRef
>;
using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
using StrideA_verif = typename GemmRef::GemmKernel::StrideA;
using StrideB_verif = typename GemmRef::GemmKernel::StrideB;
using StrideC_verif = typename GemmRef::GemmKernel::StrideC;
using StrideD_verif = typename GemmRef::GemmKernel::StrideD;
const ElementD epsilon(1e-2f);
const ElementD non_zero_floor(1e-4f);
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto N = get<0>(problem);
auto M = get<1>(problem);
auto K = get<2>(problem);
if (M == 0) {
continue;
}
else {
StrideA_verif stride_A_verif;
StrideB_verif stride_B_verif;
stride_A_verif = cutlass::make_cute_packed_stride(StrideA_verif{}, cute::make_shape(M, K, 1));
stride_B_verif = cutlass::make_cute_packed_stride(StrideB_verif{}, cute::make_shape(N, K, 1));
//
// Compute reference output
//
typename GemmRef::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{M, N, K},
{block_A.get() + offset_A.at(i), stride_A_verif, block_B_dq.get() + offset_B_dq.at(i), stride_B_verif},
{{alpha_host.at(i), beta_host.at(i)}, block_C.get() + offset_C.at(i), stride_C_host_ref.at(i), block_ref_D.get() + offset_D.at(i), stride_D_host_ref.at(i)}
};
// Run the gemm where the scaling is performed outside of the kernel.
GemmRef gemm_ref;
size_t workspace_size = GemmRef::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
CUTLASS_CHECK(gemm_ref.can_implement(arguments));
CUTLASS_CHECK(gemm_ref.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm_ref.run());
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor);
std::cout << "Group: " << i << " Status: " << passed << std::endl;
}
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options, bool host_problem_shapes_available = true)
{
allocate(options);
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<Gemm>(options, host_problem_shapes_available);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
std::cout << "We passed all checks\n";
// Check if output from CUTLASS kernel and reference kernel are equal or not
MixedDtypeResult result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
grouped_mixed_dtype_profiling(gemm, options, result, alpha_host, beta_host);
if (!result.passed) {
exit(-1);
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.3 Toolkit to run this example
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) {
std::cerr << "This example requires CUDA 12.3 or newer.\n";
// Returning zero so this test passes on older Toolkits; its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
if (options.mode == MixedDtypeGemmMode::ConvertOnly) {
std::cout << "Running in no scale mode." << std::endl;
if (options.shuffle) {
std::cout << "Offline shuffle enabled." << std::endl;
run<GemmConvertOnlyShuffled>(options, false);
} else {
std::cout << "Offline shuffle disabled." << std::endl;
run<GemmConvertOnly>(options, false);
}
}
else if (options.mode == MixedDtypeGemmMode::ScaleOnly) {
std::cout << "Running in per-column scale mode." << std::endl;
if (options.shuffle) {
std::cout << "Offline shuffle enabled." << std::endl;
run<GemmScaleOnlyShuffled>(options, false);
} else {
std::cout << "Offline shuffle disabled." << std::endl;
run<GemmScaleOnly>(options, false);
}
}
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////


@ -0,0 +1,765 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief
Hopper Mixed-input Grouped GEMM example using CUTLASS 3 APIs for NVIDIA Hopper architecture.
See 55_hopper_int4_fp8_gemm.cu for more details about W4A8 GEMMs with a lookup table.
Limitations:
1) Only row-wise scaling is supported. Zero-points and block-wise scaling are currently not supported.
To run this example:
$ ./examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_fp8_grouped_gemm --m=2048 --n=2048 --k=2048 --mode=1 --groups=10
The above command sizes all 10 groups at the given m, n, and k values.
Omitting any of the problem dimensions randomizes that dimension across the groups.
The same applies to alpha and beta, which are randomized across the groups when not specified.
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <numeric>
#include <typeinfo>
#include <float.h>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "grouped_mixed_dtype_utils.hpp"
using namespace cute;
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>; // <M,N,K> per group
using MmaType = cutlass::float_e4m3_t;
using QuantType = cutlass::int4b_t;
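// Size the K-mode of the tile so that one K-slice of MmaType spans 128 bytes:
// for 8-bit e4m3, TileShapeK = 128 * 8 / 8 = 128 elements.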
constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value;
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = MmaType;
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = QuantType; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// This example manually swaps and transposes the A and B operands, so the transposed input layouts are kept
using LayoutA_Transpose = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
using LayoutB_Transpose = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
// Need to pass a pointer type to make the 3rd dimension of Stride be _0
using StrideA = cute::remove_pointer_t<cutlass::detail::TagToStrideA_t<LayoutA*>>;
using StrideB = cute::remove_pointer_t<cutlass::detail::TagToStrideB_t<LayoutB*>>;
// Define the CuTe layout for the reordered quantized tensor B.
// LayoutAtomQuant places values that will be read by the same thread in contiguous locations in global memory.
// It specifies the reordering within a single warp's fragment.
using LayoutAtomQuant = decltype(cutlass::compute_memory_reordering_atom<MmaType>());
using LayoutB_Reordered = decltype(cute::tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,Int<1>>, StrideB>{}));
using ElementZero = cutlass::float_e4m3_t;
using ElementScale = cutlass::float_e4m3_t;
using LayoutScale = cutlass::layout::RowMajor;
// C/D matrix configuration
using ElementC = cutlass::half_t; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// D matrix configuration
using ElementD = ElementC;
using LayoutD = LayoutC;
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
// Core kernel configurations
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using TileShape = Shape<_128,_16,cute::Int<TileShapeK>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; // Epilogue to launch
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, typename cutlass::layout::LayoutTranspose<LayoutC>::type *, AlignmentC,
ElementD, typename cutlass::layout::LayoutTranspose<LayoutD>::type *, AlignmentD,
EpilogueSchedule
>::CollectiveOp;
// =========================================================== MIXED INPUT WITH SCALES ===========================================================================
// The Scale information must get paired with the operand that will be scaled. In this example, B is scaled so we make a tuple of B's information and the scale information.
using CollectiveMainloopScaleOnly = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
cute::tuple<ElementB, cutlass::Array<ElementScale, 8>>, LayoutB_Transpose *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopScaleOnly,
CollectiveEpilogue
>;
using CollectiveMainloopShuffled = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
cute::tuple<ElementB, cutlass::Array<ElementScale, 8>>, LayoutB_Reordered *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelShuffled = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopShuffled,
CollectiveEpilogue
>;
using GemmScaleOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnly>;
using GemmShuffled = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelShuffled>;
using StrideC = typename GemmKernelScaleOnly::InternalStrideC;
using StrideD = typename GemmKernelScaleOnly::InternalStrideD;
using StrideC_ref = cutlass::detail::TagToStrideC_t<LayoutC>;
using StrideD_ref = cutlass::detail::TagToStrideC_t<LayoutD>;
using StrideS = typename CollectiveMainloopScaleOnly::StrideScale;
using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>;
// Host-side allocations
std::vector<int64_t> offset_A;
std::vector<int64_t> offset_B;
std::vector<int64_t> offset_B_dq;
std::vector<int64_t> offset_C;
std::vector<int64_t> offset_D;
std::vector<int64_t> offset_scale;
std::vector<int64_t> offset_zero;
std::vector<StrideA> stride_A_host;
std::vector<StrideB> stride_B_host;
std::vector<StrideC> stride_C_host;
std::vector<StrideD> stride_D_host;
std::vector<StrideC_ref> stride_C_host_ref;
std::vector<StrideD_ref> stride_D_host_ref;
std::vector<StrideS> stride_S_host;
std::vector<StrideS_ref> stride_S_host_ref;
std::vector<ElementAccumulator> alpha_host;
std::vector<ElementAccumulator> beta_host;
uint64_t seed = 2020;
// Device-side allocations
cutlass::DeviceAllocation<typename ProblemShape::UnderlyingProblemShape> problem_sizes;
cutlass::DeviceAllocation<MmaType> block_A;
cutlass::DeviceAllocation<QuantType> block_B;
cutlass::DeviceAllocation<ElementB> block_B_modified;
cutlass::DeviceAllocation<MmaType> block_B_dq;
cutlass::DeviceAllocation<ElementScale> block_scale;
cutlass::DeviceAllocation<cutlass::Array<ElementScale, 8>> block_scale_packed;
cutlass::DeviceAllocation<ElementZero> block_zero;
cutlass::DeviceAllocation<ElementC> block_C;
cutlass::DeviceAllocation<typename GemmScaleOnly::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename GemmScaleOnly::EpilogueOutputOp::ElementOutput> block_ref_D;
cutlass::DeviceAllocation<const MmaType *> ptr_A;
cutlass::DeviceAllocation<const QuantType *> ptr_B;
cutlass::DeviceAllocation<const MmaType *> ptr_B_dq;
cutlass::DeviceAllocation<const cutlass::Array<ElementScale, 8> *> ptr_scale_packed;
cutlass::DeviceAllocation<const ElementZero *> ptr_zero;
cutlass::DeviceAllocation<const ElementC *> ptr_C;
cutlass::DeviceAllocation<typename GemmScaleOnly::EpilogueOutputOp::ElementOutput *> ptr_D;
cutlass::DeviceAllocation<StrideA> stride_A;
cutlass::DeviceAllocation<StrideB> stride_B;
cutlass::DeviceAllocation<LayoutB_Reordered> layout_B_reordered;
cutlass::DeviceAllocation<StrideC> stride_C;
cutlass::DeviceAllocation<StrideD> stride_D;
cutlass::DeviceAllocation<StrideC_ref> stride_C_ref;
cutlass::DeviceAllocation<StrideD_ref> stride_D_ref;
cutlass::DeviceAllocation<StrideS_ref> stride_S_ref;
cutlass::DeviceAllocation<StrideS> stride_S;
// Note: these are arrays of pointers to the per-group alpha and beta scaling values
cutlass::DeviceAllocation<ElementAccumulator*> alpha_device;
cutlass::DeviceAllocation<ElementAccumulator*> beta_device;
cutlass::DeviceAllocation<ElementAccumulator> block_alpha;
cutlass::DeviceAllocation<ElementAccumulator> block_beta;
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options : GroupedMixedDtypeOptions<QuantType> {
using Base = GroupedMixedDtypeOptions<QuantType>;
bool shuffle = true;
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
cmd.get_cmd_line_argument("shuffle", shuffle);
this->Base::parse(argc, args);
mode = 1; // Override the mode so this example always runs in scale-only mode
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "69_hopper_int4_fp8_grouped_gemm\n\n"
<< " Hopper Mixed Dtype Grouped GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM for all groups\n"
<< " --n=<int> Sets the N extent of the GEMM for all groups\n"
<< " --k=<int> Sets the K extent of the GEMM for all groups\n"
<< " --groups=<int> Sets the number of individual GEMM problems for Grouped GEMM\n"
<< " --c=<int> The size of each chunk for the scales and zeros. To broadcast a vector of scales or zeros, set the group size to K.\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform\n\n"
<< " --warmup=<int> Number of warmup iterations to perform\n\n"
<< " --shuffle=<boolean> Enable the offline layout swizzling.\n\n"
<< " --benchmark=<str> Executes a benchmark problem size.\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "69_hopper_int4_fp8_grouped_gemm" << " --m=1024 --n=512 --k=1024 --groups=10 --alpha=1 --beta=0 \n\n";
return out;
}
};
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
// In the mainloop, PRMT can only select one byte out of eight source bytes, so the sign bit is handled by an extra PRMT.
// The encodings of positive and negative values are therefore unified (except for the sign bit).
// For instance, 1 becomes 0b0111, which matches the encoding of -1 (0b1111).
/// Allocates device-side data
void allocate(Options const& options) {
int64_t total_elements_A = 0;
int64_t total_elements_B = 0;
int64_t total_elements_B_dq = 0;
int64_t total_elements_C = 0;
int64_t total_elements_D = 0;
int64_t total_elements_scale = 0;
int64_t total_elements_zero = 0;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
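// A single scale chunk spans the entire K extent, i.e., one scale (and zero) value per column of B,
// which corresponds to the per-column (row-wise) scaling this example supports.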
const int scale_k = 1;
offset_A.push_back(total_elements_A);
offset_B.push_back(total_elements_B * cutlass::sizeof_bits<QuantType>::value / 8);
offset_B_dq.push_back(total_elements_B_dq);
offset_C.push_back(total_elements_C);
offset_D.push_back(total_elements_D);
offset_scale.push_back(total_elements_scale);
offset_zero.push_back(total_elements_zero);
int64_t elements_A = M * K;
int64_t elements_B = K * N ;
int64_t elements_B_dq = K * N;
int64_t elements_C = M * N;
int64_t elements_D = M * N;
int64_t elements_scale = scale_k * N;
int64_t elements_zero = scale_k * N;
total_elements_A += elements_A;
total_elements_B += elements_B;
total_elements_B_dq += elements_B_dq;
total_elements_C += elements_C;
total_elements_D += elements_D;
total_elements_scale += elements_scale;
total_elements_zero += elements_zero;
stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}));
stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {N, M, 1}));
stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {N, M, 1}));
stride_C_host_ref.push_back(cutlass::make_cute_packed_stride(StrideC_ref{}, {M, N, 1}));
stride_D_host_ref.push_back(cutlass::make_cute_packed_stride(StrideD_ref{}, {M, N, 1}));
stride_S_host_ref.push_back(cutlass::make_cute_packed_stride(StrideS_ref{}, {N, scale_k, 1}));
stride_S_host.push_back(cutlass::make_cute_packed_stride(StrideS{}, {N, scale_k, 1}));
}
block_A.reset(total_elements_A);
block_B.reset(total_elements_B);
block_B_modified.reset(total_elements_B);
block_B_dq.reset(total_elements_B_dq);
block_C.reset(total_elements_C);
block_D.reset(total_elements_D);
block_ref_D.reset(total_elements_D);
block_scale.reset(total_elements_scale);
block_scale_packed.reset(total_elements_scale);
block_zero.reset(total_elements_zero);
block_alpha.reset(options.groups);
block_beta.reset(options.groups);
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(Options& options) {
uint64_t seed = 2020;
problem_sizes.reset(options.groups);
problem_sizes.copy_from_host(options.problem_sizes_host.data());
//
// Assign pointers
//
std::vector<MmaType *> ptr_A_host(options.groups);
std::vector<QuantType *> ptr_B_host(options.groups);
std::vector<MmaType *> ptr_B_dq_host(options.groups);
std::vector<ElementC *> ptr_C_host(options.groups);
std::vector<ElementC *> ptr_D_host(options.groups);
std::vector<cutlass::Array<ElementScale, 8> *> ptr_scale_packed_host(options.groups);
std::vector<ElementZero *> ptr_zero_host(options.groups);
std::vector<ElementAccumulator *> ptr_alpha_host(options.groups);
std::vector<ElementAccumulator *> ptr_beta_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
ptr_A_host.at(i) = block_A.get() + offset_A.at(i);
ptr_B_host.at(i) = block_B_modified.get() + offset_B.at(i);
ptr_B_dq_host.at(i) = block_B_dq.get() + offset_B_dq.at(i);
ptr_C_host.at(i) = block_C.get() + offset_C.at(i);
ptr_D_host.at(i) = block_D.get() + offset_D.at(i);
ptr_scale_packed_host.at(i) = block_scale_packed.get() + offset_scale.at(i);
ptr_zero_host.at(i) = block_zero.get() + offset_zero.at(i);
alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast<ElementAccumulator>((rand() % 5) + 1) : options.alpha);
beta_host.push_back((options.beta == FLT_MAX) ? static_cast<ElementAccumulator>(rand() % 5) : options.beta);
ptr_alpha_host.at(i) = block_alpha.get() + i;
ptr_beta_host.at(i) = block_beta.get() + i;
}
ptr_A.reset(options.groups);
ptr_A.copy_from_host(ptr_A_host.data());
ptr_B.reset(options.groups);
ptr_B.copy_from_host(ptr_B_host.data());
ptr_B_dq.reset(options.groups);
ptr_B_dq.copy_from_host(ptr_B_dq_host.data());
ptr_C.reset(options.groups);
ptr_C.copy_from_host(ptr_C_host.data());
ptr_D.reset(options.groups);
ptr_D.copy_from_host(ptr_D_host.data());
ptr_scale_packed.reset(options.groups);
ptr_scale_packed.copy_from_host(ptr_scale_packed_host.data());
ptr_zero.reset(options.groups);
ptr_zero.copy_from_host(ptr_zero_host.data());
stride_A.reset(options.groups);
stride_A.copy_from_host(stride_A_host.data());
stride_B.reset(options.groups);
stride_B.copy_from_host(stride_B_host.data());
stride_C.reset(options.groups);
stride_C.copy_from_host(stride_C_host.data());
stride_D.reset(options.groups);
stride_D.copy_from_host(stride_D_host.data());
stride_C_ref.reset(options.groups);
stride_C_ref.copy_from_host(stride_C_host_ref.data());
stride_D_ref.reset(options.groups);
stride_D_ref.copy_from_host(stride_D_host_ref.data());
stride_S_ref.reset(options.groups);
stride_S_ref.copy_from_host(stride_S_host_ref.data());
stride_S.reset(options.groups);
stride_S.copy_from_host(stride_S_host.data());
alpha_device.reset(options.groups);
alpha_device.copy_from_host(ptr_alpha_host.data());
beta_device.reset(options.groups);
beta_device.copy_from_host(ptr_beta_host.data());
initialize_tensor(block_A, seed + 2023);
initialize_quant_tensor(block_B, seed + 2022);
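// Re-encode the int4 values into the unified representation described above
// (positive and negative magnitudes share the same bit pattern apart from the sign bit).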
cutlass::unified_encode_int4b(block_B.get(), block_B_modified.get(), block_B.size());
initialize_tensor(block_C, seed + 2021);
initialize_scale(block_scale, options);
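// Pack the FP8 scales into Array<ElementScale, 8> chunks, the packed form consumed by the scaled mainloop.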
cutlass::pack_scale_fp8(block_scale.get(), block_scale_packed.get(), block_scale.size());
initialize_zero(block_zero, options);
block_alpha.copy_from_host(alpha_host.data());
block_beta.copy_from_host(beta_host.data());
problem_sizes.reset(options.groups);
if (options.shuffle) {
std::vector<LayoutB_Reordered> layout_B_reordered_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
auto shape_B = cute::make_shape(cute::get<1>(options.problem_sizes_host[i]), cute::get<2>(options.problem_sizes_host[i]), Int<1>{});
auto layout_B = make_layout(shape_B, stride_B_host.at(i));
// Repeat the reorder layout atom to tile the whole tensor shape
layout_B_reordered_host[i] = tile_to_shape(LayoutAtomQuant{}, shape_B);
cutlass::reorder_tensor(block_B_modified.get() + offset_B.at(i), layout_B, layout_B_reordered_host[i]);
if (i == 0) {
print("Quantized tensor layout: ");
print(layout_B_reordered_host[0]);
print("\n");
}
}
layout_B_reordered.reset(options.groups);
layout_B_reordered.copy_from_host(layout_B_reordered_host.data());
}
// Reverse MN -> NM for SwapAB
for (int32_t i = 0; i < options.groups; ++i) {
auto [M, N, K] = options.problem_sizes_host[i];
options.problem_sizes_host[i] = make_tuple(N, M, K);
}
problem_sizes.copy_from_host(options.problem_sizes_host.data());
}
/// Populates a Gemm::Arguments structure from the given commandline options
template <typename Gemm>
typename Gemm::Arguments args_from_options(Options const& options, bool host_problem_shapes_available = true)
{
using Args = typename Gemm::Arguments;
auto&& dB = [&]() {
if constexpr (cute::is_same_v<Gemm, GemmShuffled>) { // offline swizzling is enabled.
return layout_B_reordered.get();
}
else {
return stride_B.get();
}
}();
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
Args arguments;
decltype(arguments.epilogue.thread) fusion_args;
if (options.alpha != FLT_MAX && options.beta != FLT_MAX) {
// If both alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches.
fusion_args.alpha = options.alpha;
fusion_args.beta = options.beta;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = nullptr;
fusion_args.beta_ptr_array = nullptr;
// Single alpha and beta for all groups
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 0};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 0};
}
else {
// If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups.
fusion_args.alpha = 0;
fusion_args.beta = 0;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = alpha_device.get();
fusion_args.beta_ptr_array = beta_device.get();
// One alpha and beta per group
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 1};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 1};
}
arguments = Args {
cutlass::gemm::GemmUniversalMode::kGrouped,
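// Group count, device-side problem shapes, and host-side problem shapes (not provided here, hence nullptr)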
{options.groups, problem_sizes.get(), nullptr},
{ptr_B.get(), dB, ptr_A.get(), stride_A.get(), ptr_scale_packed.get(), stride_S.get(), options.k},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info
};
return arguments;
}
bool verify(Options const& options) {
bool passed = true;
constexpr bool IsFP8Input = cute::is_same_v<MmaType, cutlass::float_e4m3_t> || cute::is_same_v<MmaType, cutlass::float_e5m2_t>;
using FP8Sched = cute::conditional_t<size<0>(TileShape{}) == 64, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>;
using ScheduleRef = cute::conditional_t<IsFP8Input, FP8Sched, cutlass::gemm::collective::KernelScheduleAuto>;
using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaType, LayoutA, AlignmentA,
MmaType, LayoutB, AlignmentB,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAuto,
ScheduleRef
>::CollectiveOp;
using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD,
cutlass::epilogue::NoSmemWarpSpecialized
>::CollectiveOp;
using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int>, // Indicates ProblemShape
CollectiveMainloopRef,
CollectiveEpilogueRef
>;
using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
using StrideA_verif = typename GemmRef::GemmKernel::StrideA;
using StrideB_verif = typename GemmRef::GemmKernel::StrideB;
using StrideC_verif = typename GemmRef::GemmKernel::StrideC;
using StrideD_verif = typename GemmRef::GemmKernel::StrideD;
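// Tolerances for the relative comparison below: epsilon bounds the relative error,
// and values smaller than non_zero_floor are treated as zero.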
const ElementD epsilon(1e-2f);
const ElementD non_zero_floor(1e-4f);
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto N = get<0>(problem);
auto M = get<1>(problem);
auto K = get<2>(problem);
if (M == 0) {
continue;
}
else {
StrideA_verif stride_A_verif;
StrideB_verif stride_B_verif;
stride_A_verif = cutlass::make_cute_packed_stride(StrideA_verif{}, cute::make_shape(M, K, 1));
stride_B_verif = cutlass::make_cute_packed_stride(StrideB_verif{}, cute::make_shape(N, K, 1));
const int scale_k = 1;
auto layout_B = make_layout(cute::make_shape(N, K, Int<1>{}), stride_B_host.at(i));
auto layout_scale_zero = make_layout(cute::make_shape(N, scale_k, Int<1>{}), stride_S_host_ref.at(i));
cudaStream_t stream = cudaStreamDefault;
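// Dequantize this group's B with its scales (and zeros) so the reference GEMM below runs entirely in MmaType.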
cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.k, stream);
//
// Compute reference output
//
typename GemmRef::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{M, N, K},
{block_A.get() + offset_A.at(i), stride_A_verif, block_B_dq.get() + offset_B_dq.at(i), stride_B_verif},
{{alpha_host.at(i), beta_host.at(i)}, block_C.get() + offset_C.at(i), stride_C_host_ref.at(i), block_ref_D.get() + offset_D.at(i), stride_D_host_ref.at(i)}
};
// Run the gemm where the scaling is performed outside of the kernel.
GemmRef gemm_ref;
size_t workspace_size = GemmRef::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
CUTLASS_CHECK(gemm_ref.can_implement(arguments));
CUTLASS_CHECK(gemm_ref.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm_ref.run());
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor);
std::cout << "Group: " << i << " Status: " << passed << std::endl;
}
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options, bool host_problem_shapes_available = true)
{
allocate(options);
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<Gemm>(options, host_problem_shapes_available);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
std::cout << "We passed all checks\n";
// Check if output from CUTLASS kernel and reference kernel are equal or not
MixedDtypeResult result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
grouped_mixed_dtype_profiling(gemm, options, result, alpha_host, beta_host);
if (!result.passed) {
exit(-1);
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.3 Toolkit or newer to run this example
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) {
std::cerr << "This example requires CUDA 12.3 or newer.\n";
// Return zero so this test passes on older toolkits, where its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
std::cout << "Running in per-column scale mode." << std::endl;
if (options.shuffle) {
std::cout << "Offline shuffle enabled." << std::endl;
run<GemmShuffled>(options, false);
} else {
std::cout << "Offline shuffle disabled." << std::endl;
run<GemmScaleOnly>(options, false);
}
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////


@ -0,0 +1,690 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief
Hopper Mixed-input Grouped GEMM example using CUTLASS 3 APIs for NVIDIA Hopper architecture.
See 55_hopper_mixed_dtype_gemm.cu for more details about Mixed-input GEMMs.
Limitations:
1) Only row-wise scaling is supported. Zero-points and block-wise scaling are currently not supported.
To run this example:
$ ./examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_mixed_dtype_grouped_gemm --m=2048 --n=2048 --k=2048 --mode=1 --groups=10
The above command sizes all 10 groups at the given m, n, and k values.
Omitting any of the problem dimensions randomizes that dimension across the groups.
The same applies to alpha and beta, which are randomized across the groups when not specified.
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <numeric>
#include <typeinfo>
#include <float.h>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/mixed_dtype_utils.hpp"
#include "helper.h"
#include "grouped_mixed_dtype_utils.hpp"
using namespace cute;
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>; // <M,N,K> per group
using MmaType = cutlass::bfloat16_t;
using QuantType = cutlass::float_e5m2_t;
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
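// Size the K-mode of the tile so that one K-slice of MmaType spans 128 bytes:
// for 16-bit bf16, TileShapeK = 128 * 8 / 16 = 64 elements.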
constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value;
// A matrix configuration
using ElementA = MmaType;
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = QuantType; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// This example manually swaps and transposes the A and B operands, so the transposed input layouts are kept
using LayoutA_Transpose = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
using LayoutB_Transpose = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
using ElementZero = cutlass::bfloat16_t;
using ElementScale = cutlass::bfloat16_t;
using LayoutScale = cutlass::layout::RowMajor;
// C/D matrix configuration
using ElementC = cutlass::half_t; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// D matrix configuration
using ElementD = ElementC;
using LayoutD = LayoutC;
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
// Core kernel configurations
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using TileShape = Shape<_128,_16,cute::Int<TileShapeK>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; // Epilogue to launch
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, typename cutlass::layout::LayoutTranspose<LayoutC>::type *, AlignmentC,
ElementD, typename cutlass::layout::LayoutTranspose<LayoutD>::type *, AlignmentD,
EpilogueSchedule
>::CollectiveOp;
using CollectiveMainloopConvertOnly = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementB, LayoutB_Transpose *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelConvertOnly = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopConvertOnly,
CollectiveEpilogue
>;
using GemmConvertOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelConvertOnly>;
// =========================================================== MIXED INPUT WITH SCALES ===========================================================================
// The Scale information must get paired with the operand that will be scaled. In this example, B is scaled so we make a tuple of B's information and the scale information.
using CollectiveMainloopScaleOnly = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
cute::tuple<ElementB, ElementScale>, LayoutB_Transpose *, AlignmentB,
ElementA, LayoutA_Transpose *, AlignmentA,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloopScaleOnly,
CollectiveEpilogue
>;
using GemmScaleOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnly>;
using StrideA = typename GemmConvertOnly::GemmKernel::InternalStrideA;
using StrideB = typename GemmConvertOnly::GemmKernel::InternalStrideB;
using StrideC = typename GemmConvertOnly::GemmKernel::InternalStrideC;
using StrideD = typename GemmConvertOnly::GemmKernel::InternalStrideD;
using StrideC_ref = cutlass::detail::TagToStrideC_t<LayoutC>;
using StrideD_ref = cutlass::detail::TagToStrideC_t<LayoutD>;
using StrideS = typename CollectiveMainloopScaleOnly::StrideScale;
using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>;
// Host-side allocations
std::vector<int64_t> offset_A;
std::vector<int64_t> offset_B;
std::vector<int64_t> offset_B_dq;
std::vector<int64_t> offset_C;
std::vector<int64_t> offset_D;
std::vector<int64_t> offset_scale;
std::vector<int64_t> offset_zero;
std::vector<StrideA> stride_A_host;
std::vector<StrideB> stride_B_host;
std::vector<StrideC> stride_C_host;
std::vector<StrideD> stride_D_host;
std::vector<StrideC_ref> stride_C_host_ref;
std::vector<StrideD_ref> stride_D_host_ref;
std::vector<StrideS> stride_S_host;
std::vector<StrideS_ref> stride_S_host_ref;
std::vector<ElementAccumulator> alpha_host;
std::vector<ElementAccumulator> beta_host;
uint64_t seed = 2020;
// Device-side allocations
cutlass::DeviceAllocation<typename ProblemShape::UnderlyingProblemShape> problem_sizes;
cutlass::DeviceAllocation<MmaType> block_A;
cutlass::DeviceAllocation<QuantType> block_B;
cutlass::DeviceAllocation<MmaType> block_B_dq;
cutlass::DeviceAllocation<ElementScale> block_scale;
cutlass::DeviceAllocation<ElementZero> block_zero;
cutlass::DeviceAllocation<ElementC> block_C;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput> block_ref_D;
cutlass::DeviceAllocation<const MmaType *> ptr_A;
cutlass::DeviceAllocation<const QuantType *> ptr_B;
cutlass::DeviceAllocation<const MmaType *> ptr_B_dq;
cutlass::DeviceAllocation<const ElementScale *> ptr_scale;
cutlass::DeviceAllocation<const ElementZero *> ptr_zero;
cutlass::DeviceAllocation<const ElementC *> ptr_C;
cutlass::DeviceAllocation<typename GemmConvertOnly::EpilogueOutputOp::ElementOutput *> ptr_D;
cutlass::DeviceAllocation<StrideA> stride_A;
cutlass::DeviceAllocation<StrideB> stride_B;
cutlass::DeviceAllocation<StrideC> stride_C;
cutlass::DeviceAllocation<StrideD> stride_D;
cutlass::DeviceAllocation<StrideC_ref> stride_C_ref;
cutlass::DeviceAllocation<StrideD_ref> stride_D_ref;
cutlass::DeviceAllocation<StrideS_ref> stride_S_ref;
cutlass::DeviceAllocation<StrideS> stride_S;
// Note: these are arrays of pointers to the per-group alpha and beta scaling values
cutlass::DeviceAllocation<ElementAccumulator*> alpha_device;
cutlass::DeviceAllocation<ElementAccumulator*> beta_device;
cutlass::DeviceAllocation<ElementAccumulator> block_alpha;
cutlass::DeviceAllocation<ElementAccumulator> block_beta;
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
using Options = GroupedMixedDtypeOptions<QuantType>;
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Allocates device-side data
void allocate(Options const& options) {
int64_t total_elements_A = 0;
int64_t total_elements_B = 0;
int64_t total_elements_B_dq = 0;
int64_t total_elements_C = 0;
int64_t total_elements_D = 0;
int64_t total_elements_scale = 0;
int64_t total_elements_zero = 0;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
const int scale_k = 1;
offset_A.push_back(total_elements_A);
offset_B.push_back(total_elements_B * cutlass::sizeof_bits<QuantType>::value / 8);
offset_B_dq.push_back(total_elements_B_dq);
offset_C.push_back(total_elements_C);
offset_D.push_back(total_elements_D);
offset_scale.push_back(total_elements_scale);
offset_zero.push_back(total_elements_zero);
int64_t elements_A = M * K;
int64_t elements_B = K * N ;
int64_t elements_B_dq = K * N;
int64_t elements_C = M * N;
int64_t elements_D = M * N;
int64_t elements_scale = scale_k * N;
int64_t elements_zero = scale_k * N;
total_elements_A += elements_A;
total_elements_B += elements_B;
total_elements_B_dq += elements_B_dq;
total_elements_C += elements_C;
total_elements_D += elements_D;
total_elements_scale += elements_scale;
total_elements_zero += elements_zero;
stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}));
stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {N, M, 1}));
stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {N, M, 1}));
stride_C_host_ref.push_back(cutlass::make_cute_packed_stride(StrideC_ref{}, {M, N, 1}));
stride_D_host_ref.push_back(cutlass::make_cute_packed_stride(StrideD_ref{}, {M, N, 1}));
stride_S_host_ref.push_back(cutlass::make_cute_packed_stride(StrideS_ref{}, {N, scale_k, 1}));
stride_S_host.push_back(cutlass::make_cute_packed_stride(StrideS{}, {N, scale_k, 1}));
}
block_A.reset(total_elements_A);
block_B.reset(total_elements_B);
block_B_dq.reset(total_elements_B_dq);
block_C.reset(total_elements_C);
block_D.reset(total_elements_D);
block_ref_D.reset(total_elements_D);
block_scale.reset(total_elements_scale);
block_zero.reset(total_elements_zero);
block_alpha.reset(options.groups);
block_beta.reset(options.groups);
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(Options &options) {
uint64_t seed = 2020;
problem_sizes.reset(options.groups);
problem_sizes.copy_from_host(options.problem_sizes_host.data());
//
// Assign pointers
//
std::vector<MmaType *> ptr_A_host(options.groups);
std::vector<QuantType *> ptr_B_host(options.groups);
std::vector<MmaType *> ptr_B_dq_host(options.groups);
std::vector<ElementC *> ptr_C_host(options.groups);
std::vector<ElementC *> ptr_D_host(options.groups);
std::vector<ElementScale *> ptr_scale_host(options.groups);
std::vector<ElementZero *> ptr_zero_host(options.groups);
std::vector<ElementAccumulator *> ptr_alpha_host(options.groups);
std::vector<ElementAccumulator *> ptr_beta_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
ptr_A_host.at(i) = block_A.get() + offset_A.at(i);
ptr_B_host.at(i) = block_B.get() + offset_B.at(i);
ptr_B_dq_host.at(i) = block_B_dq.get() + offset_B_dq.at(i);
ptr_C_host.at(i) = block_C.get() + offset_C.at(i);
ptr_D_host.at(i) = block_D.get() + offset_D.at(i);
ptr_scale_host.at(i) = block_scale.get() + offset_scale.at(i);
ptr_zero_host.at(i) = block_zero.get() + offset_zero.at(i);
alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast<ElementAccumulator>((rand() % 5) + 1) : options.alpha);
beta_host.push_back((options.beta == FLT_MAX) ? static_cast<ElementAccumulator>(rand() % 5) : options.beta);
ptr_alpha_host.at(i) = block_alpha.get() + i;
ptr_beta_host.at(i) = block_beta.get() + i;
}
ptr_A.reset(options.groups);
ptr_A.copy_from_host(ptr_A_host.data());
ptr_B.reset(options.groups);
ptr_B.copy_from_host(ptr_B_host.data());
ptr_B_dq.reset(options.groups);
ptr_B_dq.copy_from_host(ptr_B_dq_host.data());
ptr_C.reset(options.groups);
ptr_C.copy_from_host(ptr_C_host.data());
ptr_D.reset(options.groups);
ptr_D.copy_from_host(ptr_D_host.data());
ptr_scale.reset(options.groups);
ptr_scale.copy_from_host(ptr_scale_host.data());
ptr_zero.reset(options.groups);
ptr_zero.copy_from_host(ptr_zero_host.data());
stride_A.reset(options.groups);
stride_A.copy_from_host(stride_A_host.data());
stride_B.reset(options.groups);
stride_B.copy_from_host(stride_B_host.data());
stride_C.reset(options.groups);
stride_C.copy_from_host(stride_C_host.data());
stride_D.reset(options.groups);
stride_D.copy_from_host(stride_D_host.data());
stride_C_ref.reset(options.groups);
stride_C_ref.copy_from_host(stride_C_host_ref.data());
stride_D_ref.reset(options.groups);
stride_D_ref.copy_from_host(stride_D_host_ref.data());
stride_S_ref.reset(options.groups);
stride_S_ref.copy_from_host(stride_S_host_ref.data());
stride_S.reset(options.groups);
stride_S.copy_from_host(stride_S_host.data());
alpha_device.reset(options.groups);
alpha_device.copy_from_host(ptr_alpha_host.data());
beta_device.reset(options.groups);
beta_device.copy_from_host(ptr_beta_host.data());
initialize_tensor(block_A, seed + 2023);
initialize_quant_tensor(block_B, seed + 2022);
initialize_tensor(block_C, seed + 2021);
initialize_scale(block_scale, options);
initialize_zero(block_zero, options);
block_alpha.copy_from_host(alpha_host.data());
block_beta.copy_from_host(beta_host.data());
problem_sizes.reset(options.groups);
// Reverse MN -> NM for SwapAB
for (int32_t i = 0; i < options.groups; ++i) {
auto [M, N, K] = options.problem_sizes_host[i];
options.problem_sizes_host[i] = make_tuple(N, M, K);
}
problem_sizes.copy_from_host(options.problem_sizes_host.data());
}
/// Populates a Gemm::Arguments structure from the given commandline options
template <typename Gemm>
typename Gemm::Arguments args_from_options(Options const& options, bool host_problem_shapes_available = true)
{
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
typename Gemm::Arguments arguments;
decltype(arguments.epilogue.thread) fusion_args;
if (options.alpha != FLT_MAX && options.beta != FLT_MAX) {
// If both alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches.
fusion_args.alpha = options.alpha;
fusion_args.beta = options.beta;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = nullptr;
fusion_args.beta_ptr_array = nullptr;
// Single alpha and beta for all groups
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 0};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 0};
}
else {
// If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups.
fusion_args.alpha = 0;
fusion_args.beta = 0;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
fusion_args.alpha_ptr_array = alpha_device.get();
fusion_args.beta_ptr_array = beta_device.get();
// One alpha and beta per group
fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 1};
fusion_args.dBeta = {cute::_0{}, cute::_0{}, 1};
}
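// The mainloop arguments depend on the kernel's conversion mode: DirectConvert takes only the
// B/A pointers and strides, while ConvertAndScale additionally takes the scale pointers, the scale
// strides, and the scale chunk size along K (options.k here).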
if constexpr (Gemm::CollectiveMainloop::KernelConversionMode == Gemm::CollectiveMainloop::ConversionMode::DirectConvert) {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_B.get(), stride_B.get(), ptr_A.get(), stride_A.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info
};
}
else if constexpr (Gemm::CollectiveMainloop::KernelConversionMode == Gemm::CollectiveMainloop::ConversionMode::ConvertAndScale) {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_B.get(), stride_B.get(), ptr_A.get(), stride_A.get(), ptr_scale.get(), stride_S.get(), options.k},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info
};
}
else {
std::cerr << "Invalid mode " << options.mode << ". Must be 0, 1 or 2." << std::endl;
exit(-1);
}
return arguments;
}
bool verify(Options const& options) {
bool passed = true;
constexpr bool IsFP8Input = cute::is_same_v<MmaType, cutlass::float_e4m3_t> || cute::is_same_v<MmaType, cutlass::float_e5m2_t>;
using FP8Sched = cute::conditional_t<size<0>(TileShape{}) == 64, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>;
using ScheduleRef = cute::conditional_t<IsFP8Input, FP8Sched, cutlass::gemm::collective::KernelScheduleAuto>;
using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaType, LayoutA, AlignmentA,
MmaType, LayoutB, AlignmentB,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAuto,
ScheduleRef
>::CollectiveOp;
using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD,
cutlass::epilogue::NoSmemWarpSpecialized
>::CollectiveOp;
using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int>, // Indicates ProblemShape
CollectiveMainloopRef,
CollectiveEpilogueRef
>;
using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
using StrideA_verif = typename GemmRef::GemmKernel::StrideA;
using StrideB_verif = typename GemmRef::GemmKernel::StrideB;
using StrideC_verif = typename GemmRef::GemmKernel::StrideC;
using StrideD_verif = typename GemmRef::GemmKernel::StrideD;
const ElementD epsilon(1e-2f);
const ElementD non_zero_floor(1e-4f);
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto N = get<0>(problem);
auto M = get<1>(problem);
auto K = get<2>(problem);
if (M == 0) {
continue;
}
else {
StrideA_verif stride_A_verif;
StrideB_verif stride_B_verif;
stride_A_verif = cutlass::make_cute_packed_stride(StrideA_verif{}, cute::make_shape(M, K, 1));
stride_B_verif = cutlass::make_cute_packed_stride(StrideB_verif{}, cute::make_shape(N, K, 1));
const int scale_k = 1;
auto layout_B = make_layout(cute::make_shape(N, K, Int<1>{}), stride_B_host.at(i));
auto layout_scale_zero = make_layout(cute::make_shape(N, scale_k, Int<1>{}), stride_S_host_ref.at(i));
cudaStream_t stream = cudaStreamDefault;
cutlass::dequantize(block_B_dq.get() + offset_B_dq.at(i), block_B.get() + offset_B.at(i), layout_B, block_scale.get() + offset_scale.at(i), block_zero.get() + offset_zero.at(i), layout_scale_zero, options.k, stream);
//
// Compute reference output
//
typename GemmRef::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{M, N, K},
{block_A.get() + offset_A.at(i), stride_A_verif, block_B_dq.get() + offset_B_dq.at(i), stride_B_verif},
{{alpha_host.at(i), beta_host.at(i)}, block_C.get() + offset_C.at(i), stride_C_host_ref.at(i), block_ref_D.get() + offset_D.at(i), stride_D_host_ref.at(i)}
};
// Run the gemm where the scaling is performed outside of the kernel.
GemmRef gemm_ref;
size_t workspace_size = GemmRef::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
CUTLASS_CHECK(gemm_ref.can_implement(arguments));
CUTLASS_CHECK(gemm_ref.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm_ref.run());
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
passed &= cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N, epsilon, non_zero_floor);
std::cout << "Group: " << i << " Status: " << passed << std::endl;
}
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options, bool host_problem_shapes_available = true)
{
allocate(options);
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<Gemm>(options, host_problem_shapes_available);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
std::cout << "We passed all checks\n";
// Check if output from CUTLASS kernel and reference kernel are equal or not
MixedDtypeResult result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
grouped_mixed_dtype_profiling(gemm, options, result, alpha_host, beta_host);
if (!result.passed) {
exit(-1);
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.3 Toolkit (or newer) to run this example.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) {
std::cerr << "This example requires CUDA 12.3 or newer.\n";
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
if (options.mode == MixedDtypeGemmMode::ConvertOnly) {
std::cout << "Running in no scale mode." << std::endl;
run<GemmConvertOnly>(options, false);
}
else if (options.mode == MixedDtypeGemmMode::ScaleOnly) {
std::cout << "Running in group scale mode." << std::endl;
run<GemmScaleOnly>(options, false);
}
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////


@ -0,0 +1,112 @@
# Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Note that we set --iterations=0 for most of the tests below to disable performance benchmarking;
# those commands run only the correctness check. The TEST_EPILOGUE_OP* and TEST_RANDOM_PERF*
# configurations keep a small nonzero iteration count.
set(TEST_RANDOM --iterations=0) # Random problem sizes
set(TEST_RANDOM_LARGE_GROUP --groups=100 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE --alpha=0.5 --beta=0.5 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE_LARGE_GROUP --alpha=2.0 --beta=2.0 --groups=100 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE_OP --beta=0.5 --iterations=1) # Random problem sizes
set(TEST_EPILOGUE_OP_LARGE_GROUP --alpha=0.25 --iterations=1) # Random problem sizes
set(TEST_FIXED --m=2048 --n=5120 --k=8192 --groups=16 --iterations=0) # Fixed problem sizes
set(TEST_FIXED_LARGE_GROUP --m=2048 --n=512 --k=512 --groups=100 --iterations=0) # Fixed problem sizes
set(TEST_SMALL --m=256 --n=128 --iterations=0) # Small problem sizes
set(TEST_SMALL_LARGE_GROUP --m=128 --n=128 --groups=100 --iterations=0) # Small problem sizes
set(TEST_RANDOM_PERF --iterations=10) # Random problem sizes
set(TEST_RANDOM_PERF_LARGE_GROUP --groups=100 --iterations=10) # Random problem sizes
set(TEST_DIRECT_BATCHED --m=2048 --n=5120 --k=8192 --mode=0 --iterations=0) # Direct conversion
set(TEST_SCALE_PERCOL --m=4096 --n=5120 --k=8192 --c=8192 --mode=1 --iterations=0) # Per Column scaling
cutlass_example_add_executable(
69_hopper_mixed_dtype_grouped_gemm
69_hopper_mixed_dtype_grouped_gemm.cu
TEST_COMMAND_OPTIONS
TEST_RANDOM
TEST_RANDOM_LARGE_GROUP
TEST_EPILOGUE
TEST_EPILOGUE_LARGE_GROUP
TEST_EPILOGUE_OP
TEST_EPILOGUE_OP_LARGE_GROUP
TEST_FIXED
TEST_FIXED_LARGE_GROUP
TEST_SMALL
TEST_SMALL_LARGE_GROUP
TEST_RANDOM_PERF
TEST_RANDOM_PERF_LARGE_GROUP
TEST_DIRECT_BATCHED
TEST_SCALE_PERCOL
)
cutlass_example_add_executable(
69_hopper_int4_fp8_grouped_gemm
69_hopper_int4_fp8_grouped_gemm.cu
TEST_COMMAND_OPTIONS
TEST_RANDOM
TEST_RANDOM_LARGE_GROUP
TEST_EPILOGUE
TEST_EPILOGUE_LARGE_GROUP
TEST_EPILOGUE_OP
TEST_EPILOGUE_OP_LARGE_GROUP
TEST_FIXED
TEST_FIXED_LARGE_GROUP
TEST_SMALL
TEST_SMALL_LARGE_GROUP
TEST_RANDOM_PERF
TEST_RANDOM_PERF_LARGE_GROUP
TEST_DIRECT_BATCHED
TEST_SCALE_PERCOL
)
cutlass_example_add_executable(
69_hopper_int4_bf16_grouped_gemm
69_hopper_int4_bf16_grouped_gemm.cu
TEST_COMMAND_OPTIONS
TEST_RANDOM
TEST_RANDOM_LARGE_GROUP
TEST_EPILOGUE
TEST_EPILOGUE_LARGE_GROUP
TEST_EPILOGUE_OP
TEST_EPILOGUE_OP_LARGE_GROUP
TEST_FIXED
TEST_FIXED_LARGE_GROUP
TEST_SMALL
TEST_SMALL_LARGE_GROUP
TEST_RANDOM_PERF
TEST_RANDOM_PERF_LARGE_GROUP
TEST_DIRECT_BATCHED
TEST_SCALE_PERCOL
)


@ -0,0 +1,14 @@
This example extends Example 55 to support Grouped GEMMs in CUTLASS.
## High level overview
This example shows how to perform Grouped GEMMs on Hopper when A and B have different types. In a Grouped GEMM, multiple GEMMs with potentially different problem shapes can be executed in a single batch. The interface is similar to the standard mixed-input GEMM presented in Example 55, with a few noteworthy differences:
- inside the collective builder, replace the layout types with layout pointer types.
- in the arguments, pass the number of groups, the array of problem sizes, and the arrays of strides for matrices A and B.
- if scales and zero-points are included, also pass the arrays of their strides in the arguments (a sketch of these changes follows this list).
Note that in Example 55, the argument `--g` determines the block scale size. It is important not to confuse this with the `--groups` argument in this example, which specifies the number of GEMMs.
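Below is a minimal, illustrative sketch of these two changes. The type and variable names (`ElementA`, `ptr_A_array`, `stride_A_array`, and so on) are placeholders rather than the exact names used in this example; consult `args_from_options` in `69_hopper_mixed_dtype_grouped_gemm.cu` for the precise argument layout of the mixed-dtype mainloop.
```cpp
// (1) Collective builder: the layout types become layout *pointer* types.
using CollectiveMainloopGrouped = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA *, AlignmentA,   // Example 55 passes LayoutA (a value type) here
    ElementB, LayoutB *, AlignmentB,   // Example 55 passes LayoutB (a value type) here
    ElementAccumulator,
    TileShape, ClusterShape,
    StageCount, KernelSchedule
  >::CollectiveOp;

// (2) Arguments: pass the group count, the array of problem shapes, and per-group
//     pointer/stride arrays instead of a single pointer and stride per operand.
typename Gemm::Arguments arguments{
  cutlass::gemm::GemmUniversalMode::kGrouped,
  {groups, problem_sizes_device, problem_sizes_host},
  {ptr_A_array, stride_A_array, ptr_B_array, stride_B_array
   /* plus scale/zero-point pointer and stride arrays when scaling is enabled */},
  {{alpha, beta}, ptr_C_array, stride_C_array, ptr_D_array, stride_D_array}
};
```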
## Upcoming features
Currently, the Mixed-input Grouped GEMM only supports row-wise scaling. Please contact us if zero-points or block-wise scaling are needed.


@ -0,0 +1,194 @@
/***************************************************************************************************
* Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
#include <algorithm>  // std::min
#include <fstream>
#include <numeric>    // std::accumulate
#include <stdexcept>
#include <string>     // std::string, std::stoi
#include <vector>
#include "../55_hopper_mixed_dtype_gemm/mixed_dtype_utils.hpp"
template<class QuantType>
class GroupedMixedDtypeOptions : public MixedDtypeOptions {
public:
using ProblemShape = cutlass::gemm::GroupProblemShape<cute::Shape<int,int,int>>;
using UnderlyingProblemShape = typename ProblemShape::UnderlyingProblemShape;
int groups = 6;
int c = 512;
std::string benchmark_path;
std::vector<UnderlyingProblemShape> problem_sizes_host;
GroupedMixedDtypeOptions() : MixedDtypeOptions()
{
m = 1024;
n = 2048;
k = 512;
};
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
cmd.get_cmd_line_argument("groups", groups);
cmd.get_cmd_line_argument("c", c);
// Parse the benchmark file path here so that load_benchmark_problems() below is reachable.
cmd.get_cmd_line_argument("benchmark", benchmark_path);
MixedDtypeOptions::parse(argc, args);
problem_sizes_host = benchmark_path.empty() ? randomize_problems(cmd) : load_benchmark_problems();
}
std::ostream& print_usage(std::ostream& out) const {
out << "69_hopper_mixed_dtype_grouped_gemm\n\n"
<< "Options:\n"
<< " --help Display this usage statement\n"
<< " --m=<int> Sets the M extent of the GEMM for all groups\n"
<< " --n=<int> Sets the N extent of the GEMM for all groups\n"
<< " --k=<int> Sets the K extent of the GEMM for all groups\n"
<< " --groups=<int> Sets the number of individual GEMM problems\n"
<< " --mode=<int> The mode to run the gemm\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n"
<< " --iterations=<int> Number of profiling iterations\n"
<< " --warmup=<int> Number of warmup iterations\n"
<< " --benchmark=<str> Executes a benchmark problem size\n";
return out;
}
double gflops(double runtime_s) const {
uint64_t fmas = std::accumulate(problem_sizes_host.begin(), problem_sizes_host.end(), 0ULL,
[](uint64_t sum, const UnderlyingProblemShape& problem) {
return sum + static_cast<uint64_t>(cute::get<0>(problem)) *
static_cast<uint64_t>(cute::get<1>(problem)) *
static_cast<uint64_t>(cute::get<2>(problem));
});
return (2.0 * fmas) / (runtime_s * 1e9);
}
private:
static constexpr int tma_alignment_bits = 128;
const int alignment = tma_alignment_bits / cutlass::sizeof_bits<QuantType>::value;
std::vector<UnderlyingProblemShape> randomize_problems(cutlass::CommandLine& cmd) {
std::vector<UnderlyingProblemShape> problems;
problems.reserve(groups);
int cmd_line_m = -1, cmd_line_n = -1, cmd_line_k = -1;
cmd.get_cmd_line_argument("m", cmd_line_m);
cmd.get_cmd_line_argument("n", cmd_line_n);
cmd.get_cmd_line_argument("k", cmd_line_k);
for (int i = 0; i < groups; ++i) {
int m = (cmd_line_m >= 0) ? cmd_line_m : alignment * ((rand() % 64) + 1);
int n = (cmd_line_n >= 0) ? cmd_line_n : this->n;
int k = (cmd_line_k >= 0) ? cmd_line_k : this->k;
if (k % alignment != 0) {
throw std::runtime_error("Error: k dimension must be a multiple of " + std::to_string(alignment));
}
problems.push_back({m, n, k});
}
return problems;
}
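// Note: the benchmark file parsed below is expected to list one problem per entry in the form
// "<index> <M>x<N>x<K>" (for example: "0 1024x2048x4096"). Extents are rounded up to the TMA
// alignment, and parsing stops at a negative index or an empty extent string.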
std::vector<UnderlyingProblemShape> load_benchmark_problems() {
std::ifstream file(benchmark_path);
if (!file) {
throw std::runtime_error("Failed to open benchmark file: " + benchmark_path);
}
std::vector<UnderlyingProblemShape> problems;
int idx;
std::string extent_str;
while (file >> idx >> extent_str) {
if (idx < 0 || extent_str.empty()) break;
std::vector<std::string> tokens;
cutlass::CommandLine::tokenize(tokens, extent_str, 'x');
cutlass::gemm::GemmCoord extent;
for (int i = 0; i < std::min(3, static_cast<int>(tokens.size())); ++i) {
int x = std::stoi(tokens[i]);
extent.at(i) = (x % alignment) ? x + (alignment - (x % alignment)) : x;
}
if (extent.product()) {
problems.push_back({extent.m(), extent.n(), extent.k()});
}
}
groups = static_cast<int>(problems.size());
return problems;
}
};
template <class QuantType, class Gemm, class ElementAccumulator>
void grouped_mixed_dtype_profiling(
Gemm& gemm,
const GroupedMixedDtypeOptions<QuantType>& options,
MixedDtypeResult& result,
const std::vector<ElementAccumulator>& alpha_host,
const std::vector<ElementAccumulator>& beta_host) {
if (options.iterations <= 0) return;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
std::vector<float> runtimes;
runtimes.reserve(options.iterations);
for (int iter = 0; iter < options.warmup + options.iterations; ++iter) {
cudaEventRecord(start);
CUTLASS_CHECK(gemm.run());
cudaEventRecord(stop);
cudaEventSynchronize(stop);
if (iter >= options.warmup) {
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
runtimes.push_back(milliseconds);
}
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
result.avg_runtime_ms = std::accumulate(runtimes.begin(), runtimes.end(), 0.0f) / runtimes.size();
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Sizes, Alpha, Beta\n";
for (int32_t i = 0; i < options.groups; ++i) {
std::cout << " " << options.problem_sizes_host[i] << ", " << alpha_host[i] << ", " << beta_host[i] << '\n';
}
std::cout << " Groups : " << options.groups << '\n'
<< " Avg runtime : " << result.avg_runtime_ms << " ms\n"
<< " GFLOPS : " << result.gflops << '\n';
}


@ -0,0 +1,479 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief An FP16 dense GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS.
This example demonstrates the minimal set of changes needed to transition from a Hopper CUTLASS 3.x
GEMM kernel (see example 48_hopper_warp_specialized_gemm) to a Blackwell SM100 CUTLASS 3.x GEMM kernel.
The Blackwell SM100 CUTLASS kernel makes use of the following Blackwell SM100 features:
1. A new series of Tensor Core MMA instructions (tcgen05) introduced on the Blackwell architecture (sm100a),
which have 2x the throughput of the Hopper Tensor Core MMA instructions (WGMMA).
Note that Hopper WGMMA Tensor Core MMA instructions are not compatible with Blackwell (see https://docs.nvidia.com/cuda/parallel-thread-execution).
2. A new per-SM memory called Tensor Memory (TMEM) introduced on the Blackwell architecture (sm100a).
Blackwell SM100 Tensor Core MMA instructions store their accumulation results in TMEM instead of the
register file. (Please refer to the CUDA 12.8 docs at https://docs.nvidia.com/cuda/.)
3. An extended flavor of the warp-specialized kernel design introduced in Hopper, enabled by the use of TMEM,
which allows the execution of the MMA and the epilogue to be decoupled into separate warps.
4. A new SW-controlled dynamic scheduler based on cluster launch control (see https://docs.nvidia.com/cuda/parallel-thread-execution).
Usage:
$ ./examples/70_blackwell_gemm/70_blackwell_fp16_gemm --m=8192 --n=8192 --k=8192
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
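// Relative to the Hopper kernel in example 48, the transition below amounts to:
//   (1) targeting cutlass::arch::Sm100 instead of cutlass::arch::Sm90,
//   (2) picking an MMA tile and cluster shape suited to tcgen05 (here 256x128x64 on a 2x2 cluster), and
//   (3) leaving the mainloop/epilogue schedules on their *Auto settings so the collective builders
//       select TMEM-aware Blackwell kernels.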
// A matrix configuration
using ElementA = half_t; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = half_t; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementC = float; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
// MMA and Cluster Tile Shapes
// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0
using MmaTileShape_MNK = Shape<_256,_128,_64>;
// Shape of the threadblocks in a cluster
using ClusterShape_MNK = Shape<_2,_2,_1>;
// Build the epilogue
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementC, LayoutC, AlignmentC,
cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;
// Build the mainloop
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::gemm::collective::KernelScheduleAuto
>::CollectiveOp;
// Compose into a kernel
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int, int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
void>; // Default to ClusterLaunchControl (CLC) based tile scheduler
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Reference device GEMM implementation type
using DeviceGemmReference = cutlass::reference::device::Gemm<
ElementA,
LayoutA,
ElementB,
LayoutB,
ElementC,
LayoutC,
ElementAccumulator,
ElementAccumulator>;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = typename Gemm::GemmKernel::StrideD;
//
// Data members
//
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
Options():
help(false),
m(8192), n(8192), k(8192),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "70_blackwell_fp16_gemm\n\n"
<< " Blackwell FP16 GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "70_blackwell_fp16_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
block_A.reset(options.m * options.k);
block_B.reset(options.k * options.n);
block_C.reset(options.m * options.n);
block_D.reset(options.m * options.n);
block_ref_D.reset(options.m * options.n);
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
/// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options)
{
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{block_A.get(), stride_A, block_B.get(), stride_B},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
return arguments;
}
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), Gemm::LayoutA::packed({options.m, options.k}));
cutlass::TensorRef ref_B(block_B.get(), Gemm::LayoutB::packed({options.k, options.n}));
cutlass::TensorRef ref_C(block_C.get(), Gemm::LayoutC::packed({options.m, options.n}));
cutlass::TensorRef ref_D(block_ref_D.get(), Gemm::LayoutD::packed({options.m, options.n}));
//
// Compute reference output
//
// Create instantiation for device reference gemm kernel
DeviceGemmReference gemm_reference;
// Launch device reference gemm kernel
gemm_reference(
{options.m, options.n, options.k},
ElementAccumulator(options.alpha),
ref_A,
ref_B,
ElementAccumulator(options.beta),
ref_C,
ref_D);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit (or newer) to run this example
// and the device must have compute capability 100a.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || props.minor != 0) {
std::cerr << "This example requires a GPU with compute capability 100a)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Gemm>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////


@ -0,0 +1,667 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief An FP8 dense GEMM example for the NVIDIA Blackwell SM100 architecture using CUTLASS.
This example demonstrates the minimal set of changes needed to transition from a Hopper CUTLASS 3.x
FP8 GEMM kernel (see example 54_hopper_fp8_warp_specialized_gemm) to a Blackwell SM100 FP8 GEMM kernel.
This example shows all of the important fusions used by FP8 GEMM kernels,
i.e., per-tensor scale factors for the A, B, C, and D tensors and the abs_max value of the D tensor.
The Blackwell SM100 CUTLASS kernel makes use of the following Blackwell SM100 features:
1. A new series of Tensor Core MMA instructions (tcgen05) introduced on the Blackwell architecture (sm100a),
which have 2x the throughput of the Hopper Tensor Core MMA instructions (WGMMA).
Note that Hopper WGMMA Tensor Core MMA instructions are not compatible with Blackwell (see https://docs.nvidia.com/cuda/parallel-thread-execution).
2. A new per-SM memory called Tensor Memory (TMEM) introduced on the Blackwell architecture (sm100a).
Blackwell SM100 Tensor Core MMA instructions store their accumulation results in TMEM instead of the
register file. (Please refer to the CUDA 12.8 docs at https://docs.nvidia.com/cuda/.)
3. An extended flavor of the warp-specialized kernel design introduced in Hopper, enabled by the use of TMEM,
which allows the execution of the MMA and the epilogue to be decoupled into separate warps.
4. A new SW-controlled dynamic scheduler based on cluster launch control (see https://docs.nvidia.com/cuda/parallel-thread-execution).
Usage:
$ ./examples/70_blackwell_gemm/70_blackwell_fp8_gemm --m=8192 --n=8192 --k=8192
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/activation.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/dispatch_policy.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementC = cutlass::float_e4m3_t; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
using ElementD = ElementC;
using LayoutD = LayoutC;
constexpr int AlignmentD = AlignmentC;
// MMA type
using ElementAccumulator = float;
// Epilogue types
using ElementBias = cutlass::half_t;
using ElementCompute = float;
using ElementAux = ElementC;
using LayoutAux = LayoutC;
using ElementAmax = float;
// MMA and Cluster Tile Shapes
// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape %2 == 0
using MmaTileShape_MNK = Shape<_256,_128,_64>;
// Shape of the threadblocks in a cluster
using ClusterShape_MNK = Shape<_2,_2,_1>;
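// Epilogue fusion for this example: a scaled linear combination (alpha/beta together with the
// per-tensor A/B/C/D scale factors), an optional per-row bias, a ReLU activation, an abs_max
// (amax) reduction of the FP8 outputs, and an optional auxiliary output tensor.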
using FusionOp = cutlass::epilogue::fusion::ScaledLinCombPerRowBiasEltActAmaxAux<
LayoutC, cutlass::epilogue::thread::ReLU, ElementD, ElementCompute, ElementAux, ElementAmax, ElementBias>;
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutC, AlignmentD,
cutlass::epilogue::collective::EpilogueScheduleAuto,
FusionOp
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::gemm::collective::KernelScheduleAuto
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveMainloop,
CollectiveEpilogue,
void>; // Default to ClusterLaunchControl (CLC) based tile scheduler
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Extract information from Gemm kernel.
using EpilogueOutputOp = typename Gemm::EpilogueOutputOp;
using ElementScalar = typename EpilogueOutputOp::ElementScalar;
using ElementAmax = typename EpilogueOutputOp::ElementAmax;
using ActivationFunctor = typename EpilogueOutputOp::ActivationFn;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = typename Gemm::GemmKernel::StrideD;
using StrideAux = StrideC;
constexpr bool IsDFp8 =
cute::is_same_v<ElementD, cutlass::float_e4m3_t> or
cute::is_same_v<ElementD, cutlass::float_e5m2_t>;
constexpr bool IsAuxFp8 =
cute::is_same_v<ElementAux, cutlass::float_e4m3_t> or
cute::is_same_v<ElementAux, cutlass::float_e5m2_t>;
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
StrideAux stride_aux;
uint64_t seed;
cutlass::HostTensor<ElementA , LayoutA > tensor_A;
cutlass::HostTensor<ElementB , LayoutB > tensor_B;
cutlass::HostTensor<ElementC , LayoutC > tensor_C;
cutlass::HostTensor<ElementD , LayoutD > tensor_D;
cutlass::HostTensor<ElementD , LayoutD > tensor_ref_D;
cutlass::HostTensor<ElementAux, LayoutAux> tensor_aux;
cutlass::HostTensor<ElementAux, LayoutAux> tensor_ref_aux;
using LayoutScalar = cutlass::layout::PackedVectorLayout;
cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_alpha;
cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_beta;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_A;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_B;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_C;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_D;
cutlass::HostTensor<ElementScalar, LayoutScalar> scale_aux;
cutlass::HostTensor<ElementAmax , LayoutScalar> abs_max_D;
cutlass::HostTensor<ElementAmax , LayoutScalar> reference_abs_max_D;
cutlass::HostTensor<ElementAmax , LayoutScalar> abs_max_aux;
cutlass::HostTensor<ElementAmax , LayoutScalar> reference_abs_max_aux;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help = false;
float alpha = 1.f, beta = 0.f;
float scale_a = 1.f, scale_b = 1.f, scale_c = 1.f, scale_d = 1.f, scale_aux = 1.f;
bool device_scale = false;
bool save_aux = true;
bool save_amax = true;
int iterations = 1000;
int m = 1024, n = 512, k = 1024, l = 1;
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("l", l);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("scale_a", scale_a, 1.f);
cmd.get_cmd_line_argument("scale_b", scale_b, 1.f);
cmd.get_cmd_line_argument("scale_c", scale_c, 1.f);
cmd.get_cmd_line_argument("scale_d", scale_d, 1.f);
cmd.get_cmd_line_argument("scale_aux", scale_aux, 1.f);
cmd.get_cmd_line_argument("device_scale", device_scale, false);
cmd.get_cmd_line_argument("save_aux", save_aux, true);
cmd.get_cmd_line_argument("save_amax", save_amax, true);
cmd.get_cmd_line_argument("iterations", iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "70_blackwell_fp8_gemm\n\n"
<< " Blackwell FP8 GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --l=<int> Sets the l extent (batch) of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n"
<< " --scale_a=<f32> Scaling factor for A\n"
<< " --scale_b=<f32> Scaling factor for B\n"
<< " --scale_c=<f32> Scaling factor for C\n"
<< " --scale_d=<f32> Scaling factor for D (ignored for non-fp8 D)\n"
<< " --scale_aux=<f32> Scaling factor for the auxiliary tensor (ignored for non-fp8 aux)\n"
<< " --device_scale=<bool> Copy scalars to device memory before kernel launch (default: false)\n"
<< " --save_aux=<bool> Save the pre-activation as an auxiliary tensor (default: true)\n"
<< " --save_amax=<bool> Save the pre-scaled max absolute value of any fp8 outputs (aux and/or D) (default: true)\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "70_blackwell_fp8_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_tensor(
cutlass::TensorView<Element, Layout> view,
uint64_t seed) {
double scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
int bits_output = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = 2;
scope_min = 0;
}
else if (bits_input <= 8) {
scope_max = 2;
scope_min = -2;
}
else if (bits_output == 16) {
scope_max = 5;
scope_min = -5;
}
else {
scope_max = 8;
scope_min = -8;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l));
stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.m, options.n, options.l));
stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l));
stride_aux = stride_D;
auto a_coord = cutlass::make_Coord(options.m * options.l, options.k);
auto c_coord = cutlass::make_Coord(options.m * options.l, options.n);
auto b_coord = cutlass::make_Coord(options.k, options.n * options.l);
tensor_A.resize(a_coord);
tensor_B.resize(b_coord);
tensor_C.resize(c_coord);
tensor_D.resize(c_coord);
tensor_ref_D.resize(c_coord);
initialize_tensor(tensor_A.host_view(), seed + 2022);
initialize_tensor(tensor_B.host_view(), seed + 2023);
initialize_tensor(tensor_C.host_view(), seed + 2024);
tensor_A.sync_device();
tensor_B.sync_device();
tensor_C.sync_device();
tensor_D.sync_device();
if (options.save_aux) {
tensor_aux.resize(c_coord);
tensor_aux.sync_device();
tensor_ref_aux.resize(c_coord);
}
if (options.device_scale) {
scalar_alpha.resize(cutlass::make_Coord(1));
scalar_beta.resize(cutlass::make_Coord(1));
scale_A.resize(cutlass::make_Coord(1));
scale_B.resize(cutlass::make_Coord(1));
scale_C.resize(cutlass::make_Coord(1));
scale_D.resize(cutlass::make_Coord(1));
scale_aux.resize(cutlass::make_Coord(1));
cutlass::reference::host::TensorFill(scalar_alpha.host_view(), options.alpha);
cutlass::reference::host::TensorFill(scalar_beta.host_view(), options.beta);
cutlass::reference::host::TensorFill(scale_A.host_view(), options.scale_a);
cutlass::reference::host::TensorFill(scale_B.host_view(), options.scale_b);
cutlass::reference::host::TensorFill(scale_C.host_view(), options.scale_c);
cutlass::reference::host::TensorFill(scale_D.host_view(), options.scale_d);
cutlass::reference::host::TensorFill(scale_aux.host_view(), options.scale_aux);
scalar_alpha.sync_device();
scalar_beta.sync_device();
scale_A.sync_device();
scale_B.sync_device();
scale_C.sync_device();
scale_D.sync_device();
scale_aux.sync_device();
}
if (IsDFp8 && options.save_amax) {
abs_max_D.resize(cutlass::make_Coord(1));
abs_max_D.sync_device();
reference_abs_max_D.resize(cutlass::make_Coord(1));
}
if (IsAuxFp8 && options.save_aux && options.save_amax) {
abs_max_aux.resize(cutlass::make_Coord(1));
abs_max_aux.sync_device();
reference_abs_max_aux.resize(cutlass::make_Coord(1));
}
}
/// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options)
{
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, options.l},
{tensor_A.device_data(), stride_A, tensor_B.device_data(), stride_B},
{
{}, // epilogue.thread
tensor_C.device_data(), stride_C,
tensor_D.device_data(), stride_D
}
};
auto &fusion_args = arguments.epilogue.thread;
fusion_args.alpha = options.alpha;
fusion_args.beta = options.beta;
fusion_args.alpha_ptr = scalar_alpha.device_data();
fusion_args.beta_ptr = scalar_beta.device_data();
fusion_args.scale_a = options.scale_a;
fusion_args.scale_b = options.scale_b;
fusion_args.scale_c = options.scale_c;
fusion_args.scale_a_ptr = scale_A.device_data();
fusion_args.scale_b_ptr = scale_B.device_data();
fusion_args.scale_c_ptr = scale_C.device_data();
// ignored if tensor types are not fp8
fusion_args.scale_d = options.scale_d;
fusion_args.scale_aux = options.scale_aux;
fusion_args.scale_d_ptr = scale_D.device_data();
fusion_args.scale_aux_ptr = scale_aux.device_data();
// leaving/setting these as nullptr disables the fusion at runtime
fusion_args.bias_ptr = nullptr;
if (options.save_aux) {
fusion_args.aux_ptr = tensor_aux.device_data();
fusion_args.dAux = stride_aux;
if (options.save_amax) {
fusion_args.amax_aux_ptr = abs_max_aux.device_data();
}
}
if (options.save_amax) {
fusion_args.amax_D_ptr = abs_max_D.device_data();
}
return arguments;
}
bool verify(const Options &options) {
//
// Compute reference output
//
// Create instantiation for device reference gemm kernel
auto A = cute::make_tensor(tensor_A.host_data(),
cute::make_layout(cute::make_shape(options.m, options.k, options.l), stride_A));
auto B = cute::make_tensor(tensor_B.host_data(),
cute::make_layout(cute::make_shape(options.n, options.k, options.l), stride_B));
auto C = cute::make_tensor(tensor_C.host_data(),
cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_C));
auto D = cute::make_tensor(tensor_ref_D.host_data(),
cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_D));
auto Aux = cute::make_tensor(tensor_ref_aux.host_data(),
cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_aux));
using unused_t = decltype(D);
cutlass::reference::host::GettMainloopParams<ElementAccumulator, decltype(A), decltype(B)> mainloop_params{A, B};
cutlass::reference::host::GettEpilogueParams<
ElementScalar,
ElementScalar,
ElementAccumulator,
ElementCompute,
decltype(C),
decltype(D),
unused_t, // bias
decltype(Aux),
unused_t, // valpha
unused_t, // vbeta
ActivationFunctor
> epilogue_params;
epilogue_params.C = C;
epilogue_params.D = D;
epilogue_params.Aux = Aux;
epilogue_params.alpha = options.alpha;
epilogue_params.beta = options.beta;
epilogue_params.scale_a = options.scale_a;
epilogue_params.scale_b = options.scale_b;
epilogue_params.scale_c = options.scale_c;
epilogue_params.scale_d = options.scale_d;
epilogue_params.scale_aux = options.scale_aux;
epilogue_params.abs_max_D = reference_abs_max_D.host_data();
epilogue_params.abs_max_Aux = reference_abs_max_aux.host_data();
// get reference result
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
// compare_reference
tensor_D.sync_host();
bool passed = cutlass::reference::host::TensorEquals(tensor_ref_D.host_view(), tensor_D.host_view());
if (IsDFp8 && options.save_amax) {
abs_max_D.sync_host();
passed &= abs_max_D.at(cutlass::make_Coord(0)) == reference_abs_max_D.at(cutlass::make_Coord(0));
}
if (options.save_aux) {
tensor_aux.sync_host();
passed &= cutlass::reference::host::TensorEquals(tensor_ref_aux.host_view(), tensor_aux.host_view());
if (IsAuxFp8 && options.save_amax) {
abs_max_aux.sync_host();
passed &= abs_max_aux.at(cutlass::make_Coord(0)) == reference_abs_max_aux.at(cutlass::make_Coord(0));
}
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit (or newer) to run this example
// and the device must have compute capability 100a.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer.\n";
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || props.minor != 0) {
std::cerr << "This example requires a GPU with compute capability 100a)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Run
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Gemm>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////


@ -0,0 +1,41 @@
# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
70_blackwell_fp16_gemm
70_blackwell_fp16_gemm.cu
)
cutlass_example_add_executable(
70_blackwell_fp8_gemm
70_blackwell_fp8_gemm.cu
)
endif()


@ -0,0 +1,566 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules
and epilogue visitor tree (EVT) construction
Example usage:
$ ./examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder \
--m=2048 --n=2048 --k=2048 --l=2
*/
#include <iostream>
#include "cute/tensor.hpp"
#include "cutlass/cutlass.h"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm_complex.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
using namespace cute;
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Command line options parsing
struct Options {
bool help;
bool error;
int m, n, k, l;
float alpha, beta;
Options():
help(false),
error(false),
m(2048), n(2048), k(2048), l(1),
alpha(1.f), beta(0.f)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m, 2048);
cmd.get_cmd_line_argument("n", n, 2048);
cmd.get_cmd_line_argument("k", k, 2048);
cmd.get_cmd_line_argument("l", l, 1);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "71_blackwell_gemm_with_collective_builder\n\n"
<< " This example showcases the use of CUTLASS's collective operation builders to easily construct\n"
<< " performant kernels targeting NVIDIA's Blackwell architecture.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --l=<int> Sets the L extent (batch count) of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n";
return out;
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = 2;
scope_min = 0;
} else if (bits_input <= 8) {
scope_max = 2;
scope_min = -2;
} else {
scope_max = 8;
scope_min = -8;
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
// Wrapper to construct, run, and verify a GEMM. This example showcases CUTLASS's collective
// operation builders by specializing the GEMM on the kernel+epilogue schedule it will use and the
// number of pipeline stages.
template <
// Type of kernel schedule to generate
class MainloopScheduleType = cutlass::gemm::collective::KernelScheduleAuto,
// Type of epilogue schedule to generate
class EpilogueScheduleType = cutlass::epilogue::collective::EpilogueScheduleAuto,
// Number of pipeline stages to use
class StageCountType = cutlass::gemm::collective::StageCountAuto,
// Do we use custom epilogue visitor tree (EVT) fusion
bool UseCustomEVT = false
>
struct ExampleRunner {
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::ColumnMajor;
using LayoutD = cutlass::layout::ColumnMajor;
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;
using ElementC = cutlass::half_t;
using ElementD = cutlass::half_t;
using ElementAccumulator = float;
using ElementCompute = float;
using ElementScalar = float;
using ClusterShapeMNK = Shape<_2,_2,_1>;
static constexpr bool Use2SmMma =
// Manually specified 2sm cluster MMA schedule, will error if cluster M is not a multiple of 2
std::is_same_v<MainloopScheduleType, cutlass::gemm::KernelTmaWarpSpecialized2SmSm100> ||
// Auto schedule will try to select 2sm cluster MMA based on cluster M
(std::is_same_v<MainloopScheduleType, cutlass::gemm::collective::KernelScheduleAuto> && size<0>(ClusterShapeMNK{}) % 2 == 0);
// The MMA tile used by the mainloop collective. Blackwell 1sm MMA supports up to MMA tile M = 128, 2sm MMA supports up to MMA tile M = 256
using MmaTileMNK = std::conditional_t<Use2SmMma, Shape<_256,_128,_64>, Shape<_128,_128,_64>>;
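// Illustrative compile-time sanity check (a sketch, assuming the 1sm/2sm MMA tile M limits described above):
static_assert(static_cast<int>(cute::size<0>(MmaTileMNK{})) <= (Use2SmMma ? 256 : 128), "MMA tile M exceeds what the selected 1sm/2sm MMA supports");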
// 16B alignment lets us use TMA
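// For the 16-bit half_t operands used here, 128 bits / 16 bits per element = 8 elements, i.e. one 16B access.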
static constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;
static constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;
static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;
static constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
static constexpr auto RoundStyle = cutlass::FloatRoundStyle::round_to_nearest;
// Blackwell fusions for the most part use the same EVT nodes used in Hopper. Most Blackwell EVTs will alias to their Hopper counterparts.
// EVT nodes new to Blackwell mainly relate to narrow precision scale factor generation and are contained in include/cutlass/epilogue/fusion/sm100_visitor_*.hpp
// See include/cutlass/epilogue/fusion/sm100_callbacks_tma_warpspecialized.hpp for EVT construction using these new nodes
// Fusions relating to narrow-precision scale factor generation are demonstrated in example 72b and can only be used in Blackwell kernels
using CustomEVT = // alpha * acc + beta * C
cutlass::epilogue::fusion::Sm90EVT<cutlass::epilogue::fusion::Sm90Compute<cutlass::homogeneous_multiply_add, ElementD, ElementCompute, RoundStyle>, // beta * C + (alpha * acc)
cutlass::epilogue::fusion::Sm90ScalarBroadcast<ElementScalar>, // beta
cutlass::epilogue::fusion::Sm90SrcFetch<ElementC>, // C
cutlass::epilogue::fusion::Sm90EVT<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, ElementCompute, ElementCompute, RoundStyle>, // alpha * acc
cutlass::epilogue::fusion::Sm90ScalarBroadcast<ElementScalar>, // alpha
cutlass::epilogue::fusion::Sm90AccFetch // acc
>
>;
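// The nested epilogue argument initializer in run() mirrors this tree: ternary(beta, C, binary(alpha, acc)), one braced level per EVT node.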
// As in Hopper, a predefined set of fusion operations are provided in include/cutlass/epilogue/fusion/operations.hpp and can be passed to the epilogue builder
// Fusions operations supported by the Hopper TMA epilogue will also be supported by the Blackwell TMA epilogue
// Fusions relating to narrow-precision scale factor generation are demonstrated in example 72b and can only be used in Blackwell kernels
using DefaultOperation = cutlass::epilogue::fusion::LinearCombination<ElementD, ElementCompute, ElementC, ElementScalar, RoundStyle>;
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
MmaTileMNK, ClusterShapeMNK,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute,
ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD,
EpilogueScheduleType,
cute::conditional_t<UseCustomEVT, CustomEVT, DefaultOperation>
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
MmaTileMNK, ClusterShapeMNK,
cute::conditional_t<cute::is_same_v<StageCountType, cutlass::gemm::collective::StageCountAuto>,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
StageCountType>,
MainloopScheduleType
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveMainloop,
CollectiveEpilogue
>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
using ProblemShapeType = typename Gemm::GemmKernel::ProblemShape;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = typename Gemm::GemmKernel::StrideD;
using LayoutTagA = cutlass::gemm::detail::StrideToLayoutTagA_t<StrideA>;
using LayoutTagB = cutlass::gemm::detail::StrideToLayoutTagB_t<StrideB>;
using LayoutTagC = cutlass::gemm::detail::StrideToLayoutTagC_t<StrideC>;
using LayoutTagD = cutlass::gemm::detail::StrideToLayoutTagC_t<StrideD>;
//
// Data members
//
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
uint64_t seed = 0;
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
cutlass::DeviceAllocation<typename Gemm::ElementD> block_D;
cutlass::DeviceAllocation<typename Gemm::ElementD> block_ref_D;
//
// Methods
//
bool verify(const ProblemShapeType& problem_size, float alpha, float beta) {
auto [M, N, K, L] = problem_size;
cutlass::TensorRef ref_A(block_A.get(), Gemm::LayoutA::packed({M, K}));
cutlass::TensorRef ref_B(block_B.get(), Gemm::LayoutB::packed({K, N}));
cutlass::TensorRef ref_C(block_C.get(), Gemm::LayoutC::packed({M, N}));
cutlass::TensorRef ref_D(block_ref_D.get(), Gemm::LayoutD::packed({M, N}));
cutlass::reference::device::GemmComplex(
{M, N, K},
ElementScalar(alpha),
ref_A,
cutlass::ComplexTransform::kNone,
ref_B,
cutlass::ComplexTransform::kNone,
ElementScalar(beta),
ref_C,
ref_D,
ElementAccumulator(0),
L, // batch_count
M * K, // batch_stride_A
K * N, // batch_stride_B
M * N, // batch_stride_C
M * N // batch_stride_D
);
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Reference kernel failed. Last CUDA error: "
<< cudaGetErrorString(result) << std::endl;
return false;
}
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const ProblemShapeType& problem_size) {
auto problem_shape_MNKL = cute::append<4>(problem_size, 1);
auto [M, N, K, L] = problem_shape_MNKL;
stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(M, K, L));
stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(N, K, L));
stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(M, N, L));
stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(M, N, L));
block_A.reset(M * K * L);
block_B.reset(K * N * L);
block_C.reset(M * N * L);
block_D.reset(M * N * L);
block_ref_D.reset(M * N * L);
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
bool run(const Options& options, const cutlass::KernelHardwareInfo& hw_info) {
ProblemShapeType problem_size = ProblemShapeType{options.m, options.n, options.k, options.l};
initialize(problem_size);
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
problem_size,
{block_A.get(), stride_A, block_B.get(), stride_B},
{{}, // epilogue.thread
block_C.get(), stride_C, block_D.get(), stride_D},
hw_info
};
// See example 48 for details on custom EVT construction
if constexpr (UseCustomEVT) {
arguments.epilogue.thread =
{ // ternary op : beta * C + (alpha * acc)
{{options.beta}}, // leaf op+args : beta
{}, // leaf op+args : C
{ // binary op : alpha * acc
{{options.alpha}}, // leaf op+args : alpha
{}, // leaf op+args : acc
{} // binary args : multiplies
}, // end binary op
{} // ternary args : multiply_add
}; // end ternary op
}
// Pre-defined fusions have flat, named args for user-friendliness
else {
arguments.epilogue.thread.alpha = options.alpha;
arguments.epilogue.thread.beta = options.beta;
}
Gemm gemm_op;
size_t workspace_size = Gemm::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
cutlass::Status status = gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
std::cerr << "This kernel is not supported. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return false;
}
status = gemm_op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to initialize the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return false;
}
// Run the GEMM
status = gemm_op.run();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return false;
}
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(result) << std::endl;
return false;
}
// Verify that the result is correct
bool passed = verify(problem_size, options.alpha, options.beta);
if (!passed) {
std::cerr << "Reference check failed" << std::endl;
}
return passed;
}
};
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to print a description of the example run and its result
void print_result(const std::string& description, bool passed) {
std::cout << description << ": " << (passed ? "Passed" : "Failed") << std::endl;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
cudaDeviceProp props;
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (error != cudaSuccess) {
std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl;
return -1;
}
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
if (!(props.major == 10 && props.minor == 0)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
if (options.error) {
std::cerr << "Aborting execution." << std::endl;
return -1;
}
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
//
// Run examples
//
// The KernelHardwareInfo struct holds the number of SMs on the GPU with a given device ID. This
// information is used by the underlying kernel.
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
bool passed;
// Auto mainloop and epilogue schedules must be used together to guarantee functionality
ExampleRunner<> runner_0;
passed = runner_0.run(options, hw_info);
print_result("KernelScheduleAuto mainloop schedule with EpilogueScheduleAuto epilogue schedule", passed);
// Mainloop stage counts can be specified manually
// It is the user's responsibility to ensure there is enough device smem to allocate manual stage counts
ExampleRunner<
cutlass::gemm::collective::KernelScheduleAuto,
cutlass::epilogue::collective::EpilogueScheduleAuto,
_3> runner_1;
passed = runner_1.run(options, hw_info);
print_result("KernelScheduleAuto mainloop schedule with EpilogueScheduleAuto epilogue schedule and 3 mainloop stages", passed);
// 1SM cluster MMA mainloop schedules can be used with direct store ("no-smem") epilogue schedules
ExampleRunner<cutlass::gemm::KernelTmaWarpSpecialized1SmSm100, cutlass::epilogue::NoSmemWarpSpecialized1Sm> runner_2;
passed = runner_2.run(options, hw_info);
print_result("KernelTmaWarpSpecialized1SmSm100 mainloop schedule with NoSmemWarpSpecialized1Sm epilogue schedule", passed);
// 1SM cluster MMA mainloop schedules can also be used with 1SM TMA epilogue schedules
// 1SM cluster MMA mainloop schedules will not work with 2SM TMA epilogue schedules
ExampleRunner<cutlass::gemm::KernelTmaWarpSpecialized1SmSm100, cutlass::epilogue::TmaWarpSpecialized1Sm> runner_3;
passed = runner_3.run(options, hw_info);
print_result("KernelTmaWarpSpecialized1SmSm100 mainloop schedule with TmaWarpSpecialized1Sm epilogue schedule", passed);
// 2SM cluster MMA mainloop schedules can be used with direct store ("no-smem") epilogue schedules
ExampleRunner<cutlass::gemm::KernelTmaWarpSpecialized2SmSm100, cutlass::epilogue::NoSmemWarpSpecialized2Sm> runner_4;
passed = runner_4.run(options, hw_info);
print_result("KernelTmaWarpSpecialized2SmSm100 mainloop schedule with NoSmemWarpSpecialized2Sm epilogue schedule", passed);
// 2SM cluster MMA mainloop schedules can also be used with 2SM TMA epilogue schedules
// 2SM cluster MMA mainloop schedules will not work with 1SM TMA epilogue schedules
ExampleRunner<cutlass::gemm::KernelTmaWarpSpecialized2SmSm100, cutlass::epilogue::TmaWarpSpecialized2Sm> runner_5;
passed = runner_5.run(options, hw_info);
print_result("KernelTmaWarpSpecialized2SmSm100 mainloop schedule with TmaWarpSpecialized2Sm epilogue schedule", passed);
// Blackwell Auto schedule supports custom EVT fusions
constexpr bool UseCustomEVT = true;
ExampleRunner<
cutlass::gemm::collective::KernelScheduleAuto,
cutlass::epilogue::collective::EpilogueScheduleAuto,
cutlass::gemm::collective::StageCountAuto,
UseCustomEVT> runner_6;
passed = runner_6.run(options, hw_info);
print_result("KernelScheduleAuto mainloop schedule with EpilogueScheduleAuto epilogue schedule and custom EVT", passed);
// 1SM TMA epilogue schedules support custom EVT fusions
ExampleRunner<
cutlass::gemm::KernelTmaWarpSpecialized1SmSm100,
cutlass::epilogue::TmaWarpSpecialized1Sm,
cutlass::gemm::collective::StageCountAuto,
UseCustomEVT> runner_7;
passed = runner_7.run(options, hw_info);
print_result("KernelTmaWarpSpecialized1SmSm100 mainloop schedule with TmaWarpSpecialized1Sm epilogue and custom EVT", passed);
// 2SM TMA epilogue schedules support custom EVT fusions
ExampleRunner<
cutlass::gemm::KernelTmaWarpSpecialized2SmSm100,
cutlass::epilogue::TmaWarpSpecialized2Sm,
cutlass::gemm::collective::StageCountAuto,
UseCustomEVT> runner_8;
passed = runner_8.run(options, hw_info);
print_result("KernelTmaWarpSpecialized2SmSm100 mainloop schedule with TmaWarpSpecialized2Sm epilogue and custom EVT", passed);
// Blackwell direct store epilogue schedule supports custom EVTs and named fusion operations as well (not supported for pre-Blackwell kernels)
ExampleRunner<
cutlass::gemm::KernelTmaWarpSpecialized1SmSm100,
cutlass::epilogue::NoSmemWarpSpecialized1Sm,
cutlass::gemm::collective::StageCountAuto,
UseCustomEVT> runner_9;
passed = runner_9.run(options, hw_info);
print_result("KernelTmaWarpSpecialized1SmSm100 mainloop schedule with NoSmemWarpSpecialized1Sm epilogue and custom EVT", passed);
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@@ -0,0 +1,35 @@
# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Both the target name and the source filename are kept short to avoid MAX_PATH issues on Windows.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
71_blackwell_gemm_with_collective_builder
71_blackwell_gemm_with_collective_builder.cu
)
endif()

View File

@@ -0,0 +1,543 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture.
This example demonstrates a simple way to instantiate and run a blockscaled NVFP4 GEMM on the NVIDIA Blackwell SM100 architecture.
The Blackwell SM100 CUTLASS kernel uses the new Block Scaled Tensor Core MMA Instructions (tcgen05.mma.blockscaled) introduced
on the Blackwell architecture (sm100a) which have 2x throughput compared to fp8 Tensor Core MMA instructions (tcgen05.mma)
and 4x throughput compared to fp8 Hopper Tensor Core MMA Instructions (WGMMA) (See https://docs.nvidia.com/cuda/parallel-thread-execution).
Similar to 70_blackwell_gemm, this kernel leverages:
1. Per-SM memory called Tensor Memory (TMEM) (Please refer to CUDA 12.8 docs on https://docs.nvidia.com/cuda/).
2. The extended warp-specialized kernel design introduced in Hopper enabled by use of TMEM
which allows us to decouple the execution of MMA and epilogue into separate warps.
3. A new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution).
Usage:
$ ./examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm --m=2048 --n=2048 --k=2048
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/detail/sm100_blockscaled_layout.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>; // Element type for A matrix operand
using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 32; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>; // Element type for B matrix operand
using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 32; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
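// nv_float4_t<> pairs the 4-bit e2m1 data type with a per-block scale factor type; the SFA/SFB host tensors below are allocated with ElementA::ScaleFactorType and ElementB::ScaleFactorType.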
// C/D matrix configuration
using ElementD = cutlass::bfloat16_t; // Element type for D matrix operand
using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand
using LayoutCTag = cutlass::layout::RowMajor; // Layout type for C matrix operand
using LayoutDTag = cutlass::layout::RowMajor; // Layout type for D matrix operand
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes)
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
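// With 4-bit A/B operands, 32 elements form one 16B access; for the 16-bit C/D operands, 128 / 16 = 8 elements likewise give a 16B access.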
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; // Operator class tag
// Kernel Perf config
using MmaTileShape = Shape<_256,_256,_256>; // MMA's tile size
using ClusterShape = Shape<_4,_4,_1>; // Shape of the threadblocks in a cluster
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutCTag, AlignmentC,
ElementD, LayoutDTag, AlignmentD,
cutlass::epilogue::collective::EpilogueScheduleAuto // Epilogue schedule policy
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutATag, AlignmentA,
ElementB, LayoutBTag, AlignmentB,
ElementAccumulator,
MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::gemm::collective::KernelScheduleAuto // Kernel schedule policy. Auto or using targeted scheduling policy
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
void>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Reference device GEMM implementation type
using StrideA = typename Gemm::GemmKernel::StrideA;
using LayoutA = decltype(cute::make_layout(make_shape(0,0,0), StrideA{}));
using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride.
using StrideB = typename Gemm::GemmKernel::StrideB;
using LayoutB = decltype(cute::make_layout(make_shape(0,0,0), StrideB{}));
using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride.
using StrideC = typename Gemm::GemmKernel::StrideC;
using LayoutC = decltype(cute::make_layout(make_shape(0,0,0), StrideC{}));
using StrideD = typename Gemm::GemmKernel::StrideD;
using LayoutD = decltype(cute::make_layout(make_shape(0,0,0), StrideD{}));
//
// Data members
//
/// Initialization
StrideA stride_A;
LayoutA layout_A;
LayoutSFA layout_SFA;
StrideB stride_B;
LayoutB layout_B;
LayoutSFB layout_SFB;
StrideC stride_C;
LayoutC layout_C;
StrideD stride_D;
LayoutD layout_D;
uint64_t seed;
// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device
// Use cute::Tensor and cute::Layout for iterating through the matrix elements
cutlass::HostTensor<ElementA::DataType, cutlass::layout::PackedVectorLayout> block_A;
cutlass::HostTensor<ElementA::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFA;
cutlass::HostTensor<ElementB::DataType, cutlass::layout::PackedVectorLayout> block_B;
cutlass::HostTensor<ElementB::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFB;
cutlass::HostTensor<ElementC, cutlass::layout::PackedVectorLayout> block_C;
// Output Tensor
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_D;
// Reference Output Tensor
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_reference_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
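// Sub-byte element types such as the 4-bit float_e2m1_t cannot be dereferenced through a raw pointer, so the helper below wraps them in cute::subbyte_iterator for host-side reference checking.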
template <typename T>
auto make_iterator(T* ptr) {
using namespace cute;
if constexpr (cute::is_subbyte_v<T>) {
return subbyte_iterator<T>(ptr);
}
else {
return ptr;
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
Options():
help(false),
m(1024), n(1024), k(1024),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "72a_blackwell_nvfp4_bf16_gemm\n\n"
<< " Blackwell NVFP4 GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out << "\n\nExamples:\n\n"
<< "$ " << "./examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_block(
cutlass::TensorView<Element, Layout> view,
uint64_t seed) {
double scope_max, scope_min;
constexpr int bits_input = cutlass::sizeof_bits<Element>::value;
if constexpr (bits_input == 1) {
scope_max = 2;
scope_min = 0;
}
else if constexpr (bits_input <= 6) {
scope_max = 2;
scope_min = -2;
}
else if constexpr (bits_input <= 8) {
if constexpr (cute::is_same_v<Element, cutlass::float_ue8m0_t>) {
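// ue8m0 scale factors are unsigned power-of-two values, so keep them small and strictly positive.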
scope_max = 4;
scope_min = 1;
}
else {
scope_max = 1;
scope_min = -1;
}
}
else{
scope_max = 4;
scope_min = -4;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
using namespace cute;
// For SFA and SFB tensors layouts
using Sm100BlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm100BlkScaledConfig;
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
layout_A = make_layout(make_shape(options.m, options.k, 1), stride_A);
layout_B = make_layout(make_shape(options.n, options.k, 1), stride_B);
layout_C = make_layout(make_shape(options.m, options.n, 1), stride_C);
layout_D = make_layout(make_shape(options.m, options.n, 1), stride_D);
layout_SFA = Sm100BlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(options.m, options.n, options.k, 1));
layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(options.m, options.n, options.k, 1));
block_A.reset(cutlass::make_Coord(size(layout_A)));
block_B.reset(cutlass::make_Coord(size(layout_B)));
block_C.reset(cutlass::make_Coord(size(layout_C)));
block_D.reset(cutlass::make_Coord(size(layout_D)));
block_reference_D.reset(cutlass::make_Coord(size(layout_D)));
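// Scale factor layouts contain zero-stride (broadcast) modes; filter_zeros() collapses them so size() gives the compact number of scale factor elements to allocate.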
block_SFA.reset(cutlass::make_Coord(size(filter_zeros(layout_SFA))));
block_SFB.reset(cutlass::make_Coord(size(filter_zeros(layout_SFB))));
initialize_block(block_A.host_view(), seed + 2021);
initialize_block(block_B.host_view(), seed + 2022);
initialize_block(block_C.host_view(), seed + 2023);
initialize_block(block_SFA.host_view(), seed + 2024);
initialize_block(block_SFB.host_view(), seed + 2025);
block_A.sync_device();
block_B.sync_device();
block_C.sync_device();
block_SFA.sync_device();
block_SFB.sync_device();
}
// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options)
{
typename Gemm::Arguments arguments {
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{ // Mainloop arguments
block_A.device_data(), stride_A,
block_B.device_data(), stride_B,
block_SFA.device_data(), layout_SFA,
block_SFB.device_data(), layout_SFB
},
{ // Epilogue arguments
{options.alpha, options.beta},
block_C.device_data(), stride_C,
block_D.device_data(), stride_D
}
};
return arguments;
}
bool verify(const Options &options) {
using namespace cute;
// Create the arguments for host reference implementation
Tensor tensor_A = make_tensor(make_iterator(block_A.host_data()), layout_A);
Tensor tensor_SFA = make_tensor(block_SFA.host_data(), layout_SFA);
Tensor tensor_B = make_tensor(make_iterator(block_B.host_data()), layout_B);
Tensor tensor_SFB = make_tensor(block_SFB.host_data(), layout_SFB);
cutlass::reference::host::GettBlockScalingMainloopParams<
ElementAccumulator, // ElementAccumulator
decltype(tensor_A), // TensorA
decltype(tensor_SFA), // TensorSfA
decltype(tensor_B), // TensorB
decltype(tensor_SFB) // TensorSfB
> mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB};
auto tensor_C = cute::make_tensor(make_iterator(block_C.host_data()), layout_C);
auto tensor_D = cute::make_tensor(make_iterator(block_reference_D.host_data()), layout_D);
cutlass::reference::host::GettBlockScalingEpilogueParams<
ElementAccumulator, // ElementScalar
ElementAccumulator, // ElementAccumulator
ElementAccumulator, // ElementCompute
decltype(tensor_C), // TensorC
decltype(tensor_D) // TensorD
> epilogue_params{options.alpha, options.beta, tensor_C, tensor_D};
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
// Comparison
block_D.sync_host();
bool passed = cutlass::reference::host::TensorEquals(block_reference_D.host_view(), block_D.host_view());
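// The norm checks below guard against a false pass when both output tensors are identically zero.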
passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0);
passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0);
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
cudaDeviceSynchronize();
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.8 or higher Toolkit to run this example
// and must have compute capability at least 100.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits. Its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && props.minor == 0)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Gemm>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@@ -0,0 +1,593 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture.
This example demonstrates a simple way to instantiate and run a blockscaled NVFP4 GEMM on the NVIDIA Blackwell SM100 architecture.
The kernel outputs quantized FP4 values with scale factors that can be the input of another GEMM.
Similar to 72a_blackwell_nvfp4_bf16_gemm, this kernel leverages:
1. Blockscaled tcgen05.mma instructions.
2. Per-SM memory called Tensor Memory (TMEM)
3. The extended warp-specialized kernel design introduced in Hopper enabled by use of TMEM
which allows us to decouple the execution of MMA and epilogue into separate warps.
4. A new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution).
Usage:
$ ./examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm --m=2048 --n=2048 --k=2048
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/detail/sm100_blockscaled_layout.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>; // Element type for A matrix operand
using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 32; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>; // Element type for B matrix operand
using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 32; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementD = cutlass::float_e2m1_t; // Element type for D matrix operand
using ElementSFD = cutlass::float_ue8m0_t; // Element type for SFD matrix operand
using ElementC = float; // Element type for C matrix operand
using LayoutCTag = cutlass::layout::RowMajor; // Layout type for C matrix operand
using LayoutDTag = cutlass::layout::RowMajor; // Layout type for D matrix operand
using LayoutSFDTag = LayoutDTag; // Layout type for SFD should be same as D matrix operand
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes)
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ElementCompute = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; // Operator class tag
// Kernel Perf config
using MmaTileShape = Shape<_128,_128,_256>; // MMA's tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
constexpr int InputSFVectorSize = 16;
constexpr int OutputSFVectorSize = InputSFVectorSize;
// D = alpha * acc + beta * C
// With BlockScaleFactor generation.
using FusionOperation = cutlass::epilogue::fusion::LinCombBlockScaleFactor<
OutputSFVectorSize,
ElementD,
ElementCompute,
ElementSFD, LayoutSFDTag,
ElementC>;
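// Besides D, this fusion writes one SFD scale factor per OutputSFVectorSize output elements and reads a matrix-wide normalization constant; both pointers are set in args_from_options() below.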
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutCTag, AlignmentC,
ElementD, LayoutDTag, AlignmentD,
cutlass::epilogue::collective::EpilogueScheduleAuto, // Epilogue schedule policy
FusionOperation
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutATag, AlignmentA,
ElementB, LayoutBTag, AlignmentB,
ElementAccumulator,
MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::gemm::collective::KernelScheduleAuto // Kernel schedule policy. Auto or using targeted scheduling policy
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int, int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
void>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Reference device GEMM implementation type
using StrideA = typename Gemm::GemmKernel::StrideA;
using LayoutA = decltype(cute::make_layout(make_shape(0,0,0), StrideA{}));
using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride.
using StrideB = typename Gemm::GemmKernel::StrideB;
using LayoutB = decltype(cute::make_layout(make_shape(0,0,0), StrideB{}));
using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; // Scale Factor tensors have an interleaved layout. Bring Layout instead of stride.
using StrideC = typename Gemm::GemmKernel::StrideC;
using LayoutC = decltype(cute::make_layout(make_shape(0,0,0), StrideC{}));
using StrideD = typename Gemm::GemmKernel::StrideD;
using LayoutD = decltype(cute::make_layout(make_shape(0,0,0), StrideD{}));
using FusionOp = typename Gemm::EpilogueOutputOp;
constexpr bool IsBlockScaleSupported = FusionOp::IsBlockScaleSupported;
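// Used below to guard setting the optional SFD and norm-constant pointers in the epilogue arguments.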
using SfdOutputCfg = cutlass::detail::Sm100BlockScaledOutputConfig<OutputSFVectorSize>;
using LayoutSFD = typename SfdOutputCfg::LayoutSF;
//
// Data members
//
/// Initialization
StrideA stride_A;
LayoutA layout_A;
LayoutSFA layout_SFA;
StrideB stride_B;
LayoutB layout_B;
LayoutSFB layout_SFB;
StrideC stride_C;
LayoutC layout_C;
StrideD stride_D;
LayoutD layout_D;
LayoutSFD layout_SFD;
uint64_t seed;
// The HostTensors are only used for allocating memory on host and device, and transferring data between host and device
// Use cute::Tensor and cute::Layout for iterating through the matrix elements
cutlass::HostTensor<ElementA::DataType, cutlass::layout::PackedVectorLayout> block_A;
cutlass::HostTensor<ElementA::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFA;
cutlass::HostTensor<ElementB::DataType, cutlass::layout::PackedVectorLayout> block_B;
cutlass::HostTensor<ElementB::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFB;
cutlass::HostTensor<ElementC, cutlass::layout::PackedVectorLayout> block_C;
// Output Tensors
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_D;
cutlass::HostTensor<ElementSFD, cutlass::layout::PackedVectorLayout> block_SFD;
// Reference Output Tensors
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_reference_D;
cutlass::HostTensor<ElementSFD, cutlass::layout::PackedVectorLayout> block_reference_SFD;
// Matrix-wide normalization constant
cutlass::HostTensor<ElementCompute, cutlass::layout::PackedVectorLayout> block_Normconst;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
template <typename T>
auto make_iterator(T* ptr) {
using namespace cute;
if constexpr (cute::is_subbyte_v<T>) {
return subbyte_iterator<T>(ptr);
}
else {
return ptr;
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
Options():
help(false),
m(1024), n(1024), k(1024),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "72b_blackwell_nvfp4_nvfp4_gemm\n\n"
<< " Blackwell NVFP4 GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out << "\n\nExamples:\n\n"
<< "$ " << "./examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_block(
cutlass::TensorView<Element, Layout> view,
uint64_t seed) {
double scope_max, scope_min;
constexpr int bits_input = cutlass::sizeof_bits<Element>::value;
if constexpr (bits_input == 1) {
scope_max = 2;
scope_min = 0;
}
else if constexpr (bits_input <= 6) {
scope_max = 2;
scope_min = -2;
}
else if constexpr (bits_input <= 8) {
if constexpr (cute::is_same_v<Element, cutlass::float_ue8m0_t>) {
scope_max = 4;
scope_min = 1;
}
else {
scope_max = 1;
scope_min = -1;
}
}
else{
scope_max = 4;
scope_min = -4;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
using namespace cute;
// For SFA and SFB tensors layouts
using Sm100BlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm100BlkScaledConfig;
// For SFD tensor layout
using Sm100BlockScaledOutputConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm100BlkScaledConfig;
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
layout_A = make_layout(make_shape(options.m, options.k, 1), stride_A);
layout_B = make_layout(make_shape(options.n, options.k, 1), stride_B);
layout_C = make_layout(make_shape(options.m, options.n, 1), stride_C);
layout_D = make_layout(make_shape(options.m, options.n, 1), stride_D);
layout_SFA = Sm100BlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(options.m, options.n, options.k, 1));
layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(options.m, options.n, options.k, 1));
layout_SFD = SfdOutputCfg::tile_atom_to_shape_SFD(cute::make_shape(options.m, options.n, options.k, 1));
block_A.reset(cutlass::make_Coord(size(layout_A)));
block_B.reset(cutlass::make_Coord(size(layout_B)));
block_C.reset(cutlass::make_Coord(size(layout_C)));
block_D.reset(cutlass::make_Coord(size(layout_D)));
block_reference_D.reset(cutlass::make_Coord(size(layout_D)));
block_reference_SFD.reset(cutlass::make_Coord(size(filter_zeros(layout_SFD))));
block_Normconst.reset(cutlass::make_Coord(1));
block_SFA.reset(cutlass::make_Coord(size(filter_zeros(layout_SFA))));
block_SFB.reset(cutlass::make_Coord(size(filter_zeros(layout_SFB))));
block_SFD.reset(cutlass::make_Coord(size(filter_zeros(layout_SFD))));
initialize_block(block_A.host_view(), seed + 2021);
initialize_block(block_B.host_view(), seed + 2022);
initialize_block(block_C.host_view(), seed + 2023);
initialize_block(block_SFA.host_view(), seed + 2024);
initialize_block(block_SFB.host_view(), seed + 2025);
block_Normconst.at(cutlass::make_Coord(0)) = 2;
block_A.sync_device();
block_B.sync_device();
block_C.sync_device();
block_D.sync_device();
block_SFA.sync_device();
block_SFB.sync_device();
block_SFD.sync_device();
block_Normconst.sync_device();
}
// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options)
{
typename Gemm::Arguments arguments {
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{ // Mainloop arguments
block_A.device_data(), stride_A,
block_B.device_data(), stride_B,
block_SFA.device_data(), layout_SFA,
block_SFB.device_data(), layout_SFB
},
{ // Epilogue arguments
{ options.alpha, options.beta },
block_C.device_data(), stride_C,
block_D.device_data(), stride_D}
};
if constexpr (IsBlockScaleSupported) {
arguments.epilogue.thread.block_scale_factor_ptr = block_SFD.device_data();
arguments.epilogue.thread.norm_constant_ptr = block_Normconst.device_data();
}
return arguments;
}
bool verify(const Options &options) {
using namespace cute;
// Create the arguments for host reference implementation
Tensor tensor_A = make_tensor(make_iterator(block_A.host_data()), layout_A);
Tensor tensor_SFA = make_tensor(block_SFA.host_data(), layout_SFA);
Tensor tensor_B = make_tensor(make_iterator(block_B.host_data()), layout_B);
Tensor tensor_SFB = make_tensor(block_SFB.host_data(), layout_SFB);
// think about how to simplify the gemm3x interface.
cutlass::reference::host::GettBlockScalingMainloopParams<
ElementAccumulator, // ElementAccumulator
decltype(tensor_A), // TensorA
decltype(tensor_SFA), // TensorSfA
decltype(tensor_B), // TensorB
decltype(tensor_SFB) // TensorSfB
> mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB};
Tensor tensor_C = cute::make_tensor(make_iterator(block_C.host_data()), layout_C);
Tensor tensor_D = cute::make_tensor(make_iterator(block_reference_D.host_data()), layout_D);
Tensor tensor_SFD = make_tensor(block_reference_SFD.host_data(), layout_SFD);
cutlass::reference::host::GettBlockScalingEpilogueParams<
ElementCompute, // ElementScalar
ElementAccumulator, // ElementAccumulator
ElementCompute, // ElementCompute
decltype(tensor_C), // TensorC
decltype(tensor_D), // TensorD
decltype(tensor_SFD), // TensorSfD
cute::Int<OutputSFVectorSize>,
cutlass::reference::host::SfStrategy::SfDGen
> epilogue_params {options.alpha, options.beta, tensor_C, tensor_D, tensor_SFD, block_Normconst.at(cutlass::make_Coord(0))};
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
// Comparison
block_D.sync_host();
bool passed = cutlass::reference::host::TensorEquals(block_reference_D.host_view(), block_D.host_view());
passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0);
passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0);
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
cudaDeviceSynchronize();
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit (or newer) to run this example,
// and the GPU must have compute capability 100.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where it is a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && props.minor == 0)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Gemm>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -0,0 +1,544 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture.
This example demonstrates a simple way to instantiate and run a mixed precision blockscaled GEMM on the NVIDIA Blackwell SM100 architecture.
This Blackwell SM100 CUTLASS kernel uses the new Block Scaled Tensor Core MMA Instructions (tcgen05.mma.blockscaled) introduced
on the Blackwell architecture (sm100a), which have the same throughput as fp8 Tensor Core MMA instructions (tcgen05.mma)
and 2x the throughput of fp8 Hopper Tensor Core MMA instructions (WGMMA) (see https://docs.nvidia.com/cuda/parallel-thread-execution).
Similar to 72a_blackwell_nvfp4_fp32_gemm, this kernel leverages:
1. Blockscaled tcgen05.mma instructions.
2. Per-SM memory called Tensor Memory (TMEM) (Please refer to CUDA 12.8 docs on https://docs.nvidia.com/cuda/).
3. The extended warp-specialized kernel design introduced in Hopper enabled by use of TMEM
which allows us to decouple the execution of MMA and epilogue into separate warps.
4. A new SW controlled dynamic scheduler based on cluster launch control (See https://docs.nvidia.com/cuda/parallel-thread-execution).
Usage:
$ ./examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm --m=2048 --n=2048 --k=2048
*/
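// Operand and output precisions used below in this example's kernel configuration:
//   A:   MXFP8 (cutlass::mx_float8_t<cutlass::float_e4m3_t>, e4m3 data with block scale factors)
//   B:   MXFP4 (cutlass::mx_float4_t<cutlass::float_e2m1_t>, e2m1 data with block scale factors)
//   C/D: bfloat16, with accumulation in float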
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/detail/sm100_blockscaled_layout.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include <iostream>
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::mx_float8_t<cutlass::float_e4m3_t>; // Element type for A matrix operand
using LayoutATag = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 16; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::mx_float4_t<cutlass::float_e2m1_t>; // Element type for B matrix operand
using LayoutBTag = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementD = cutlass::bfloat16_t; // Element type for D matrix operand
using ElementC = cutlass::bfloat16_t; // Element type for C matrix operand
using LayoutCTag = cutlass::layout::RowMajor; // Layout type for C matrix operand
using LayoutDTag = cutlass::layout::RowMajor; // Layout type for D matrix operand
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes)
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; // Operator class tag
// Kernel Perf config
using MmaTileShape = Shape<_256,_256,_256>; // MMA's tile size
using ClusterShape = Shape<_4,_4,_1>; // Shape of the threadblocks in a cluster
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutCTag, AlignmentC,
ElementD, LayoutDTag, AlignmentD,
cutlass::epilogue::collective::EpilogueScheduleAuto // Epilogue schedule policy
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutATag, AlignmentA,
ElementB, LayoutBTag, AlignmentB,
ElementAccumulator,
MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::gemm::collective::KernelScheduleAuto // Kernel schedule policy. Auto or using targeted scheduling policy
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
void>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Reference device GEMM implementation type
using StrideA = typename Gemm::GemmKernel::StrideA;
using LayoutA = decltype(cute::make_layout(make_shape(0,0,0), StrideA{}));
using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFA; // Scale factor tensors have an interleaved layout, so carry a Layout instead of a stride.
using StrideB = typename Gemm::GemmKernel::StrideB;
using LayoutB = decltype(cute::make_layout(make_shape(0,0,0), StrideB{}));
using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::LayoutSFB; // Scale factor tensors have an interleaved layout, so carry a Layout instead of a stride.
using StrideC = typename Gemm::GemmKernel::StrideC;
using LayoutC = decltype(cute::make_layout(make_shape(0,0,0), StrideC{}));
using StrideD = typename Gemm::GemmKernel::StrideD;
using LayoutD = decltype(cute::make_layout(make_shape(0,0,0), StrideD{}));
//
// Data members
//
/// Initialization
StrideA stride_A;
LayoutA layout_A;
LayoutSFA layout_SFA;
StrideB stride_B;
LayoutB layout_B;
LayoutSFB layout_SFB;
StrideC stride_C;
LayoutC layout_C;
StrideD stride_D;
LayoutD layout_D;
uint64_t seed;
// The HostTensors are only used for allocating memory on the host and device, and for transferring data between host and device.
// Use cute::Tensor and cute::Layout for iterating through the matrix elements.
cutlass::HostTensor<ElementA::DataType, cutlass::layout::PackedVectorLayout> block_A;
cutlass::HostTensor<ElementA::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFA;
cutlass::HostTensor<ElementB::DataType, cutlass::layout::PackedVectorLayout> block_B;
cutlass::HostTensor<ElementB::ScaleFactorType, cutlass::layout::PackedVectorLayout> block_SFB;
cutlass::HostTensor<ElementC, cutlass::layout::PackedVectorLayout> block_C;
// Output Tensor
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_D;
// Reference Output Tensor
cutlass::HostTensor<ElementD, cutlass::layout::PackedVectorLayout> block_reference_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
template <typename T>
auto make_iterator(T* ptr) {
using namespace cute;
if constexpr (cute::is_subbyte_v<T>) {
return subbyte_iterator<T>(ptr);
}
else {
return ptr;
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
Options():
help(false),
m(1024), n(1024), k(1024),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "72c_blackwell_mixed_mxfp8_bf16_gemm\n\n"
<< " Blackwell Mxfp8 x Mxfp4 GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out << "\n\nExamples:\n\n"
<< "$ " << "/examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm" << " --m=1024 --n=512 --k=1024 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <typename Element, typename Layout>
bool initialize_block(
cutlass::TensorView<Element, Layout> view,
uint64_t seed) {
double scope_max, scope_min;
constexpr int bits_input = cutlass::sizeof_bits<Element>::value;
if constexpr (bits_input == 1) {
scope_max = 2;
scope_min = 0;
}
else if constexpr (bits_input <= 6) {
scope_max = 2;
scope_min = -2;
}
else if constexpr (bits_input <= 8) {
if constexpr (cute::is_same_v<Element, cutlass::float_ue8m0_t>) {
scope_max = 4;
scope_min = 1;
}
else {
scope_max = 1;
scope_min = -1;
}
}
else {
scope_max = 4;
scope_min = -4;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
using namespace cute;
// For SFA and SFB tensors layouts
using Sm100BlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm100BlkScaledConfig;
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
layout_A = make_layout(make_shape(options.m, options.k, 1), stride_A);
layout_B = make_layout(make_shape(options.n, options.k, 1), stride_B);
layout_C = make_layout(make_shape(options.m, options.n, 1), stride_C);
layout_D = make_layout(make_shape(options.m, options.n, 1), stride_D);
layout_SFA = Sm100BlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(options.m, options.n, options.k, 1));
layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(options.m, options.n, options.k, 1));
block_A.reset(cutlass::make_Coord(size(layout_A)));
block_B.reset(cutlass::make_Coord(size(layout_B)));
block_C.reset(cutlass::make_Coord(size(layout_C)));
block_D.reset(cutlass::make_Coord(size(layout_D)));
block_reference_D.reset(cutlass::make_Coord(size(layout_D)));
block_SFA.reset(cutlass::make_Coord(size(filter_zeros(layout_SFA))));
block_SFB.reset(cutlass::make_Coord(size(filter_zeros(layout_SFB))));
initialize_block(block_A.host_view(), seed + 2021);
initialize_block(block_B.host_view(), seed + 2022);
initialize_block(block_C.host_view(), seed + 2023);
initialize_block(block_SFA.host_view(), seed + 2024);
initialize_block(block_SFB.host_view(), seed + 2025);
block_A.sync_device();
block_B.sync_device();
block_C.sync_device();
block_SFA.sync_device();
block_SFB.sync_device();
}
// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options)
{
typename Gemm::Arguments arguments {
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{ // Mainloop arguments
block_A.device_data(), stride_A,
block_B.device_data(), stride_B,
block_SFA.device_data(), layout_SFA,
block_SFB.device_data(), layout_SFB
},
{ // Epilogue arguments
{options.alpha, options.beta},
block_C.device_data(), stride_C,
block_D.device_data(), stride_D
}
};
return arguments;
}
bool verify(const Options &options) {
using namespace cute;
// Create the arguments for host reference implementation
Tensor tensor_A = make_tensor(make_iterator(block_A.host_data()), layout_A);
Tensor tensor_SFA = make_tensor(block_SFA.host_data(), layout_SFA);
Tensor tensor_B = make_tensor(make_iterator(block_B.host_data()), layout_B);
Tensor tensor_SFB = make_tensor(block_SFB.host_data(), layout_SFB);
cutlass::reference::host::GettBlockScalingMainloopParams<
ElementAccumulator, // ElementAccumulator
decltype(tensor_A), // TensorA
decltype(tensor_SFA), // TensorSfA
decltype(tensor_B), // TensorB
decltype(tensor_SFB) // TensorSfB
> mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB};
auto tensor_C = cute::make_tensor(make_iterator(block_C.host_data()), layout_C);
auto tensor_D = cute::make_tensor(make_iterator(block_reference_D.host_data()), layout_D);
cutlass::reference::host::GettBlockScalingEpilogueParams<
ElementAccumulator, // ElementScalar
ElementAccumulator, // ElementAccumulator
ElementAccumulator, // ElementCompute
decltype(tensor_C), // TensorC
decltype(tensor_D) // TensorD
> epilogue_params{options.alpha, options.beta, tensor_C, tensor_D};
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
// Comparison
block_D.sync_host();
bool passed = cutlass::reference::host::TensorEquals(block_reference_D.host_view(), block_D.host_view());
passed &= (cutlass::reference::host::TensorNorm(block_reference_D.host_view()) > 0);
passed &= (cutlass::reference::host::TensorNorm(block_D.host_view()) > 0);
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
cudaDeviceSynchronize();
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit (or newer) to run this example,
// and the GPU must have compute capability 100.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where it is a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && props.minor == 0)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Gemm>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -0,0 +1,46 @@
# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
72a_blackwell_nvfp4_bf16_gemm
72a_blackwell_nvfp4_bf16_gemm.cu
)
cutlass_example_add_executable(
72b_blackwell_nvfp4_nvfp4_gemm
72b_blackwell_nvfp4_nvfp4_gemm.cu
)
cutlass_example_add_executable(
72c_blackwell_mixed_mxfp8_bf16_gemm
72c_blackwell_mixed_mxfp8_bf16_gemm.cu
)
endif()

View File

@ -0,0 +1,36 @@
# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
73_blackwell_gemm_preferred_cluster
blackwell_gemm_preferred_cluster.cu
)
endif()

View File

@ -0,0 +1,536 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture with preferred cluster.
With NVIDIA Compute Capability 9.0, the CUDA programming model introduced
an optional hierarchy level known as Thread Block Clusters, which consist of multiple Thread Blocks.
While the CUDA programming model has supported the specification of cluster shapes at runtime
(Dynamic Clusters) since the Hopper architecture, CUTLASS has only provided support for Static
Clusters, meaning that cluster shapes must be defined at compile time.
Larger cluster shapes can achieve higher TMA multicast but may result in poor SM occupancy due
to quantization. For instance, a 2x2 cluster on an 18 SM GPU would only utilize 16 SMs, leaving
2 SMs idle.
Starting with Compute Capability 10.0, the CUDA programming model adds the ability to specify
two clusters: preferred cluster and fallback cluster. For brevity, we refer to this as
Preferred Clusters. In the scenario above, users can now launch an additional 2x1 cluster to
utilize the 2 idle SMs.
With CUTLASS 3.8, in addition to Dynamic Clusters, CUTLASS adds support for Preferred Dynamic Cluster,
the ability for users to specify two cluster shapes at runtime.
Terminology
* Static cluster: cluster shape is specified at compile time.
* Dynamic cluster: cluster shape is specified at runtime and set by the host.
* Preferred cluster: Kernel can be launched with two cluster shapes (preferred and fallback).
Preferred and fallback cluster shapes are subject to several constraints.
* Preferred cluster depth (Z dimension) must be the same as that of fallback cluster.
* Fallback cluster shape must evenly divide the preferred cluster shape.
* Preferred cluster shape must evenly divide the kernel launch grid shape.
This example demonstrates how to use the Dynamic Clusters and Preferred Clusters features in
CUTLASS 3.x Blackwell SM100 kernels. Users can specify preferred and fallback cluster shapes via GEMM arguments.
# Example:
./73_blackwell_gemm_preferred_cluster" --m=4096 --n=4096 --k=4096 --preferred_cluster_m=4 --preferred_cluster_n=4 --fallback_cluster_m=2 --fallback_cluster_m=1
*/
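// A minimal sketch of how the constraints above map onto the runtime arguments this example builds
// in args_from_options() further below (using the same hw_info fields set there):
//
//   typename Gemm::Arguments arguments = args_from_options(options);
//   // Preferred 4x4x1 with fallback 2x1x1: both have depth 1, and 2x1 evenly divides 4x4,
//   // so the pair is accepted by Options::validate_cluster_shape().
//   arguments.hw_info.cluster_shape          = dim3(4, 4, 1);
//   arguments.hw_info.cluster_shape_fallback = dim3(2, 1, 1);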
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = half_t; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = half_t; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementC = float; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
// MMA and Cluster Tile Shapes
// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape % 2 == 0
using MmaTileShape_MNK = Shape<_256,_128,_64>;
// Shape of the cluster set to <int,int,_1> to indicate dynamic cluster shape
using ClusterShape_MNK = Shape<int,int,_1>;
// When dynamic cluster is used, KernelScheduleAuto always selects mainloop dispatch policy that
// lowers to tcgen05 MMA cta_group = 1 as we don't know if the dynamic cluster M dimension will be a multiple of 2
// To use tcgen05 MMA cta_group = 2, users must explicitly use 2sm builder schedules
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecialized2SmSm100;
using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized2Sm;
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementC, LayoutC, AlignmentC,
EpilogueSchedule
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int, int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
void // <--- Default to cluster launch control (CLC) scheduler
>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Reference device GEMM implementation type
using DeviceGemmReference = cutlass::reference::device::Gemm<
ElementA,
LayoutA,
ElementB,
LayoutB,
ElementC,
LayoutC,
ElementAccumulator,
ElementAccumulator>;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = typename Gemm::GemmKernel::StrideD;
//
// Data members
//
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
int preferred_cluster_m, preferred_cluster_n, fallback_cluster_m, fallback_cluster_n;
Options():
help(false),
m(4096), n(4096), k(4096),
alpha(1.f), beta(0.f),
iterations(10),
preferred_cluster_m(4),
preferred_cluster_n(4),
fallback_cluster_m(2),
fallback_cluster_n(1)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
cmd.get_cmd_line_argument("preferred_cluster_m", preferred_cluster_m, 4);
cmd.get_cmd_line_argument("preferred_cluster_n", preferred_cluster_n, 4);
cmd.get_cmd_line_argument("fallback_cluster_m", fallback_cluster_m, 2);
cmd.get_cmd_line_argument("fallback_cluster_n", fallback_cluster_n, 1);
if (!validate_cluster_shape()){
std::cout << "--Invalid cluster shapes" << std::endl;
help = true;
return;
}
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "73_blackwell_gemm_preferred_cluster\n\n"
<< " Blackwell FP16 GEMM using preferred cluster.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n"
<< " --preferred_cluster_m=<str> Sets the M extent of preferred cluster shape\n"
<< " --preferred_cluster_n=<str> Sets the N extent of preferred cluster shape\n"
<< " --fallback_cluster_m=<str> Sets the M extent of fallback cluster shape\n"
<< " --fallback_cluster_n=<str> Sets the N extent of fallback cluster shape\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out << "Preferred cluster shape cannot be smaller than fallback cluster shape.\n"
<< "Preferred cluster shape must be a multiple of fallback cluster shape.\n\n";
out << "\n\nExamples:\n\n"
<< "$ " << "73_blackwell_gemm_preferred_cluster" << " --m=4096 --n=4096 --k=4096 --preferred_cluster_m=4 --preferred_cluster_n=4 --fallback_cluster_m=2 --fallback_cluster_m=1\n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const {
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
private:
/// Validate preferred and fallback cluster shapes
bool validate_cluster_shape() {
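// For example, preferred 4x4 with fallback 2x1 passes both checks, while preferred 4x4 with
// fallback 3x1 is rejected because 3 does not evenly divide 4.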
if (preferred_cluster_m < fallback_cluster_m || preferred_cluster_n < fallback_cluster_n) {
std::cout << "--Preferred cluster cannot be smaller than fallback cluster" << std::endl;
return false;
}
if (preferred_cluster_m % fallback_cluster_m != 0 || preferred_cluster_n % fallback_cluster_n != 0) {
std::cout << "--Preferred cluster must be a multiple of fallback cluster" << std::endl;
return false;
}
return true;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(cutlass::DeviceAllocation<Element>& block, uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
block_A.reset(options.m * options.k);
block_B.reset(options.k * options.n);
block_C.reset(options.m * options.n);
block_D.reset(options.m * options.n);
block_ref_D.reset(options.m * options.n);
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
/// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options) {
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{block_A.get(), stride_A, block_B.get(), stride_B},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
arguments.hw_info.cluster_shape = dim3(options.preferred_cluster_m, options.preferred_cluster_n,1);
arguments.hw_info.cluster_shape_fallback = dim3(options.fallback_cluster_m, options.fallback_cluster_n,1);
return arguments;
}
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), Gemm::LayoutA::packed({options.m, options.k}));
cutlass::TensorRef ref_B(block_B.get(), Gemm::LayoutB::packed({options.k, options.n}));
cutlass::TensorRef ref_C(block_C.get(), Gemm::LayoutC::packed({options.m, options.n}));
cutlass::TensorRef ref_D(block_ref_D.get(), Gemm::LayoutD::packed({options.m, options.n}));
//
// Compute reference output
//
// Create instantiation for device reference gemm kernel
DeviceGemmReference gemm_reference;
// Launch device reference gemm kernel
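// (i.e., computes ref_D = alpha * (A x B) + beta * C, accumulating in ElementAccumulator)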
gemm_reference(
{options.m, options.n, options.k},
ElementAccumulator(options.alpha),
ref_A,
ref_B,
ElementAccumulator(options.beta),
ref_C,
ref_D);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Execute a given example GEMM computation
int run(Options &options) {
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << "GEMM with"
<< " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k
<< " Preferred Cluster = (" << options.preferred_cluster_m << ", " << options.preferred_cluster_n << ", 1)"
<< " Fallback Cluster = (" << options.fallback_cluster_m << ", " << options.fallback_cluster_n << ", 1)"
<< std::endl;
std::cout << "--------------------------------------------------------------------------------" << std::endl;
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit (or newer) to run this example,
// and the GPU must have compute capability 100.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where it is a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || props.minor != 0) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}

View File

@ -0,0 +1,37 @@
# Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
74_blackwell_gemm_streamk
blackwell_gemm_streamk.cu
)
endif()

View File

@ -0,0 +1,587 @@
/***************************************************************************************************
* Copyright (c) 2025 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A GEMM example using CUTLASS for the NVIDIA Blackwell SM100 architecture with the Stream-K scheduler.
Stream-K is a GEMM parallelization technique that attempts to reduce load imbalance across SMs
by parallelizing certain output tiles across the K mode of the GEMM, without using a static splitting factor.
For complete details on Stream-K, please see https://arxiv.org/abs/2301.03598.
CUTLASS's Stream-K scheduler using the CUTLASS 3.x API is capable of supporting various modes of
decomposing a GEMM (referred to as "decomposition modes" in this example):
* DataParallel: basic GEMM parallelized spatially via tiling, but without splitting the K mode
* SplitK: `split_factor` CTAs compute portions of the K mode for a given output tile and reduce their results
* StreamK: parallelizes work according to the stream-K load balancing method described in https://arxiv.org/abs/2301.03598
* Heuristic: applies an internal heuristic in an attempt to choose the most performant among the three preceding decomposition modes
Additionally, the Stream-K scheduler supports two different means of performing reductions for
decomposition modes that require reduction (SplitK, StreamK, and Heuristic):
* Deterministic: Participating CTAs perform reduction in a turnstile fashion in order of the K mode
covered by each CTA. This requires a lock to be held exclusively by the CTA that is
currently accumulating.
* Nondeterministic: Participating CTAs perform reduction atomically to the same workspace (mostly) without locking.
Locks are used only to wait for the first CTA to write its partial values (to initialize the
workspace), and for all but the final CTA to have accumulated (so that the final CTA can load
the accumulated value and accumulate it into registers on top of which the epilogue will
be performed). Due to the nondeterminsitic ordering of accumulation, deterministic numeric
behavior cannot be guaranteed with this mode (e.g., floating-point rounding error will depend
on the order of accumulation)
This example allows one to try out different decomposition modes, reduction modes, and (when using Split-K) splitting factors.
Here are a few examples of usage:
# Heuristic mode with deterministic reduction
./74_blackwell_gemm_streamk" --m=256 --n=256 --k=16384 --decomposition=Heuristic --reduction=Deterministic
# Stream-K mode with determinsitic reduction
./74_blackwell_gemm_streamk" --m=256 --n=256 --k=16384 --decomposition=StreamK --reduction=Deterministic
# Split-K mode with a splitting factor of 2 and deterministic reduction
./74_blackwell_gemm_streamk" --m=256 --n=256 --k=16384 --decomposition=SplitK --reduction=Deterministic --splits=2
# Stream-K mode with nondeterministic reduction
./74_blackwell_gemm_streamk" --m=256 --n=256 --k=16384 --decomposition=StreamK --reduction=Nondeterministic
*/
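// In CUTLASS 3.x stream-K kernels, the decomposition and reduction modes above are typically
// selected through the tile scheduler's arguments. A rough sketch (the scheduler field names below
// follow CUTLASS's existing stream-K tile scheduler and are assumptions here, not taken from this
// file):
//
//   typename Gemm::Arguments arguments = /* populated from Options as in the other examples */;
//   arguments.scheduler.decomposition_mode = /* DataParallel | SplitK | StreamK | Heuristic */;
//   arguments.scheduler.reduction_mode     = /* Deterministic | Nondeterministic */;
//   arguments.scheduler.splits             = 2;   // splitting factor; only meaningful for SplitK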
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/kernel/tile_scheduler_params.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = half_t; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = half_t; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementC = float; // Element type for C and D matrix operands
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
// MMA and Cluster Tile Shapes
// Shape of the tile computed by tcgen05 MMA, could be across 2 SMs if Cluster Shape % 2 == 0
using MmaTileShape_MNK = Shape<_256,_128,_64>;
// Shape of the cluster set to <int,int,_1> to indicate dynamic cluster shape
using ClusterShape_MNK = Shape<int,int,_1>;
// When dynamic cluster is used, KernelScheduleAuto always selects mainloop dispatch policy that
// lowers to tcgen05 MMA cta_group = 1 as we don't know if the dynamic cluster M dimension will be a multiple of 2
// To use tcgen05 MMA cta_group = 2, users must explicitly use 2sm builder schedules
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecialized2SmSm100;
using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized2Sm;
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC, AlignmentC,
ElementC, LayoutC, AlignmentC,
cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutA, AlignmentA,
ElementB, LayoutB, AlignmentB,
ElementAccumulator,
MmaTileShape_MNK, ClusterShape_MNK,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int, int>, // Indicates ProblemShape
CollectiveMainloop,
CollectiveEpilogue,
cutlass::gemm::StreamKScheduler // <--- Change needed to enable the stream-K scheduler
>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
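// Informal note (an assumption, not exercised by this example): a 1SM variant of the same stream-K
// kernel could be configured by swapping in the 1SM builder schedules and a per-SM tile shape, e.g.
//   using MmaTileShape1Sm_MNK = Shape<_128,_128,_64>;
//   using KernelSchedule1Sm   = cutlass::gemm::KernelTmaWarpSpecialized1SmSm100;
//   using EpilogueSchedule1Sm = cutlass::epilogue::TmaWarpSpecialized1Sm;
// and rebuilding the collective mainloop/epilogue with those types.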
// Reference device GEMM implementation type
using DeviceGemmReference = cutlass::reference::device::Gemm<
ElementA,
LayoutA,
ElementB,
LayoutB,
ElementC,
LayoutC,
ElementAccumulator,
ElementAccumulator>;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = typename Gemm::GemmKernel::StrideD;
//
// Data members
//
/// Initialization
StrideA stride_A;
StrideB stride_B;
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int m, n, k;
int preferred_cluster_m, preferred_cluster_n, fallback_cluster_m, fallback_cluster_n;
using DecompositionMode = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90StreamKParams::DecompositionMode;
using ReductionMode = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90StreamKParams::ReductionMode;
DecompositionMode decomposition_mode;
ReductionMode reduction_mode;
int splits;
std::unordered_map<DecompositionMode, std::vector<std::string>> dec_mappings = {
{DecompositionMode::Heuristic, {"Heuristic", "heuristic", "h", "H", ""}},
{DecompositionMode::SplitK, {"SplitK", "split-k", "split-K", "Split-K", "Split-k", "splitk", "Splitk", "splitK", "spk", "SpK", "spK"}},
{DecompositionMode::StreamK, {"StreamK", "stream-k", "stream-K", "Stream-K", "Stream-k", "streamk", "Streamk", "streamK", "stk", "StK", "stK"}},
{DecompositionMode::DataParallel, {"DataParallel", "data-parallel", "dataparallel", "dp", "DP"}}
};
std::unordered_map<ReductionMode, std::vector<std::string>> red_mappings = {
{ReductionMode::Deterministic, {"Deterministic", "deterministic", "d", "D", ""}},
{ReductionMode::Nondeterministic, {"Nondeterministic", "nondeterministic", "n", "N"}}
};
Options():
help(false),
m(256), n(256), k(16384),
alpha(1.f), beta(0.f),
iterations(10),
preferred_cluster_m(4),
preferred_cluster_n(4),
fallback_cluster_m(2),
fallback_cluster_n(1),
decomposition_mode(DecompositionMode::Heuristic),
reduction_mode(ReductionMode::Deterministic),
splits(1)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
cmd.get_cmd_line_argument("splits", splits, 1);
cmd.get_cmd_line_argument("preferred_cluster_m", preferred_cluster_m, 4);
cmd.get_cmd_line_argument("preferred_cluster_n", preferred_cluster_n, 4);
cmd.get_cmd_line_argument("fallback_cluster_m", fallback_cluster_m, 2);
cmd.get_cmd_line_argument("fallback_cluster_n", fallback_cluster_n, 1);
// Parse decomposition mode
std::string decomp_mode;
cmd.get_cmd_line_argument("decomposition", decomp_mode);
bool found = parse_from_options_map(decomp_mode, dec_mappings, decomposition_mode);
if (!found) {
std::cout << "--decomposition must be one of Heuristic, SplitK, StreamK, or DataParallel" << std::endl;
help = true;
return;
}
// Parse reduction mode
std::string red_mode;
cmd.get_cmd_line_argument("reduction", red_mode);
found = parse_from_options_map(red_mode, red_mappings, reduction_mode);
if (!found) {
std::cout << "--reduction must be one of Deterministic and Nondeterministic" << std::endl;
help = true;
return;
}
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "74_blackwell_gemm_streamk\n\n"
<< " Blackwell FP16 GEMM using a stream-K kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM\n"
<< " --n=<int> Sets the N extent of the GEMM\n"
<< " --k=<int> Sets the K extent of the GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n"
<< " --preferred_cluster_m=<str> Sets the M extent of preferred cluster shape\n"
<< " --preferred_cluster_n=<str> Sets the N extent of preferred cluster shape\n"
<< " --fallback_cluster_m=<str> Sets the M extent of fallback cluster shape\n"
<< " --fallback_cluster_n=<str> Sets the N extent of fallback cluster shape\n"
<< " --decomposition=<str> Mode in which the stream-K kernel should decompose the problem. Options: Heuristic (default), SplitK, StreamK, DataParallel\n"
<< " --reduction=<str> Mode in which the stream-K kernel's reduction should be performed. Options: Deterministic (default), Nondeterministic\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "74_blackwell_gemm_streamk" << " --m=256 --n=256 --k=16384 --decomposition=Heuristic --reduction=Deterministic \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const {
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * m * n * k;
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
std::string decomposition_mode_str() const {
return dec_mappings.at(decomposition_mode).at(0);
}
std::string reduction_mode_str() const {
return red_mappings.at(reduction_mode).at(0);
}
private:
template <class T>
bool parse_from_options_map(std::string val, std::unordered_map<T, std::vector<std::string>> options, T& result) const {
for (const auto & [key, values] : options) {
if (std::find(values.begin(), values.end(), val) != values.end()) {
result = key;
return true;
}
}
return false;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(cutlass::DeviceAllocation<Element>& block, uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
stride_A = cutlass::make_cute_packed_stride(StrideA{}, {options.m, options.k, 1});
stride_B = cutlass::make_cute_packed_stride(StrideB{}, {options.n, options.k, 1});
stride_C = cutlass::make_cute_packed_stride(StrideC{}, {options.m, options.n, 1});
stride_D = cutlass::make_cute_packed_stride(StrideD{}, {options.m, options.n, 1});
block_A.reset(options.m * options.k);
block_B.reset(options.k * options.n);
block_C.reset(options.m * options.n);
block_D.reset(options.m * options.n);
block_ref_D.reset(options.m * options.n);
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
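// Informal note on the packed strides built above: for the row-major A used in this example,
// make_cute_packed_stride(StrideA{}, {M, K, 1}) yields a stride of roughly (K, _1, M*K) in
// (M, K, batch) mode order, i.e., the K mode is contiguous in memory. This is an explanatory
// note only, not additional functionality.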
/// Populates a Gemm::Arguments structure from the given commandline options
typename Gemm::Arguments args_from_options(const Options &options) {
typename Gemm::Arguments arguments{
cutlass::gemm::GemmUniversalMode::kGemm,
{options.m, options.n, options.k, 1},
{block_A.get(), stride_A, block_B.get(), stride_B},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
arguments.hw_info.cluster_shape = dim3(options.preferred_cluster_m, options.preferred_cluster_n,1);
arguments.hw_info.cluster_shape_fallback = dim3(options.fallback_cluster_m, options.fallback_cluster_n,1);
arguments.scheduler.splits = options.splits;
arguments.scheduler.decomposition_mode = options.decomposition_mode;
arguments.scheduler.reduction_mode = options.reduction_mode;
return arguments;
}
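// Hedged sketch (hypothetical helper, not part of the original example): the decomposition and
// reduction modes described in the file header are ordinary scheduler arguments, so a fixed
// Split-K configuration can also be built without going through the command line.
typename Gemm::Arguments split_k_args_sketch(const Options &options, int split_count) {
  Options opts = options;
  opts.decomposition_mode = Options::DecompositionMode::SplitK;
  opts.reduction_mode = Options::ReductionMode::Deterministic;
  opts.splits = split_count;
  return args_from_options(opts);
}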
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), Gemm::LayoutA::packed({options.m, options.k}));
cutlass::TensorRef ref_B(block_B.get(), Gemm::LayoutB::packed({options.k, options.n}));
cutlass::TensorRef ref_C(block_C.get(), Gemm::LayoutC::packed({options.m, options.n}));
cutlass::TensorRef ref_D(block_ref_D.get(), Gemm::LayoutD::packed({options.m, options.n}));
//
// Compute reference output
//
// Create instantiation for device reference gemm kernel
DeviceGemmReference gemm_reference;
// Launch device reference gemm kernel
gemm_reference(
{options.m, options.n, options.k},
ElementAccumulator(options.alpha),
ref_A,
ref_B,
ElementAccumulator(options.beta),
ref_C,
ref_D);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
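// Informal note: the bit-exact comparison above works because initialize_block fills the operands
// with small integer values (the trailing argument of 0 passed to BlockFillRandomUniform restricts
// values to integers), so every product and partial sum is exactly representable in the fp32
// accumulator regardless of accumulation order. With non-integral data, a tolerance-based comparison
// would be more appropriate, especially under the Nondeterministic reduction mode described in the
// file header.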
/// Execute a given example GEMM computation
int run(Options &options) {
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << "Stream-K GEMM with"
<< " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k
<< " Preferred Cluster = (" << options.preferred_cluster_m << ", " << options.preferred_cluster_n << ", 1)"
<< " Fallback Cluster = (" << options.fallback_cluster_m << ", " << options.fallback_cluster_n << ", 1)\n"
<< " Decomposition_mode=" << options.decomposition_mode_str()
<< " Split_count=" << options.splits
<< " Reduction_mode=" << options.reduction_mode_str()
<< std::endl;
std::cout << "--------------------------------------------------------------------------------" << std::endl;
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.8 Toolkit to run this example
// and must have compute capability at least 100.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || (props.minor != 0 && props.minor != 1)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100 or 101)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

@ -0,0 +1,806 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Grouped GEMM example using CUTLASS 3 APIs for the NVIDIA Blackwell SM100 architecture.
This example demonstrates an implementation of Grouped GEMM using a TMA + Blackwell SM100 TensorOp-based warp-specialized kernel.
For this example all scheduling work is performed on the device.
The new feature showcased in this example is device-side modification of TMA descriptors
to move between groups/problem_count (represented by groups).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device
To run this example:
$ ./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm --m=2048 --n=2048 --k=2048 --groups=10
The above command sets all 10 groups to the given m, n, and k sizes.
Omitting any of the problem dimensions randomizes that dimension across the different groups.
The same applies to alpha and beta, which are randomized across the groups when not specified.
To run this example for a set of problems using the benchmark option:
$ ./examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm --benchmark=./test_benchmark.txt
Where test_benchmark.txt may look like the following:
0 256x512x128
1 256x512x512
2 512x256x128
3 256x256x128
4 256x512x1024
5 1024x512x128 and so on
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <float.h>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>; // <M,N,K> per group
using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand
using ElementB = cutlass::float_e4m3_t; // Element type for B matrix operand
using ElementC = cutlass::half_t; // Element type for C and D matrix operands
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using LayoutC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Alignment of C matrix in units of elements (up to 16 bytes)
// Core kernel configurations
using ElementAccumulator = float; // Element type for internal accumulation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
// Runtime Cluster Shape
using ClusterShape = Shape<int32_t,int32_t,_1>;
// Different configs for 1SM and 2SM MMA kernel
struct MMA1SMConfig {
using MmaTileShape = Shape<_128,_256,Int<128 / sizeof(ElementA)>>;
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100; // Kernel to launch
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm; // Epilogue to launch
};
struct MMA2SMConfig {
using MmaTileShape = Shape<_256,_256,Int<128 / sizeof(ElementA)>>;
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmSm100; // Kernel to launch
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm; // Epilogue to launch
};
template <typename ScheduleConfig>
struct GivenGemmSchedule {
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
typename ScheduleConfig::MmaTileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC *, AlignmentC,
ElementC, LayoutC *, AlignmentC,
typename ScheduleConfig::EpilogueSchedule,
cutlass::epilogue::fusion::LinearCombination<ElementC, ElementAccumulator>
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass,
ElementA, LayoutA *, AlignmentA,
ElementB, LayoutB *, AlignmentB,
ElementAccumulator,
typename ScheduleConfig::MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
typename ScheduleConfig::KernelSchedule
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloop,
CollectiveEpilogue
>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
};
using GemmKernel1SM = GivenGemmSchedule<MMA1SMConfig>::GemmKernel;
using Gemm1SM = GivenGemmSchedule<MMA1SMConfig>::Gemm;
using Gemm = Gemm1SM;
using GemmKernel2SM = GivenGemmSchedule<MMA2SMConfig>::GemmKernel;
using Gemm2SM = GivenGemmSchedule<MMA2SMConfig>::Gemm;
// Reference device GEMM implementation type
using DeviceGemmReference = cutlass::reference::device::Gemm<
ElementA,
LayoutA,
ElementB,
LayoutB,
ElementC,
LayoutC,
ElementAccumulator,
ElementAccumulator>;
using StrideA = typename Gemm::GemmKernel::InternalStrideA;
using StrideB = typename Gemm::GemmKernel::InternalStrideB;
using StrideC = typename Gemm::GemmKernel::InternalStrideC;
using StrideD = typename Gemm::GemmKernel::InternalStrideD;
// Host-side allocations
std::vector<int64_t> offset_A;
std::vector<int64_t> offset_B;
std::vector<int64_t> offset_C;
std::vector<int64_t> offset_D;
std::vector<StrideA> stride_A_host;
std::vector<StrideB> stride_B_host;
std::vector<StrideC> stride_C_host;
std::vector<StrideD> stride_D_host;
std::vector<ElementAccumulator> alpha_host;
std::vector<ElementAccumulator> beta_host;
// Device-side allocations
cutlass::DeviceAllocation<typename ProblemShape::UnderlyingProblemShape> problem_sizes;
cutlass::DeviceAllocation<typename Gemm::ElementA> block_A;
cutlass::DeviceAllocation<typename Gemm::ElementB> block_B;
cutlass::DeviceAllocation<typename Gemm::ElementC> block_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_D;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput> block_ref_D;
cutlass::DeviceAllocation<const typename Gemm::ElementA *> ptr_A;
cutlass::DeviceAllocation<const typename Gemm::ElementB *> ptr_B;
cutlass::DeviceAllocation<const typename Gemm::ElementC *> ptr_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput *> ptr_D;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput *> ptr_ref_D;
cutlass::DeviceAllocation<StrideA> stride_A;
cutlass::DeviceAllocation<StrideB> stride_B;
cutlass::DeviceAllocation<StrideC> stride_C;
cutlass::DeviceAllocation<StrideD> stride_D;
// Note, this is an array of pointers to alpha and beta scaling values per group
cutlass::DeviceAllocation<ElementAccumulator*> alpha_device;
cutlass::DeviceAllocation<ElementAccumulator*> beta_device;
cutlass::DeviceAllocation<ElementAccumulator> block_alpha;
cutlass::DeviceAllocation<ElementAccumulator> block_beta;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm100GroupParams<typename ProblemShape::UnderlyingProblemShape>::RasterOrderOptions;
// Command line options parsing
struct Options {
bool help = false;
float alpha = FLT_MAX;
float beta = FLT_MAX;
int iterations = 10;
int m = 1024, n = 2048, k = 512, groups = 10;
dim3 cluster_shape = dim3(4,2,1);
dim3 cluster_shape_fallback = dim3(2,1,1);
RasterOrderOptions raster_order = RasterOrderOptions::AlongM;
int max_sm_count = INT_MAX;
std::string benchmark_path;
std::vector<typename ProblemShape::UnderlyingProblemShape> problem_sizes_host;
int const tma_alignment_bits = 128;
int const alignment = tma_alignment_bits / cutlass::sizeof_bits<ElementA>::value;
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("groups", groups);
cmd.get_cmd_line_argument("alpha", alpha, FLT_MAX);
cmd.get_cmd_line_argument("beta", beta, FLT_MAX);
cmd.get_cmd_line_argument("iterations", iterations);
cmd.get_cmd_line_argument("benchmark", benchmark_path);
cmd.get_cmd_line_argument("cluster_m", cluster_shape.x);
cmd.get_cmd_line_argument("cluster_n", cluster_shape.y);
cmd.get_cmd_line_argument("cluster_fallback_m", cluster_shape_fallback.x);
cmd.get_cmd_line_argument("cluster_fallback_n", cluster_shape_fallback.y);
cmd.get_cmd_line_argument("max_sm_count", max_sm_count, INT_MAX);
// Decide how to initialize the problems
if (!benchmark_path.empty()) {
if (!benchmark_problems()) {
problem_sizes_host.clear();
return;
}
}
else {
randomize_problems(cmd);
}
char raster_char;
cmd.get_cmd_line_argument("raster", raster_char);
if (raster_char == 'N' || raster_char == 'n') {
raster_order = RasterOrderOptions::AlongN;
}
else if (raster_char == 'M' || raster_char == 'm') {
raster_order = RasterOrderOptions::AlongM;
}
}
void randomize_problems(cutlass::CommandLine &cmd) {
int cmd_line_m = -1, cmd_line_n = -1, cmd_line_k = -1;
cmd.get_cmd_line_argument("m", cmd_line_m);
cmd.get_cmd_line_argument("n", cmd_line_n);
cmd.get_cmd_line_argument("k", cmd_line_k);
problem_sizes_host.reserve(groups);
for (int i = groups; i > 0; i--) {
int m = cmd_line_m;
int n = cmd_line_n;
int k = cmd_line_k;
if (m < 1) {
m = alignment * ((rand() % 64) + 1);
}
if (n < 1) {
n = alignment * ((rand() % 64) + 1);
}
if (k < 1) {
k = alignment * ((rand() % 64) + 1);
}
problem_sizes_host.push_back({m, n, k});
}
}
/// Load a benchmark
bool benchmark_problems() {
std::ifstream file(benchmark_path);
if (!file.good()) {
return false;
}
while (file.good()) {
int idx = -1;
std::string extent_str;
file >> idx >> extent_str;
if (idx < 0 || extent_str.empty()) {
break;
}
cutlass::gemm::GemmCoord extent;
std::vector<std::string> tokens;
cutlass::CommandLine::tokenize(tokens, extent_str, 'x');
for (int i = 0; i < int(tokens.size()); ++i) {
int x = std::atoi(tokens.at(i).c_str());
// round up
if (x % alignment) {
x += (alignment - (x % alignment));
}
extent.at(i) = x;
}
if (extent.product()) {
problem_sizes_host.push_back({extent.m(), extent.n(), extent.k()});
}
}
groups = static_cast<int>(problem_sizes_host.size());
return true;
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "75_blackwell_grouped_gemm\n\n"
<< " Blackwell FP8 Grouped GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM for all groups\n"
<< " --n=<int> Sets the N extent of the GEMM for all groups\n"
<< " --k=<int> Sets the K extent of the GEMM for all groups\n"
<< " --groups=<int> Sets the number of individual GEMM problems for Grouped GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --cluster_m=<int> and --cluster_n=<int> Sets the X,Y dims of the preferred cluster shape\n"
<< " --cluster_fallback_m=<int> and --cluster_fallback_n=<int> Sets the X,Y dims of the fallback cluster shape\n\n"
<< " --raster=<char> CTA Rasterization direction (N for along N, M for along M)\n\n"
<< " --iterations=<int> Number of profiling iterations to perform\n\n"
<< " --benchmark=<str> Executes a benchmark problem size\n"
<< " --max_sm_count=<int> Run kernels using only these number of SMs\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "75_blackwell_grouped_gemm" << " --m=1024 --n=512 --k=1024 --groups=10 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s, std::vector<typename ProblemShape::UnderlyingProblemShape> problem_sizes_host) const
{
// Number of real-valued multiply-adds
uint64_t fmas = uint64_t();
for (auto const & problem : problem_sizes_host) {
fmas += static_cast<uint64_t>(get<0>(problem)) *
static_cast<uint64_t>(get<1>(problem)) *
static_cast<uint64_t>(get<2>(problem));
}
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * uint64_t(fmas);
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
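// Hedged sketch (hypothetical helper, not part of the original example): writes a benchmark file in
// the "<index> MxNxK" format that Options::benchmark_problems() above expects, one problem per line.
inline bool write_benchmark_file_sketch(
    const std::string &path,
    const std::vector<typename ProblemShape::UnderlyingProblemShape> &problems) {
  std::ofstream file(path);
  if (!file.good()) {
    return false;
  }
  for (size_t i = 0; i < problems.size(); ++i) {
    file << i << " "
         << cute::get<0>(problems[i]) << "x"
         << cute::get<1>(problems[i]) << "x"
         << cute::get<2>(problems[i]) << "\n";
  }
  return true;
}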
/// Result structure
struct Result
{
double avg_runtime_ms = 0.0;
double gflops = 0.0;
cutlass::Status status = cutlass::Status::kSuccess;
cudaError_t error = cudaSuccess;
bool passed = false;
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = static_cast<Element>(2);
scope_min = static_cast<Element>(0);
} else if (bits_input <= 8) {
scope_max = static_cast<Element>(2);
scope_min = static_cast<Element>(-2);
} else {
scope_max = static_cast<Element>(8);
scope_min = static_cast<Element>(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Allocates device-side data
void allocate(const Options &options) {
int64_t total_elements_A = 0;
int64_t total_elements_B = 0;
int64_t total_elements_C = 0;
int64_t total_elements_D = 0;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
offset_A.push_back(total_elements_A);
offset_B.push_back(total_elements_B);
offset_C.push_back(total_elements_C);
offset_D.push_back(total_elements_D);
int64_t elements_A = M * K;
int64_t elements_B = K * N;
int64_t elements_C = M * N;
int64_t elements_D = M * N;
total_elements_A += elements_A;
total_elements_B += elements_B;
total_elements_C += elements_C;
total_elements_D += elements_D;
stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}));
stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1}));
stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1}));
}
block_A.reset(total_elements_A);
block_B.reset(total_elements_B);
block_C.reset(total_elements_C);
block_D.reset(total_elements_D);
block_ref_D.reset(total_elements_D);
block_alpha.reset(options.groups);
block_beta.reset(options.groups);
}
/// Initialize operands to be used in the GEMM and reference GEMM
void initialize(const Options &options) {
uint64_t seed = 2020;
problem_sizes.reset(options.groups);
problem_sizes.copy_from_host(options.problem_sizes_host.data());
//
// Assign pointers
//
std::vector<ElementA *> ptr_A_host(options.groups);
std::vector<ElementB *> ptr_B_host(options.groups);
std::vector<ElementC *> ptr_C_host(options.groups);
std::vector<ElementC *> ptr_D_host(options.groups);
std::vector<ElementAccumulator *> ptr_alpha_host(options.groups);
std::vector<ElementAccumulator *> ptr_beta_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
ptr_A_host.at(i) = block_A.get() + offset_A.at(i);
ptr_B_host.at(i) = block_B.get() + offset_B.at(i);
ptr_C_host.at(i) = block_C.get() + offset_C.at(i);
ptr_D_host.at(i) = block_D.get() + offset_D.at(i);
alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast<ElementAccumulator>((rand() % 5) + 1) : options.alpha);
beta_host.push_back((options.beta == FLT_MAX) ? static_cast<ElementAccumulator>(rand() % 5) : options.beta);
ptr_alpha_host.at(i) = block_alpha.get() + i;
ptr_beta_host.at(i) = block_beta.get() + i;
}
ptr_A.reset(options.groups);
ptr_A.copy_from_host(ptr_A_host.data());
ptr_B.reset(options.groups);
ptr_B.copy_from_host(ptr_B_host.data());
ptr_C.reset(options.groups);
ptr_C.copy_from_host(ptr_C_host.data());
ptr_D.reset(options.groups);
ptr_D.copy_from_host(ptr_D_host.data());
stride_A.reset(options.groups);
stride_A.copy_from_host(stride_A_host.data());
stride_B.reset(options.groups);
stride_B.copy_from_host(stride_B_host.data());
stride_C.reset(options.groups);
stride_C.copy_from_host(stride_C_host.data());
stride_D.reset(options.groups);
stride_D.copy_from_host(stride_D_host.data());
alpha_device.reset(options.groups);
alpha_device.copy_from_host(ptr_alpha_host.data());
beta_device.reset(options.groups);
beta_device.copy_from_host(ptr_beta_host.data());
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
block_alpha.copy_from_host(alpha_host.data());
block_beta.copy_from_host(beta_host.data());
}
/// Populates a Gemm::Arguments structure from the given commandline options
template <typename Gemm>
typename Gemm::Arguments args_from_options(Options &options, bool host_problem_shapes_available = true)
{
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = min(cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id), options.max_sm_count);
if (!is_static_v<ClusterShape>) {
if (size<0>(typename Gemm::GemmKernel::CollectiveMainloop::AtomThrShapeMNK{}) == 2 &&
(options.cluster_shape.x < 2 || options.cluster_shape_fallback.x < 2)) {
std::cout << "Error: MMA2SMConfig kernel config needs cluster_dim.x >= 2" << std::endl;
}
hw_info.cluster_shape = options.cluster_shape;
hw_info.cluster_shape_fallback = options.cluster_shape_fallback;
}
typename Gemm::Arguments arguments;
decltype(arguments.epilogue.thread) fusion_args;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
// If alpha/beta are provided via command-line arguments as scalars, the same alpha/beta applies to all groups.
// If pointers to alpha/beta are provided, alpha/beta can differ between batches/groups.
if (options.alpha != FLT_MAX){
// Single alpha for all groups
fusion_args.alpha = options.alpha;
fusion_args.alpha_ptr_array = nullptr;
fusion_args.dAlpha = {_0{}, _0{}, 0};
}
else {
fusion_args.alpha = 0;
fusion_args.alpha_ptr_array = alpha_device.get();
// One alpha per group
fusion_args.dAlpha = {_0{}, _0{}, 1};
}
if (options.beta != FLT_MAX) {
// Single beta for all groups
fusion_args.beta = options.beta;
fusion_args.beta_ptr_array = nullptr;
fusion_args.dBeta = {_0{}, _0{}, 0};
}
else {
fusion_args.beta = 0;
fusion_args.beta_ptr_array = beta_device.get();
// One beta per group
fusion_args.dBeta = {_0{}, _0{}, 1};
}
typename Gemm::GemmKernel::TileSchedulerArguments scheduler;
scheduler.raster_order = options.raster_order;
if (host_problem_shapes_available) {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), options.problem_sizes_host.data()},
{ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info, scheduler
};
}
else {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info, scheduler
};
}
return arguments;
}
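// Informal usage note (illustrative, not part of the original flow): the 2SM MMA config pairs two CTAs
// along the cluster X dimension, which is why args_from_options warns when cluster_dim.x < 2. A caller
// selecting the 2SM path would typically also request an even cluster X extent, e.g.
//   options.cluster_shape          = dim3(4, 2, 1);
//   options.cluster_shape_fallback = dim3(2, 1, 1);
//   auto args_2sm = args_from_options<Gemm2SM>(options, /*host_problem_shapes_available=*/false);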
bool verify(const Options &options) {
bool passed = true;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
cutlass::TensorRef ref_A(block_A.get() + offset_A.at(i), Gemm::LayoutA::packed({M, K}));
cutlass::TensorRef ref_B(block_B.get() + offset_B.at(i), Gemm::LayoutB::packed({K, N}));
cutlass::TensorRef ref_C(block_C.get() + offset_C.at(i), Gemm::LayoutC::packed({M, N}));
cutlass::TensorRef ref_D(block_ref_D.get() + offset_D.at(i), Gemm::LayoutD::packed({M, N}));
//
// Compute reference output
//
// Create instantiation for device reference gemm kernel
DeviceGemmReference gemm_reference;
// Launch device reference gemm kernel
gemm_reference(
{M, N, K},
ElementAccumulator(alpha_host.at(i)),
ref_A,
ref_B,
ElementAccumulator(beta_host.at(i)),
ref_C,
ref_D);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
passed &= cutlass::reference::device::BlockCompareEqual(block_ref_D.get() + offset_D.at(i), block_D.get() + offset_D.at(i), M * N);
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options, bool host_problem_shapes_available = true)
{
std::cout << " Problem Sizes, Alpha, Beta " << std::endl;
for (int32_t i = 0; i < options.groups; ++i) {
std::cout << " " << options.problem_sizes_host.at(i);
std::cout << ", " << alpha_host.at(i) << ", " << beta_host.at(i) << std::endl;
}
std::cout << " Groups : " << options.groups << std::endl;
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<Gemm>(options, host_problem_shapes_available);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average setup and runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0, options.problem_sizes_host);
std::cout << " Avg runtime : " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS : " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.8 Toolkit to run this example
if (__CUDACC_VER_MAJOR__ < 12 ||
((__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)
)
) {
std::cerr << "This example requires CUDA 12.8 or newer.\n";
// Returning zero so this test passes on older Toolkits, where its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && props.minor == 0)) {
std::cerr
<< "This example requires a GPU of NVIDIA's Blackwell Architecture (compute capability 100a).\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
allocate(options);
initialize(options);
//
// Evaluate CUTLASS kernels
//
std::cout << "Running kernel with 1SM MMA config:" << std::endl;
run<Gemm1SM>(options, false /*host_problem_shapes_available*/);
std::cout << "Running kernel with 2SM MMA config:" << std::endl;
run<Gemm2SM>(options, false /*host_problem_shapes_available*/);
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

@ -0,0 +1,946 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Grouped GEMM example using CUTLASS 3 APIs for the NVIDIA Blackwell SM100 architecture.
This example demonstrates an implementation of Grouped GEMM using a TMA + Blackwell SM100 TensorOp-based warp-specialized kernel
for narrow precisions (FP4) with Scale Factors (In and Out).
For this example all scheduling work is performed on the device.
The new feature showcased in this example is device-side modification of TMA descriptors
to move between groups/problem_count (represented by groups).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device
To run this example:
$ ./examples/75_blackwell_grouped_gemm_block_scaled/75_blackwell_grouped_gemm_block_scaled --m=2048 --n=2048 --k=2048 --groups=10
The above command sets all 10 groups to the given m, n, and k sizes.
Omitting any of the problem dimensions randomizes that dimension across the different groups.
The same applies to alpha and beta, which are randomized across the groups when not specified.
To run this example for a set of problems using the benchmark option:
$ ./examples/75_blackwell_grouped_gemm_block_scaled/75_blackwell_grouped_gemm_block_scaled --benchmark=./test_benchmark.txt
Where test_benchmark.txt may look like the following:
0 256x512x128
1 256x512x512
2 512x256x128
3 256x256x128
4 256x512x1024
5 1024x512x128 and so on
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <float.h>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "helper.h"
using namespace cute;
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>; // <M,N,K> per group
using ElementInput = cutlass::float_e2m1_t; // Element type for Input matrix operands
using ElementSF = cutlass::float_ue4m3_t; // Element type for SF matrix operands
using ElementC = cutlass::half_t; // Element type for C matrix operands
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// A matrix configuration
using ElementA = cutlass::nv_float4_t<ElementInput>; // Element type for A matrix operand
using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand
constexpr int AlignmentA = 32; // Alignment of A matrix in units of elements (up to 16 bytes)
// B matrix configuration
using ElementB = cutlass::nv_float4_t<ElementInput>; // Element type for B matrix operand
using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand
constexpr int AlignmentB  = 32;                                             // Alignment of B matrix in units of elements (up to 16 bytes)
// C/D matrix configuration
using ElementD = ElementC; // Element type for D matrix operands
using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Alignment of C matrix in units of elements (up to 16 bytes)
constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; // Alignment of D matrix in units of elements (up to 16 bytes)
using ElementAccumulator = float; // Element type for internal accumulation
// using ElementD = cutlass::float_e2m1_t; // Enable for SF Output // Element type for D matrix operands
using ElementSFD = cutlass::float_ue4m3_t; // Element type for SF Output operands
constexpr int OutputSFVectorSize = 16;
using FusionOperation = cutlass::epilogue::fusion::LinCombEltActBlockScaleFactor<
cutlass::epilogue::thread::SiLu,
OutputSFVectorSize,
ElementD,
ElementAccumulator,
ElementSFD,
LayoutC,
ElementC>;
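// Note (based on the "Enable for SF Output" hooks below): to emit block scale factors alongside D,
// pass FusionOperation as the final template argument of the epilogue CollectiveBuilder and switch
// ElementD to a narrow type such as cutlass::float_e2m1_t, as indicated by the commented-out lines
// in this file. FusionOperation is otherwise unused in the default build of this example.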
// Core kernel configurations
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using EpilogueOperatorClass = cutlass::arch::OpClassTensorOp; // Epilogue Operator class tag
using MainloopOperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; // Mainloop Operator class tag
using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size
// Runtime Cluster Shape
using ClusterShape = Shape<int32_t,int32_t,_1>;
// Different configs for 1SM and 2SM MMA kernel
struct MMA1SMConfig {
using MmaTileShape = Shape<_128,_256,_256>;
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmNvf4Sm100; // Kernel to launch
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm; // Epilogue to launch
};
struct MMA2SMConfig {
using MmaTileShape = Shape<_256,_256,_256>;
using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmNvf4Sm100; // Kernel to launch
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm; // Epilogue to launch
};
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, EpilogueOperatorClass,
typename MMA1SMConfig::MmaTileShape, ClusterShape,
Shape<_128,_64>,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC *, AlignmentC,
ElementD, LayoutC *, AlignmentD,
typename MMA1SMConfig::EpilogueSchedule
// , FusionOperation // Enable for SF Output
>::CollectiveOp;
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, MainloopOperatorClass,
ElementA, LayoutA *, AlignmentA,
ElementB, LayoutB *, AlignmentB,
ElementAccumulator,
typename MMA1SMConfig::MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
typename MMA1SMConfig::KernelSchedule
>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloop,
CollectiveEpilogue
>;
using Gemm1SM = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
using Gemm = Gemm1SM;
using CollectiveEpilogue2SM = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, EpilogueOperatorClass,
typename MMA2SMConfig::MmaTileShape, ClusterShape,
Shape<_128,_64>,
ElementAccumulator, ElementAccumulator,
ElementC, LayoutC *, AlignmentC,
ElementD, LayoutC *, AlignmentD,
typename MMA2SMConfig::EpilogueSchedule
// , FusionOperation // Enable for SF Output
>::CollectiveOp;
using CollectiveMainloop2SM = typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, MainloopOperatorClass,
ElementA, LayoutA *, AlignmentA,
ElementB, LayoutB *, AlignmentB,
ElementAccumulator,
typename MMA2SMConfig::MmaTileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
typename MMA2SMConfig::KernelSchedule
>::CollectiveOp;
using GemmKernel2SM = cutlass::gemm::kernel::GemmUniversal<
ProblemShape,
CollectiveMainloop2SM,
CollectiveEpilogue2SM
>;
using Gemm2SM = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel2SM>;
using StrideA = typename Gemm::GemmKernel::InternalStrideA;
using StrideB = typename Gemm::GemmKernel::InternalStrideB;
using StrideC = typename Gemm::GemmKernel::InternalStrideC;
using StrideD = typename Gemm::GemmKernel::InternalStrideD;
using LayoutSFA = typename Gemm::GemmKernel::CollectiveMainloop::InternalLayoutSFA;
using LayoutSFB = typename Gemm::GemmKernel::CollectiveMainloop::InternalLayoutSFB;
using Sm100BlkScaledConfig = typename Gemm::GemmKernel::CollectiveMainloop::Sm100BlkScaledConfig;
using Sm100BlockScaledOutputConfig = cutlass::detail::Sm100BlockScaledOutputConfig<
OutputSFVectorSize,
cute::is_same_v<typename FusionOperation::GmemLayoutTagScalefactor,
cutlass::layout::RowMajor> ? cute::UMMA::Major::K : cute::UMMA::Major::MN
>;
using OutputSFAtom = typename Sm100BlockScaledOutputConfig::SfAtom;
using LayoutSFD = typename Sm100BlockScaledOutputConfig::LayoutSF;
// Host-side allocations
std::vector<StrideA> stride_A_host;
std::vector<StrideB> stride_B_host;
std::vector<LayoutSFA> layout_SFA_host;
std::vector<LayoutSFB> layout_SFB_host;
std::vector<StrideC> stride_C_host;
std::vector<StrideD> stride_D_host;
std::vector<ElementAccumulator> alpha_host;
std::vector<ElementAccumulator> beta_host;
using HostTensorA = cutlass::HostTensor<typename Gemm::ElementA, cutlass::layout::PackedVectorLayout>;
using HostTensorB = cutlass::HostTensor<typename Gemm::ElementB, cutlass::layout::PackedVectorLayout>;
using HostTensorSF = cutlass::HostTensor<typename Gemm::GemmKernel::ElementSF, cutlass::layout::PackedVectorLayout>;
using HostTensorC = cutlass::HostTensor<typename Gemm::ElementC, cutlass::layout::PackedVectorLayout>;
using HostTensorD = cutlass::HostTensor<typename Gemm::EpilogueOutputOp::ElementOutput, cutlass::layout::PackedVectorLayout>;
std::vector<HostTensorA> block_A;
std::vector<HostTensorB> block_B;
std::vector<HostTensorSF> block_SFA;
std::vector<HostTensorSF> block_SFB;
std::vector<HostTensorC> block_C;
std::vector<HostTensorD> block_D;
std::vector<HostTensorSF> block_SFD;
std::vector<HostTensorD> block_ref_D;
// Device-side allocations
cutlass::DeviceAllocation<typename ProblemShape::UnderlyingProblemShape> problem_sizes;
cutlass::DeviceAllocation<const typename Gemm::ElementA *> ptr_A;
cutlass::DeviceAllocation<const typename Gemm::ElementB *> ptr_B;
cutlass::DeviceAllocation<const typename Gemm::GemmKernel::ElementSF *> ptr_SFA;
cutlass::DeviceAllocation<const typename Gemm::GemmKernel::ElementSF *> ptr_SFB;
cutlass::DeviceAllocation<const typename Gemm::ElementC *> ptr_C;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput *> ptr_D;
cutlass::DeviceAllocation<typename Gemm::GemmKernel::ElementSF *> ptr_SFD;
cutlass::DeviceAllocation<typename Gemm::EpilogueOutputOp::ElementOutput *> ptr_ref_D;
cutlass::DeviceAllocation<StrideA> stride_A;
cutlass::DeviceAllocation<StrideB> stride_B;
cutlass::DeviceAllocation<LayoutSFA> layout_SFA;
cutlass::DeviceAllocation<LayoutSFB> layout_SFB;
cutlass::DeviceAllocation<StrideC> stride_C;
cutlass::DeviceAllocation<StrideD> stride_D;
// Note: these are arrays of pointers to per-group alpha and beta scaling values
cutlass::DeviceAllocation<ElementAccumulator*> alpha_device;
cutlass::DeviceAllocation<ElementAccumulator*> beta_device;
cutlass::DeviceAllocation<ElementAccumulator> block_alpha;
cutlass::DeviceAllocation<ElementAccumulator> block_beta;
// A matrix-wide constant value used to scale the output matrix;
// it avoids generating very small FP4 values.
// norm_constant is a single device-side value; it is not per-batch or per-group.
cutlass::DeviceAllocation<ElementAccumulator> norm_constant_device;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
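/// Returns an element iterator for host-side reference tensors: a cute::subbyte_iterator for
/// sub-byte element types (e.g. 4-bit operands or scale factors), otherwise the raw pointer.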
template <typename T>
auto make_iterator(T* ptr) {
using namespace cute;
if constexpr (cute::is_subbyte_v<T>) {
return subbyte_iterator<T>(ptr);
}
else {
return ptr;
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
using RasterOrderOptions = typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm100GroupParams<typename ProblemShape::UnderlyingProblemShape>::RasterOrderOptions;
// Command line options parsing
struct Options {
bool help = false;
bool verification = true;
float alpha = FLT_MAX;
float beta = FLT_MAX;
float norm_constant = 1.0;
int iterations = 10;
int m = 1024, n = 2048, k = 512, groups = 10;
dim3 cluster_shape = dim3(2,1,1);
dim3 cluster_shape_fallback = dim3(2,1,1);
RasterOrderOptions raster_order = RasterOrderOptions::AlongN;
int max_sm_count = INT_MAX;
std::string benchmark_path;
std::vector<typename ProblemShape::UnderlyingProblemShape> problem_sizes_host;
int const tma_alignment_bits = 128;
int const alignment = tma_alignment_bits / cutlass::sizeof_bits<ElementInput>::value;
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
if (cmd.check_cmd_line_flag("no-verif")) {
verification = false;
}
cmd.get_cmd_line_argument("m", m);
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("groups", groups);
cmd.get_cmd_line_argument("alpha", alpha, FLT_MAX);
cmd.get_cmd_line_argument("beta", beta, FLT_MAX);
cmd.get_cmd_line_argument("norm_constant", norm_constant, float(1.0));
cmd.get_cmd_line_argument("iterations", iterations);
cmd.get_cmd_line_argument("benchmark", benchmark_path);
cmd.get_cmd_line_argument("cluster_m", cluster_shape.x);
cmd.get_cmd_line_argument("cluster_n", cluster_shape.y);
cmd.get_cmd_line_argument("cluster_fallback_m", cluster_shape_fallback.x);
cmd.get_cmd_line_argument("cluster_fallback_n", cluster_shape_fallback.y);
cmd.get_cmd_line_argument("max_sm_count", max_sm_count, INT_MAX);
// Decide how to initialize the problems
if (!benchmark_path.empty()) {
if (!benchmark_problems()) {
problem_sizes_host.clear();
return;
}
}
else {
randomize_problems(cmd);
}
char raster_char;
cmd.get_cmd_line_argument("raster", raster_char);
if (raster_char == 'N' || raster_char == 'n') {
raster_order = RasterOrderOptions::AlongN;
}
else if (raster_char == 'M' || raster_char == 'm') {
raster_order = RasterOrderOptions::AlongM;
}
}
void randomize_problems(cutlass::CommandLine &cmd) {
int cmd_line_m = -1, cmd_line_n = -1, cmd_line_k = -1;
cmd.get_cmd_line_argument("m", cmd_line_m);
cmd.get_cmd_line_argument("n", cmd_line_n);
cmd.get_cmd_line_argument("k", cmd_line_k);
problem_sizes_host.reserve(groups);
for (int i = groups; i > 0; i--) {
int m = cmd_line_m;
int n = cmd_line_n;
int k = cmd_line_k;
if (m < 1) {
m = alignment * ((rand() % 64) + 1);
}
if (n < 1) {
n = alignment * ((rand() % 64) + 1);
}
if (k < 1) {
k = alignment * ((rand() % 64) + 1);
}
problem_sizes_host.push_back({m, n, k});
}
}
/// Load a benchmark
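/// Each line of the benchmark file is expected to hold "<index> <M>x<N>x<K>" (e.g. "0 1024x512x2048");
/// extents are rounded up to the TMA alignment before being recorded.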
bool benchmark_problems() {
std::ifstream file(benchmark_path);
if (!file.good()) {
return false;
}
while (file.good()) {
int idx = -1;
std::string extent_str;
file >> idx >> extent_str;
if (idx < 0 || extent_str.empty()) {
break;
}
cutlass::gemm::GemmCoord extent;
std::vector<std::string> tokens;
cutlass::CommandLine::tokenize(tokens, extent_str, 'x');
for (int i = 0; i < int(tokens.size()); ++i) {
int x = std::atoi(tokens.at(i).c_str());
// round up
if (x % alignment) {
x += (alignment - (x % alignment));
}
extent.at(i) = x;
}
if (extent.product()) {
problem_sizes_host.push_back({extent.m(), extent.n(), extent.k()});
}
}
groups = static_cast<int>(problem_sizes_host.size());
return true;
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "75_blackwell_grouped_gemm_block_scaled\n\n"
<< " Blackwell Block Scaled Narrow Precision Grouped GEMM using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --m=<int> Sets the M extent of the GEMM for all groups\n"
<< " --n=<int> Sets the N extent of the GEMM for all groups\n"
<< " --k=<int> Sets the K extent of the GEMM for all groups\n"
<< " --groups=<int> Sets the number of individual GEMM problems for Grouped GEMM\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n"
<< " --norm_constant=<f32> Epilogue scalar normalization constant for the output matrix\n\n"
<< " --cluster_m=<int> and --cluster_n=<int> Sets the X,Y dims of the preferred cluster shape\n"
<< " --cluster_fallback_m=<int> and --cluster_fallback_n=<int> Sets the X,Y dims of the fallback cluster shape\n\n"
<< " --raster=<char> CTA Rasterization direction (N for along N, M for along M)\n\n"
<< " --iterations=<int> Number of profiling iterations to perform\n\n"
<< " --benchmark=<str> Executes a benchmark problem size\n"
<< " --max_sm_count=<int> Run kernels using only these number of SMs\n"
<< " --no-verif Do not run (host-side) verification kernels\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "75_blackwell_grouped_gemm_block_scaled" << " --m=1024 --n=512 --k=1024 --groups=10 --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s, std::vector<typename ProblemShape::UnderlyingProblemShape> problem_sizes_host) const
{
// Number of real-valued multiply-adds
uint64_t fmas = uint64_t();
for (auto const & problem : problem_sizes_host) {
fmas += static_cast<uint64_t>(get<0>(problem)) *
static_cast<uint64_t>(get<1>(problem)) *
static_cast<uint64_t>(get<2>(problem));
}
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * uint64_t(fmas);
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms = 0.0;
double gflops = 0.0;
cutlass::Status status = cutlass::Status::kSuccess;
cudaError_t error = cudaSuccess;
bool passed = false;
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// GEMM setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
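/// The random fill range depends on the element width: narrow ranges for sub-byte and 8-bit types,
/// and small positive values for the ue8m0 scale-factor type.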
template <typename Element, typename Layout>
bool initialize_block(
cutlass::TensorView<Element, Layout> view,
uint64_t seed) {
double scope_max, scope_min;
constexpr int bits_input = cutlass::sizeof_bits<Element>::value;
if constexpr (bits_input == 1) {
scope_max = 2;
scope_min = 0;
}
else if constexpr (bits_input <= 6) {
scope_max = 2;
scope_min = -2;
}
else if constexpr (bits_input <= 8) {
if constexpr (cute::is_same_v<Element, cutlass::float_ue8m0_t>) {
scope_max = 4;
scope_min = 1;
}
else {
scope_max = 1;
scope_min = -1;
}
}
else{
scope_max = 4;
scope_min = -4;
}
cutlass::reference::host::TensorFillRandomUniform(
view, seed, scope_max, scope_min, 0);
return true;
}
/// Allocates device-side data
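/// For each group, packed strides and block-scale-factor layouts are derived from its (M, N, K);
/// scale-factor tensors are sized with filter_zeros() so only the physically stored elements are allocated.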
void allocate(const Options &options) {
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
auto stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1});
auto stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1});
auto stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1});
auto stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1});
auto layout_A = make_layout(make_shape(M, K, 1), stride_A);
auto layout_B = make_layout(make_shape(N, K, 1), stride_B);
auto layout_C = make_layout(make_shape(M, N, 1), stride_C);
auto layout_D = make_layout(make_shape(M, N, 1), stride_D);
auto layout_SFA = Sm100BlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, 1));
auto layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(M, N, K, 1));
auto layout_SFD = Sm100BlockScaledOutputConfig::tile_atom_to_shape_SFD(cute::make_shape(M, N, K, 1));
stride_A_host.push_back(stride_A);
stride_B_host.push_back(stride_B);
layout_SFA_host.push_back(layout_SFA);
layout_SFB_host.push_back(layout_SFB);
stride_C_host.push_back(stride_C);
stride_D_host.push_back(stride_D);
block_A.push_back(HostTensorA(cutlass::make_Coord(size(layout_A))));
block_B.push_back(HostTensorB(cutlass::make_Coord(size(layout_B))));
block_SFA.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFA)))));
block_SFB.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFB)))));
block_C.push_back(HostTensorC(cutlass::make_Coord(size(layout_C))));
block_D.push_back(HostTensorD(cutlass::make_Coord(size(layout_D))));
block_SFD.push_back(HostTensorSF(cutlass::make_Coord(size(filter_zeros(layout_SFD)))));
block_ref_D.push_back(HostTensorD(cutlass::make_Coord(size(layout_D))));
}
block_alpha.reset(options.groups);
block_beta.reset(options.groups);
}
/// Initialize operands to be used in the GEMM and reference GEMM
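/// Host data is generated and copied to the device per group; the per-group pointer, stride, and
/// scale-factor layout arrays are then staged in host vectors and uploaded for the grouped kernel.
/// Alpha/beta are taken from the command line if provided, otherwise randomized per group.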
void initialize(const Options &options) {
uint64_t seed = 2020;
problem_sizes.reset(options.groups);
problem_sizes.copy_from_host(options.problem_sizes_host.data());
//
// Assign pointers
//
std::vector<typename Gemm::ElementA *> ptr_A_host(options.groups);
std::vector<typename Gemm::ElementB *> ptr_B_host(options.groups);
std::vector<typename Gemm::GemmKernel::ElementSF *> ptr_SFA_host(options.groups);
std::vector<typename Gemm::GemmKernel::ElementSF *> ptr_SFB_host(options.groups);
std::vector<typename Gemm::ElementC *> ptr_C_host(options.groups);
std::vector<typename Gemm::EpilogueOutputOp::ElementOutput *> ptr_D_host(options.groups);
std::vector<typename Gemm::GemmKernel::ElementSF *> ptr_SFD_host(options.groups);
std::vector<ElementAccumulator *> ptr_alpha_host(options.groups);
std::vector<ElementAccumulator *> ptr_beta_host(options.groups);
for (int32_t i = 0; i < options.groups; ++i) {
initialize_block(block_A.at(i).host_view(), seed + 2021);
initialize_block(block_B.at(i).host_view(), seed + 2022);
initialize_block(block_C.at(i).host_view(), seed + 2023);
initialize_block(block_SFA.at(i).host_view(), seed + 2024);
initialize_block(block_SFB.at(i).host_view(), seed + 2025);
block_A.at(i).sync_device();
block_B.at(i).sync_device();
block_C.at(i).sync_device();
block_SFA.at(i).sync_device();
block_SFB.at(i).sync_device();
ptr_A_host.at(i) = block_A.at(i).device_data();
ptr_B_host.at(i) = block_B.at(i).device_data();
ptr_SFA_host.at(i) = block_SFA.at(i).device_data();
ptr_SFB_host.at(i) = block_SFB.at(i).device_data();
ptr_C_host.at(i) = block_C.at(i).device_data();
ptr_D_host.at(i) = block_D.at(i).device_data();
ptr_SFD_host.at(i) = block_SFD.at(i).device_data();
alpha_host.push_back((options.alpha == FLT_MAX) ? static_cast<ElementAccumulator>((rand() % 5) + 1) : options.alpha);
beta_host.push_back((options.beta == FLT_MAX) ? static_cast<ElementAccumulator>(rand() % 5) : options.beta);
ptr_alpha_host.at(i) = block_alpha.get() + i;
ptr_beta_host.at(i) = block_beta.get() + i;
}
ptr_A.reset(options.groups);
ptr_A.copy_from_host(ptr_A_host.data());
ptr_B.reset(options.groups);
ptr_B.copy_from_host(ptr_B_host.data());
ptr_SFA.reset(options.groups);
ptr_SFA.copy_from_host(ptr_SFA_host.data());
ptr_SFB.reset(options.groups);
ptr_SFB.copy_from_host(ptr_SFB_host.data());
ptr_C.reset(options.groups);
ptr_C.copy_from_host(ptr_C_host.data());
ptr_D.reset(options.groups);
ptr_D.copy_from_host(ptr_D_host.data());
ptr_SFD.reset(options.groups);
ptr_SFD.copy_from_host(ptr_SFD_host.data());
stride_A.reset(options.groups);
stride_A.copy_from_host(stride_A_host.data());
stride_B.reset(options.groups);
stride_B.copy_from_host(stride_B_host.data());
layout_SFA.reset(options.groups);
layout_SFA.copy_from_host(layout_SFA_host.data());
layout_SFB.reset(options.groups);
layout_SFB.copy_from_host(layout_SFB_host.data());
stride_C.reset(options.groups);
stride_C.copy_from_host(stride_C_host.data());
stride_D.reset(options.groups);
stride_D.copy_from_host(stride_D_host.data());
alpha_device.reset(options.groups);
alpha_device.copy_from_host(ptr_alpha_host.data());
beta_device.reset(options.groups);
beta_device.copy_from_host(ptr_beta_host.data());
block_alpha.copy_from_host(alpha_host.data());
block_beta.copy_from_host(beta_host.data());
norm_constant_device.reset(1);
norm_constant_device.copy_from_host(&options.norm_constant);
}
/// Populates a Gemm::Arguments structure from the given commandline options
template <typename Gemm>
typename Gemm::Arguments args_from_options(Options &options, bool host_problem_shapes_available = true)
{
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
hw_info.sm_count = min(cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id), options.max_sm_count);
if (!is_static_v<ClusterShape>) {
if (size<0>(typename Gemm::GemmKernel::CollectiveMainloop::AtomThrShapeMNK{}) == 2 &&
(options.cluster_shape.x < 2 || options.cluster_shape_fallback.x < 2)) {
std::cout << "Error: MMA2SMConfig kernel config needs cluster_dim.x >= 2" << std::endl;
}
hw_info.cluster_shape = options.cluster_shape;
hw_info.cluster_shape_fallback = options.cluster_shape_fallback;
}
typename Gemm::Arguments arguments;
decltype(arguments.epilogue.thread) fusion_args;
fusion_args.alpha_ptr = nullptr;
fusion_args.beta_ptr = nullptr;
// If alpha/beta are provided via command-line arguments, a single scalar value applies to all groups.
// Otherwise, per-group alpha/beta values are supplied through device pointer arrays, so they can differ between groups.
if (options.alpha != FLT_MAX){
// Single alpha for all groups
fusion_args.alpha = options.alpha;
fusion_args.alpha_ptr_array = nullptr;
fusion_args.dAlpha = {_0{}, _0{}, 0};
}
else {
fusion_args.alpha = 0;
fusion_args.alpha_ptr_array = alpha_device.get();
// One alpha per group
fusion_args.dAlpha = {_0{}, _0{}, 1};
}
if (options.beta != FLT_MAX) {
// Single beta for all groups
fusion_args.beta = options.beta;
fusion_args.beta_ptr_array = nullptr;
fusion_args.dBeta = {_0{}, _0{}, 0};
}
else {
fusion_args.beta = 0;
fusion_args.beta_ptr_array = beta_device.get();
// One beta per group
fusion_args.dBeta = {_0{}, _0{}, 1};
}
// Output Block SF
// fusion_args.block_scale_factor_ptr = ptr_SFD.get(); // Enable for SF Output
// fusion_args.norm_constant_ptr = norm_constant_device.get(); // Enable for SF Output
typename Gemm::GemmKernel::TileSchedulerArguments scheduler;
scheduler.raster_order = options.raster_order;
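// If host-side problem shapes are unavailable, nullptr is passed instead and the kernel relies
// solely on the device-side problem-size array (the path exercised by main() below).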
if (host_problem_shapes_available) {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), options.problem_sizes_host.data()},
{ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(),
ptr_SFA.get(), layout_SFA.get(), ptr_SFB.get(), layout_SFB.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info, scheduler
};
}
else {
arguments = typename Gemm::Arguments {
cutlass::gemm::GemmUniversalMode::kGrouped,
{options.groups, problem_sizes.get(), nullptr},
{ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get(),
ptr_SFA.get(), layout_SFA.get(), ptr_SFB.get(), layout_SFB.get()},
{fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
hw_info, scheduler
};
}
return arguments;
}
bool verify(const Options &options) {
using namespace cute;
bool passed = true;
for (int32_t i = 0; i < options.groups; ++i) {
auto problem = options.problem_sizes_host.at(i);
auto M = get<0>(problem);
auto N = get<1>(problem);
auto K = get<2>(problem);
auto stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1});
auto stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1});
auto stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1});
auto stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1});
auto layout_A = make_layout(make_shape(M, K, 1), stride_A);
auto layout_B = make_layout(make_shape(N, K, 1), stride_B);
auto layout_C = make_layout(make_shape(M, N, 1), stride_C);
auto layout_D = make_layout(make_shape(M, N, 1), stride_D);
auto layout_SFA = Sm100BlkScaledConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, 1));
auto layout_SFB = Sm100BlkScaledConfig::tile_atom_to_shape_SFB(cute::make_shape(M, N, K, 1));
auto layout_SFD = Sm100BlockScaledOutputConfig::tile_atom_to_shape_SFD(cute::make_shape(M, N, K, 1));
// Create the arguments for host reference implementation
Tensor tensor_A = make_tensor(make_iterator(block_A.at(i).host_data()), layout_A);
Tensor tensor_SFA = make_tensor(block_SFA.at(i).host_data(), layout_SFA);
Tensor tensor_B = make_tensor(make_iterator(block_B.at(i).host_data()), layout_B);
Tensor tensor_SFB = make_tensor(block_SFB.at(i).host_data(), layout_SFB);
cutlass::reference::host::GettBlockScalingMainloopParams<ElementAccumulator,
decltype(tensor_A),
decltype(tensor_SFA),
decltype(tensor_B),
decltype(tensor_SFB)
>
mainloop_params{tensor_A, tensor_SFA, tensor_B, tensor_SFB};
auto tensor_C = cute::make_tensor(make_iterator(block_C.at(i).host_data()), layout_C);
auto tensor_ref_D = cute::make_tensor(make_iterator(block_ref_D.at(i).host_data()), layout_D);
cutlass::reference::host::GettEpilogueParams<
float, float,
ElementAccumulator, ElementAccumulator,
decltype(tensor_C), decltype(tensor_ref_D)
> epilogue_params{};
epilogue_params.C = tensor_C;
epilogue_params.D = tensor_ref_D;
epilogue_params.alpha = alpha_host.at(i);
epilogue_params.beta = beta_host.at(i);
cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
block_D.at(i).sync_host();
// Check if output from CUTLASS kernel and reference kernel are equal or not
passed &= cutlass::reference::host::TensorEquals(block_ref_D.at(i).host_view(), block_D.at(i).host_view());
}
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options, bool host_problem_shapes_available = true)
{
std::cout << " Problem Sizes, Alpha, Beta " << std::endl;
for (int32_t i = 0; i < options.groups; ++i) {
std::cout << " " << options.problem_sizes_host.at(i);
std::cout << ", " << alpha_host.at(i) << ", " << beta_host.at(i) << std::endl;
}
std::cout << " Groups : " << options.groups << std::endl;
// Instantiate CUTLASS kernel depending on templates
Gemm gemm;
// Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
auto arguments = args_from_options<Gemm>(options, host_problem_shapes_available);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Gemm::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(gemm.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(gemm.run());
cudaDeviceSynchronize();
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
if (options.verification) {
std::cout << " Host-side verification is now running - may be very slow for large cases." << std::endl;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
}
else {
std::cout << " Verfication is turned off for this run." << std::endl;
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
CUTLASS_CHECK(gemm.run());
}
timer.stop();
// Compute average setup and runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0, options.problem_sizes_host);
std::cout << " Avg runtime : " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS : " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with CUDA 12.8 Toolkit to run this example
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer.\n";
// Returning zero so this test passes on older Toolkits; its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && props.minor == 0)) {
std::cerr
<< "This example requires a GPU of NVIDIA's Blackwell Architecture (compute capability 100a).\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
allocate(options);
initialize(options);
//
// Evaluate CUTLASS kernels
//
std::cout << "Running kernel with 1SM MMA config:" << std::endl;
run<Gemm1SM>(options, false /*host_problem_shapes_available*/);
std::cout << "Running kernel with 2SM MMA config:" << std::endl;
run<Gemm2SM>(options, false /*host_problem_shapes_available*/);
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

@ -0,0 +1,88 @@
# Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Note that most tests below set --iterations=0 to disable performance benchmarking, so only
# the correctness check is run; the *_PERF configurations keep a small number of profiling iterations.
set(TEST_RANDOM --iterations=0) # Random problem sizes
set(TEST_RANDOM_LARGE_GROUP --groups=50 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE --alpha=0.5 --beta=0.5 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE_LARGE_GROUP --alpha=1.5 --beta=2.0 --groups=50 --iterations=0) # Random problem sizes
set(TEST_EPILOGUE_OP --beta=0.5 --iterations=1) # Random problem sizes
set(TEST_EPILOGUE_OP_LARGE_GROUP --alpha=1.5 --iterations=1) # Random problem sizes
set(TEST_FIXED --m=2048 --n=5120 --k=8192 --iterations=0) # Fixed problem sizes
set(TEST_FIXED_LARGE_GROUP --m=2048 --n=512 --k=512 --groups=51 --iterations=0) # Fixed problem sizes
set(TEST_SMALL --m=256 --n=128 --iterations=0) # Small problem sizes
set(TEST_SMALL_LARGE_GROUP --m=128 --n=128 --groups=50 --iterations=0) # Small problem sizes
set(TEST_RANDOM_PERF --iterations=10) # Random problem sizes
set(TEST_RANDOM_PERF_LARGE_GROUP --groups=50 --iterations=10) # Random problem sizes
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
75_blackwell_grouped_gemm
75_blackwell_grouped_gemm.cu
TEST_COMMAND_OPTIONS
TEST_RANDOM
TEST_RANDOM_LARGE_GROUP
TEST_EPILOGUE
TEST_EPILOGUE_LARGE_GROUP
TEST_EPILOGUE_OP
TEST_EPILOGUE_OP_LARGE_GROUP
TEST_FIXED
TEST_FIXED_LARGE_GROUP
TEST_SMALL
TEST_SMALL_LARGE_GROUP
TEST_RANDOM_PERF
TEST_RANDOM_PERF_LARGE_GROUP
)
cutlass_example_add_executable(
75_blackwell_grouped_gemm_block_scaled
75_blackwell_grouped_gemm_block_scaled.cu
TEST_COMMAND_OPTIONS
TEST_RANDOM
TEST_RANDOM_LARGE_GROUP
TEST_EPILOGUE
TEST_EPILOGUE_LARGE_GROUP
TEST_EPILOGUE_OP
TEST_EPILOGUE_OP_LARGE_GROUP
TEST_FIXED
TEST_FIXED_LARGE_GROUP
TEST_SMALL
TEST_SMALL_LARGE_GROUP
TEST_RANDOM_PERF
TEST_RANDOM_PERF_LARGE_GROUP
)
endif()

@ -0,0 +1,534 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Simple dgrad convolution example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
This example demonstrates a simple way to instantiate and run a dgrad convolution kernel using the new CUTLASS 3.0
APIs on the NVIDIA Blackwell SM100 architecture.
The basic computation of the dgrad convolution kernel, taking 3D convolution as an example, is:
Xformed Activation (NZPQK) * Weight/Filter (KTRSC) = Activation (NDHWC)
where, from a GEMM perspective,
Matrix A = Xformed Activation, Matrix B = Weight/Filter, Matrix C = Activation
This example instantiates a simple dgrad kernel using a TMA + UMMA + Warp Specialized design with fp16 input and output types.
Alpha/beta scaling is supported, while fusions such as relu/bias/per-channel scaling are not supported in this example.
Usage:
$ ./examples/76_blackwell_conv/76_blackwell_conv_dgrad --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0
--pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/kernel_hardware_info.hpp"
#include "cutlass/conv/convolution.h"
#include "cutlass/conv/convnd_problem_shape.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/conv/dispatch_policy.hpp"
#include "cutlass/conv/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/conv/device/conv_universal_adapter.hpp"
#include "cutlass/conv/kernel/conv_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/convolution.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// Activation matrix configuration
using ElementAct = half_t; // Element type for activation matrix
constexpr int AlignmentAct = 128 / cutlass::sizeof_bits<ElementAct>::value; // Memory access granularity/alignment of activation matrix in units of elements (up to 16 bytes)
// Weight/Filter matrix configuration
using ElementFlt = half_t; // Element type for weight/filter matrix operand
constexpr int AlignmentFlt = 128 / cutlass::sizeof_bits<ElementFlt>::value; // Memory access granularity/alignment of weight/filter matrix in units of elements (up to 16 bytes)
// Xformed activation matrix configuration
using ElementXformedAct = half_t; // Element type for xformed activation matrix operand
constexpr int AlignmentXformedAct = 128 / cutlass::sizeof_bits<ElementXformedAct>::value; // Memory access granularity/alignment of xformed activation matrix in units of elements (up to 16 bytes)
// Layout of matrix A/B/C from the GEMM perspective.
using LayoutA = cutlass::layout::TensorNDHWC;
using LayoutB = cutlass::layout::TensorNDHWC;
using LayoutC = cutlass::layout::TensorNDHWC;
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ElementCompute = float; // Element type for internal computation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
constexpr cutlass::conv::Operator ConvOp = cutlass::conv::Operator::kDgrad; // Convolution operation
// Kernel Perf config
using TileShape = Shape<_128,_128,Shape<_64>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
// Build the epilogue
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute,
ElementAct, LayoutC, AlignmentAct,
ElementAct, LayoutC, AlignmentAct,
cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;
// Build the mainloop
using CollectiveMainloop = typename cutlass::conv::collective::CollectiveBuilder<
ArchTag, OperatorClass, ConvOp,
ElementXformedAct, LayoutA, AlignmentXformedAct,
ElementFlt, LayoutB, AlignmentFlt,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::conv::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::conv::collective::KernelScheduleAuto
>::CollectiveOp;
// Compose into a kernel
using ProblemShape=cutlass::conv::ConvProblemShape<ConvOp, CollectiveMainloop::DispatchPolicy::NumSpatialDimensions>;
using ConvKernel = cutlass::conv::kernel::ConvUniversal<
ProblemShape,
CollectiveMainloop,
CollectiveEpilogue
>;
using Conv = cutlass::conv::device::ConvUniversalAdapter<ConvKernel>;
using StrideC = typename Conv::ConvKernel::StrideC;
using StrideD = typename Conv::ConvKernel::StrideD;
//
// Data members
//
/// Initialization
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<ElementXformedAct> block_A;
cutlass::DeviceAllocation<ElementFlt> block_B;
cutlass::DeviceAllocation<ElementAct> block_C;
cutlass::DeviceAllocation<ElementAct> block_D;
cutlass::DeviceAllocation<ElementAct> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int n, d, h, w, c, k, t, r, s, z, p, q;
int pad_d, pad_h, pad_w;
int stride_d, stride_h, stride_w;
int dilation_d, dilation_h, dilation_w;
Options():
help(false),
n(4), d(1), h(8), w(8), c(64), k(64), t(1), r(3), s(3),
pad_d(0), pad_h(1), pad_w(1),
stride_d(1), stride_h(1), stride_w(1),
dilation_d(1), dilation_h(1), dilation_w(1),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("d", d);
cmd.get_cmd_line_argument("h", h);
cmd.get_cmd_line_argument("w", w);
cmd.get_cmd_line_argument("c", c);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("t", t);
cmd.get_cmd_line_argument("r", r);
cmd.get_cmd_line_argument("s", s);
cmd.get_cmd_line_argument("pad_d", pad_d);
cmd.get_cmd_line_argument("pad_h", pad_h);
cmd.get_cmd_line_argument("pad_w", pad_w);
cmd.get_cmd_line_argument("stride_d", stride_d);
cmd.get_cmd_line_argument("stride_h", stride_h);
cmd.get_cmd_line_argument("stride_w", stride_w);
cmd.get_cmd_line_argument("dilation_d", dilation_d);
cmd.get_cmd_line_argument("dilation_h", dilation_h);
cmd.get_cmd_line_argument("dilation_w", dilation_w);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
// Calculate z,p,q based on inputs.
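// Output extent = 1 + (input + 2*pad - ((filter - 1)*dilation + 1)) / stride.
// For example, the defaults h=8, pad_h=1, r=3, dilation_h=1, stride_h=1 give p = 1 + (8 + 2 - 3) / 1 = 8.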
z = 1 + (d + 2 * pad_d - ((t - 1) * dilation_d + 1)) / stride_d;
p = 1 + (h + 2 * pad_h - ((r - 1) * dilation_h + 1)) / stride_h;
q = 1 + (w + 2 * pad_w - ((s - 1) * dilation_w + 1)) / stride_w;
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "76_blackwell_conv_dgrad\n\n"
<< " Blackwell FP16 dgrad convolution using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --n=<int> Sets the batch size of the Activation\n"
<< " --d=<int> Sets the depth size of the Activation\n"
<< " --h=<int> Sets the height of the Activation\n"
<< " --w=<int> Sets the width of the Activation\n"
<< " --c=<int> Sets the channel size of the Activation\n"
<< " --k=<int> Sets the image numbers of the Filter\n"
<< " --t=<int> Sets the depth size of the Filter\n"
<< " --r=<int> Sets the height of the Filter\n"
<< " --s=<int> Sets the width of the Filter\n"
<< " --pad_d=<int> Sets the padding size in depth\n"
<< " --pad_h=<int> Sets the padding size in height\n"
<< " --pad_w=<int> Sets the padding size in width\n"
<< " --stride_d=<int> Sets the traversal stride size in depth\n"
<< " --stride_h=<int> Sets the traversal stride size in height\n"
<< " --stride_w=<int> Sets the traversal stride size in width\n"
<< " --dialtion_d=<int> Sets the filter dilation size in depth\n"
<< " --dialtion_h=<int> Sets the filter dilation size in height\n"
<< " --dialtion_w=<int> Sets the filter dilation size in width\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "76_blackwell_conv_dgrad" << " --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0"
<< " --pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * (n * d * h * w) * c * (t * r * s * k);
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the Conv and reference Conv
void initialize(const Options &options) {
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
// Setup stride_C/D
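// Copy the C/D strides from ConvProblemShape into the kernel's cute stride tuples
// (note the reversed indexing, RankT-2-i, between the two representations).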
cute::for_each(cute::make_seq<cute::rank<0>(StrideC{})>{}, [&](auto i) {
cute::get<0, i>(stride_C) = problem_shape.stride_C[ProblemShape::RankT-2-i];
});
cute::for_each(cute::make_seq<cute::rank<0>(StrideD{})>{}, [&](auto i) {
cute::get<0, i>(stride_D) = problem_shape.stride_C[ProblemShape::RankT-2-i];
});
block_A.reset(problem_shape.size_A());
block_B.reset(problem_shape.size_B());
block_C.reset(problem_shape.size_C());
block_D.reset(problem_shape.size_C());
block_ref_D.reset(problem_shape.size_C());
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
/// Populates a Gemm::Arguments structure from the given commandline options
typename Conv::Arguments args_from_options(const Options &options)
{
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
typename Conv::Arguments arguments{
problem_shape,
{block_A.get(), block_B.get()},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
return arguments;
}
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), LayoutA::packed({options.n, options.z, options.p, options.q, options.k}));
cutlass::TensorRef ref_B(block_B.get(), LayoutB::packed({options.k, options.t, options.r, options.s, options.c}));
cutlass::TensorRef ref_C(block_C.get(), LayoutC::packed({options.n, options.d, options.h, options.w, options.c}));
cutlass::TensorRef ref_D(block_ref_D.get(), LayoutC::packed({options.n, options.d, options.h, options.w, options.c}));
//
// Compute reference output
//
// Construct Conv3dProblemSize with user defined inputs.
cutlass::conv::Conv3dProblemSize problem_size(
cutlass::Tensor5DCoord(options.n, options.d, options.h, options.w, options.c), // ndhwc
cutlass::Tensor5DCoord(options.k, options.t, options.r, options.s, options.c), // ktrsc
cutlass::make_Coord(options.pad_d, options.pad_h, options.pad_w), // padding
cutlass::make_Coord(options.stride_d, options.stride_h, options.stride_w), // stride (stride_d, stride_h, stride_w)
cutlass::make_Coord(options.dilation_d, options.dilation_h, options.dilation_w), // dilation (dilation_d, dilation_h, dilation_w)
cutlass::Tensor5DCoord(options.n, options.z, options.p, options.q, options.k) // nzpqk
);
// Launch device reference conv kernel
cutlass::reference::device::Conv3dDgrad(problem_size, ref_A, ref_B, ref_C, ref_D, options.alpha, options.beta);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Execute a given example GEMM computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Conv conv;
// Create a structure of conv kernel arguments suitable for invoking an instance of Conv
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Conv::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(conv.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(conv.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
CUTLASS_CHECK(conv.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size:" << std::endl;
std::cout << " Activation(n,d,h,w,c) = (" << options.n << ',' << options.d << ',' << options.h << ',' << options.w << ',' << options.c << "), ";
std::cout << " Filter(k,t,r,s,c) = (" << options.k << ',' << options.t << ',' << options.r << ',' << options.s << ',' << options.c << "), ";
std::cout << " Xformed Activation(n,z,p,q,k) = (" << options.n << ',' << options.z << ',' << options.p << ',' << options.q << ',' << options.k << ")" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit to run this example,
// and the device must have compute capability 100 or 101.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits; its actions are a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (!(props.major == 10 && (props.minor == 0 || props.minor == 1))) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100 or 101)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Conv>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

@ -0,0 +1,534 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Simple fprop convolution example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
This example demonstrates a simple way to instantiate and run a fprop convolution kernel using the new CUTLASS 3.0
APIs on the NVIDIA Blackwell SM100 architecture.
The basic computation of the fprop convolution kernel, taking 3D convolution as an example, is:
Activation (NDHWC) * Weight/Filter (KTRSC) = Xformed Activation (NZPQK)
where, from a GEMM perspective,
Matrix A = Activation, Matrix B = Weight/Filter, Matrix C = Xformed Activation
This example instantiates a simple fprop kernel using a TMA + UMMA + Warp Specialized design with fp16 input and output types.
Alpha/beta scaling is supported, while fusions such as relu/bias/per-channel scaling are not supported in this example.
Usage:
$ ./examples/76_blackwell_conv/76_blackwell_conv_fprop --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0
--pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1
*/
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/kernel_hardware_info.hpp"
#include "cutlass/conv/convolution.h"
#include "cutlass/conv/convnd_problem_shape.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/conv/dispatch_policy.hpp"
#include "cutlass/conv/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/conv/device/conv_universal_adapter.hpp"
#include "cutlass/conv/kernel/conv_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/convolution.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// Activation matrix configuration
using ElementAct = half_t; // Element type for activation matrix
constexpr int AlignmentAct = 128 / cutlass::sizeof_bits<ElementAct>::value; // Memory access granularity/alignment of activation matrix in units of elements (up to 16 bytes)
// Weight/Filter matrix configuration
using ElementFlt = half_t; // Element type for weight/filter matrix operand
constexpr int AlignmentFlt = 128 / cutlass::sizeof_bits<ElementFlt>::value; // Memory access granularity/alignment of weight/filter matrix in units of elements (up to 16 bytes)
// Xformed activation matrix configuration
using ElementXformedAct = half_t; // Element type for xformed activation matrix operand
constexpr int AlignmentXformedAct = 128 / cutlass::sizeof_bits<ElementXformedAct>::value; // Memory access granularity/alignment of xformed activation matrix in units of elements (up to 16 bytes)
// Layout of matrix A/B/C from the GEMM perspective.
using LayoutA = cutlass::layout::TensorNDHWC;
using LayoutB = cutlass::layout::TensorNDHWC;
using LayoutC = cutlass::layout::TensorNDHWC;
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ElementCompute = float; // Element type for internal computation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
constexpr cutlass::conv::Operator ConvOp = cutlass::conv::Operator::kFprop; // Convolution operation
// Kernel Perf config
using TileShape = Shape<_128,_128,Shape<_64>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
// Build the epilogue
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute,
ElementXformedAct, LayoutC, AlignmentXformedAct,
ElementXformedAct, LayoutC, AlignmentXformedAct,
cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;
// Build the mainloop
using CollectiveMainloop = typename cutlass::conv::collective::CollectiveBuilder<
ArchTag, OperatorClass, ConvOp,
ElementAct, LayoutA, AlignmentAct,
ElementFlt, LayoutB, AlignmentFlt,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::conv::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::conv::collective::KernelScheduleAuto
>::CollectiveOp;
// Compose into a kernel
using ProblemShape=cutlass::conv::ConvProblemShape<ConvOp, CollectiveMainloop::DispatchPolicy::NumSpatialDimensions>;
using ConvKernel = cutlass::conv::kernel::ConvUniversal<
ProblemShape,
CollectiveMainloop,
CollectiveEpilogue
>;
using Conv = cutlass::conv::device::ConvUniversalAdapter<ConvKernel>;
using StrideC = typename Conv::ConvKernel::StrideC;
using StrideD = typename Conv::ConvKernel::StrideD;
//
// Data members
//
/// Initialization
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<ElementAct> block_A;
cutlass::DeviceAllocation<ElementFlt> block_B;
cutlass::DeviceAllocation<ElementXformedAct> block_C;
cutlass::DeviceAllocation<ElementXformedAct> block_D;
cutlass::DeviceAllocation<ElementXformedAct> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int n, d, h, w, c, k, t, r, s, z, p, q;
int pad_d, pad_h, pad_w;
int stride_d, stride_h, stride_w;
int dilation_d, dilation_h, dilation_w;
Options():
help(false),
n(4), d(1), h(8), w(8), c(64), k(64), t(1), r(3), s(3),
pad_d(0), pad_h(1), pad_w(1),
stride_d(1), stride_h(1), stride_w(1),
dilation_d(1), dilation_h(1), dilation_w(1),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("d", d);
cmd.get_cmd_line_argument("h", h);
cmd.get_cmd_line_argument("w", w);
cmd.get_cmd_line_argument("c", c);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("t", t);
cmd.get_cmd_line_argument("r", r);
cmd.get_cmd_line_argument("s", s);
cmd.get_cmd_line_argument("pad_d", pad_d);
cmd.get_cmd_line_argument("pad_h", pad_h);
cmd.get_cmd_line_argument("pad_w", pad_w);
cmd.get_cmd_line_argument("stride_d", stride_d);
cmd.get_cmd_line_argument("stride_h", stride_h);
cmd.get_cmd_line_argument("stride_w", stride_w);
cmd.get_cmd_line_argument("dilation_d", dilation_d);
cmd.get_cmd_line_argument("dilation_h", dilation_h);
cmd.get_cmd_line_argument("dilation_w", dilation_w);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
// Calculate z,p,q based on inputs.
z = 1 + (d + 2 * pad_d - ((t - 1) * dilation_d + 1)) / stride_d;
p = 1 + (h + 2 * pad_h - ((r - 1) * dilation_h + 1)) / stride_h;
q = 1 + (w + 2 * pad_w - ((s - 1) * dilation_w + 1)) / stride_w;
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "76_blackwell_conv_fprop\n\n"
<< " Blackwell FP16 fprop convolution using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --n=<int> Sets the batch size of the Activation\n"
<< " --d=<int> Sets the depth size of the Activation\n"
<< " --h=<int> Sets the height of the Activation\n"
<< " --w=<int> Sets the width of the Activation\n"
<< " --c=<int> Sets the channel size of the Activation\n"
<< " --k=<int> Sets the image numbers of the Filter\n"
<< " --t=<int> Sets the depth size of the Filter\n"
<< " --r=<int> Sets the height of the Filter\n"
<< " --s=<int> Sets the width of the Filter\n"
<< " --pad_d=<int> Sets the padding size in depth\n"
<< " --pad_h=<int> Sets the padding size in height\n"
<< " --pad_w=<int> Sets the padding size in width\n"
<< " --stride_d=<int> Sets the traversal stride size in depth\n"
<< " --stride_h=<int> Sets the traversal stride size in height\n"
<< " --stride_w=<int> Sets the traversal stride size in width\n"
<< " --dialtion_d=<int> Sets the filter dilation size in depth\n"
<< " --dialtion_h=<int> Sets the filter dilation size in height\n"
<< " --dialtion_w=<int> Sets the filter dilation size in width\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "76_blackwell_conv_fprop" << " --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0"
<< " --pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * (n * z * p * q) * k * (t * r * s * c);
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
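// Note on gflops() above (illustrative worked example): with the default problem size
// (n=4, z=1, p=q=8, k=64, t=1, r=s=3, c=64),
// flop = 2 * (4*1*8*8) * 64 * (1*3*3*64) = 18,874,368, i.e. roughly 0.019 GFLOP per pass.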
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the Conv and reference Conv
void initialize(const Options &options) {
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
// Setup stride_C/D
cute::for_each(cute::make_seq<cute::rank<0>(StrideC{})>{}, [&](auto i) {
cute::get<0, i>(stride_C) = problem_shape.stride_C[ProblemShape::RankT-2-i];
});
cute::for_each(cute::make_seq<cute::rank<0>(StrideD{})>{}, [&](auto i) {
cute::get<0, i>(stride_D) = problem_shape.stride_C[ProblemShape::RankT-2-i];
});
block_A.reset(problem_shape.size_A());
block_B.reset(problem_shape.size_B());
block_C.reset(problem_shape.size_C());
block_D.reset(problem_shape.size_C());
block_ref_D.reset(problem_shape.size_C());
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
/// Populates a Conv::Arguments structure from the given command-line options
typename Conv::Arguments args_from_options(const Options &options)
{
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
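// The mainloop operands (activation A, filter B) and the epilogue arguments
// ({alpha, beta}, source C and output D with their strides) are passed below as
// two separate brace-initialized groups.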
typename Conv::Arguments arguments{
problem_shape,
{block_A.get(), block_B.get()},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
return arguments;
}
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), LayoutA::packed({options.n, options.d, options.h, options.w, options.c}));
cutlass::TensorRef ref_B(block_B.get(), LayoutB::packed({options.k, options.t, options.r, options.s, options.c}));
cutlass::TensorRef ref_C(block_C.get(), LayoutC::packed({options.n, options.z, options.p, options.q, options.k}));
cutlass::TensorRef ref_D(block_ref_D.get(), LayoutC::packed({options.n, options.z, options.p, options.q, options.k}));
//
// Compute reference output
//
// Construct Conv3dProblemSize with user defined inputs.
cutlass::conv::Conv3dProblemSize problem_size(
cutlass::Tensor5DCoord(options.n, options.d, options.h, options.w, options.c), // ndhwc
cutlass::Tensor5DCoord(options.k, options.t, options.r, options.s, options.c), // ktrsc
cutlass::make_Coord(options.pad_d, options.pad_h, options.pad_w), // padding
cutlass::make_Coord(options.stride_d, options.stride_h, options.stride_w), // stride (stride_d, stride_h, stride_w)
cutlass::make_Coord(options.dilation_d, options.dilation_h, options.dilation_w), // dilation (dilation_d, dilation_h, dilation_w)
cutlass::Tensor5DCoord(options.n, options.z, options.p, options.q, options.k) // nzpqk
);
// Launch device reference conv kernel
cutlass::reference::device::Conv3dFprop(problem_size, ref_A, ref_B, ref_C, ref_D, options.alpha, options.beta);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Execute a given example Conv computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Conv conv;
// Create a structure of conv kernel arguments suitable for invoking an instance of Conv
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Conv::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(conv.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(conv.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
CUTLASS_CHECK(conv.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size:" << std::endl;
std::cout << " Activation(n,d,h,w,c) = (" << options.n << ',' << options.d << ',' << options.h << ',' << options.w << ',' << options.c << "), ";
std::cout << " Filter(k,t,r,s,c) = (" << options.k << ',' << options.t << ',' << options.r << ',' << options.s << ',' << options.c << "), ";
std::cout << " Xformed Activation(n,z,p,q,k) = (" << options.n << ',' << options.z << ',' << options.p << ',' << options.q << ',' << options.k << ")" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit to run this example
// and the device must have compute capability 100 or 101.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where this example is a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || (props.minor != 0 && props.minor != 1)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100 or 101)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Conv>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -0,0 +1,530 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Simple wgrad convolution example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
This example demonstrates a simple way to instantiate and run a wgrad convolution kernel using the new CUTLASS 3.0
APIs on the NVIDIA Blackwell SM100 architecture.
The basic computation of the wgrad convolution kernel, taking 3D convolution as an example, is:
Xformed Activation (NZPQK) * Activation (NDHWC) = Weight/Filter (KTRSC)
where, from a GEMM perspective,
Matrix A = Xformed Activation, Matrix B = Activation, Matrix C = Weight/Filter
This example instantiates a simple wgrad kernel using a TMA + UMMA + Warp Specialized design with fp16 input and output types.
Alpha/beta scaling is supported, while fusions such as relu/bias/per-channel scaling are not supported in this example.
Usage:
$ ./examples/76_blackwell_conv/76_blackwell_conv_wgrad --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0
--pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1
*/
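// Editorial sketch (not part of the original example): the implicit-GEMM mapping implied by
// the comment above is C[M,N] = A[M,K'] * B[K',N] with M = K (filters), N = T*R*S*C and
// K' = N*Z*P*Q, where A is the xformed activation and B is the im2col view of the activation.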
#include <iostream>
#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cutlass/kernel_hardware_info.hpp"
#include "cutlass/conv/convolution.h"
#include "cutlass/conv/convnd_problem_shape.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/conv/dispatch_policy.hpp"
#include "cutlass/conv/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/conv/device/conv_universal_adapter.hpp"
#include "cutlass/conv/kernel/conv_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/convolution.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "helper.h"
using namespace cute;
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv kernel configurations
/////////////////////////////////////////////////////////////////////////////////////////////////
// Activation matrix configuration
using ElementAct = half_t; // Element type for activation matrix
constexpr int AlignmentAct = 128 / cutlass::sizeof_bits<ElementAct>::value; // Memory access granularity/alignment of activation matrix in units of elements (up to 16 bytes)
// Weight/Filter matrix configuration
using ElementFlt = half_t; // Element type for weight/filter matrix operand
constexpr int AlignmentFlt = 128 / cutlass::sizeof_bits<ElementFlt>::value; // Memory access granularity/alignment of weight/filter matrix in units of elements (up to 16 bytes)
// Xformed activation matrix configuration
using ElementXformedAct = half_t; // Element type for xformed activation matrix operand
constexpr int AlignmentXformedAct = 128 / cutlass::sizeof_bits<ElementXformedAct>::value; // Memory access granularity/alignment of xformed activation matrix in units of elements (up to 16 bytes)
// Layout of matrix A/B/C from the GEMM perspective.
using LayoutA = cutlass::layout::TensorNDHWC;
using LayoutB = cutlass::layout::TensorNDHWC;
using LayoutC = cutlass::layout::TensorKCSRT;
// Kernel functional config
using ElementAccumulator = float; // Element type for internal accumulation
using ElementCompute = float; // Element type for internal computation
using ArchTag = cutlass::arch::Sm100; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
constexpr cutlass::conv::Operator ConvOp = cutlass::conv::Operator::kWgrad; // Convolution operation
// Kernel Perf config
using TileShape = Shape<_128,Shape<_128>,Shape<_64>>; // Threadblock-level tile size
using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster
// Build the epilogue
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass,
TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute,
ElementFlt, LayoutC, AlignmentFlt,
ElementFlt, LayoutC, AlignmentFlt,
cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;
// Build the mainloop
using CollectiveMainloop = typename cutlass::conv::collective::CollectiveBuilder<
ArchTag, OperatorClass, ConvOp,
ElementXformedAct, LayoutA, AlignmentXformedAct,
ElementAct, LayoutB, AlignmentAct,
ElementAccumulator,
TileShape, ClusterShape,
cutlass::conv::collective::StageCountAutoCarveout<static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
cutlass::conv::collective::KernelScheduleAuto
>::CollectiveOp;
// Compose into a kernel
using ProblemShape=cutlass::conv::ConvProblemShape<ConvOp, CollectiveMainloop::DispatchPolicy::NumSpatialDimensions>;
using ConvKernel = cutlass::conv::kernel::ConvUniversal<
ProblemShape,
CollectiveMainloop,
CollectiveEpilogue
>;
using Conv = cutlass::conv::device::ConvUniversalAdapter<ConvKernel>;
using StrideC = typename Conv::ConvKernel::StrideC;
using StrideD = typename Conv::ConvKernel::StrideD;
//
// Data members
//
/// Initialization
StrideC stride_C;
StrideD stride_D;
uint64_t seed;
cutlass::DeviceAllocation<ElementXformedAct> block_A;
cutlass::DeviceAllocation<ElementAct> block_B;
cutlass::DeviceAllocation<ElementFlt> block_C;
cutlass::DeviceAllocation<ElementFlt> block_D;
cutlass::DeviceAllocation<ElementFlt> block_ref_D;
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Testbed utility types
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
float alpha, beta;
int iterations;
int n, d, h, w, c, k, t, r, s, z, p, q;
int pad_d, pad_h, pad_w;
int stride_d, stride_h, stride_w;
int dilation_d, dilation_h, dilation_w;
Options():
help(false),
n(4), d(1), h(8), w(8), c(64), k(64), t(1), r(3), s(3),
pad_d(0), pad_h(1), pad_w(1),
stride_d(1), stride_h(1), stride_w(1),
dilation_d(1), dilation_h(1), dilation_w(1),
alpha(1.f), beta(0.f),
iterations(10)
{ }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("n", n);
cmd.get_cmd_line_argument("d", d);
cmd.get_cmd_line_argument("h", h);
cmd.get_cmd_line_argument("w", w);
cmd.get_cmd_line_argument("c", c);
cmd.get_cmd_line_argument("k", k);
cmd.get_cmd_line_argument("t", t);
cmd.get_cmd_line_argument("r", r);
cmd.get_cmd_line_argument("s", s);
cmd.get_cmd_line_argument("pad_d", pad_d);
cmd.get_cmd_line_argument("pad_h", pad_h);
cmd.get_cmd_line_argument("pad_w", pad_w);
cmd.get_cmd_line_argument("stride_d", stride_d);
cmd.get_cmd_line_argument("stride_h", stride_h);
cmd.get_cmd_line_argument("stride_w", stride_w);
cmd.get_cmd_line_argument("dilation_d", dilation_d);
cmd.get_cmd_line_argument("dilation_h", dilation_h);
cmd.get_cmd_line_argument("dilation_w", dilation_w);
cmd.get_cmd_line_argument("alpha", alpha, 1.f);
cmd.get_cmd_line_argument("beta", beta, 0.f);
cmd.get_cmd_line_argument("iterations", iterations);
// Calculate z,p,q based on inputs.
z = 1 + (d + 2 * pad_d - ((t - 1) * dilation_d + 1)) / stride_d;
p = 1 + (h + 2 * pad_h - ((r - 1) * dilation_h + 1)) / stride_h;
q = 1 + (w + 2 * pad_w - ((s - 1) * dilation_w + 1)) / stride_w;
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "76_blackwell_conv_wgrad\n\n"
<< " Blackwell FP16 wgrad convolution using a Warp Specialized kernel.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --n=<int> Sets the batch size of the Activation\n"
<< " --d=<int> Sets the depth size of the Activation\n"
<< " --h=<int> Sets the height of the Activation\n"
<< " --w=<int> Sets the width of the Activation\n"
<< " --c=<int> Sets the channel size of the Activation\n"
<< " --k=<int> Sets the image numbers of the Filter\n"
<< " --t=<int> Sets the depth size of the Filter\n"
<< " --r=<int> Sets the height of the Filter\n"
<< " --s=<int> Sets the width of the Filter\n"
<< " --pad_d=<int> Sets the padding size in depth\n"
<< " --pad_h=<int> Sets the padding size in height\n"
<< " --pad_w=<int> Sets the padding size in width\n"
<< " --stride_d=<int> Sets the traversal stride size in depth\n"
<< " --stride_h=<int> Sets the traversal stride size in height\n"
<< " --stride_w=<int> Sets the traversal stride size in width\n"
<< " --dialtion_d=<int> Sets the filter dilation size in depth\n"
<< " --dialtion_h=<int> Sets the filter dilation size in height\n"
<< " --dialtion_w=<int> Sets the filter dilation size in width\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --iterations=<int> Number of profiling iterations to perform.\n\n";
out
<< "\n\nExamples:\n\n"
<< "$ " << "76_blackwell_conv_wgrad" << " --n=4 --d=1 --h=8 --w=8 --c=64 --k=64 --t=1 --r=3 --s=3 --pad_d=0"
<< " --pad_h=1 --pad_w=1 --stride_d=1 --stride_h=1 --stride_w=1 --dilation_d=1 --dilation_h=1 --dilation_w=1 \n\n";
return out;
}
/// Compute performance in GFLOP/s
double gflops(double runtime_s) const
{
// Two flops per multiply-add
uint64_t flop = uint64_t(2) * k * (t * r * s * c) * (n * z * p * q);
double gflop = double(flop) / double(1.0e9);
return gflop / runtime_s;
}
};
/// Result structure
struct Result
{
double avg_runtime_ms;
double gflops;
cutlass::Status status;
cudaError_t error;
bool passed;
Result(
double avg_runtime_ms = 0,
double gflops = 0,
cutlass::Status status = cutlass::Status::kSuccess,
cudaError_t error = cudaSuccess)
:
avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false)
{}
};
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Conv setup and evaluation
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
bool initialize_block(
cutlass::DeviceAllocation<Element>& block,
uint64_t seed=2023) {
Element scope_max, scope_min;
int bits_input = cutlass::sizeof_bits<Element>::value;
if (bits_input == 1) {
scope_max = Element(2);
scope_min = Element(0);
} else if (bits_input <= 8) {
scope_max = Element(2);
scope_min = Element(-2);
} else {
scope_max = Element(8);
scope_min = Element(-8);
}
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, scope_max, scope_min, 0);
return true;
}
/// Initialize operands to be used in the Conv and reference Conv
void initialize(const Options &options) {
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
// Setup stride_C/D
stride_C = cutlass::make_cute_packed_stride(StrideC{}, problem_shape.shape_C, problem_shape.stride_C, ConvOp);
stride_D = cutlass::make_cute_packed_stride(StrideD{}, problem_shape.shape_C, problem_shape.stride_C, ConvOp);
block_A.reset(problem_shape.size_A());
block_B.reset(problem_shape.size_B());
block_C.reset(problem_shape.size_C());
block_D.reset(problem_shape.size_C());
block_ref_D.reset(problem_shape.size_C());
initialize_block(block_A, seed + 2023);
initialize_block(block_B, seed + 2022);
initialize_block(block_C, seed + 2021);
}
/// Populates a Conv::Arguments structure from the given command-line options
typename Conv::Arguments args_from_options(const Options &options)
{
// Construct ConvProblemShape
ProblemShape problem_shape(
cutlass::conv::Mode::kCrossCorrelation,
{options.n, options.d, options.h, options.w, options.c}, // ndhwc
{options.k, options.t, options.r, options.s, options.c}, // ktrsc
{options.pad_d, options.pad_h, options.pad_w}, // padding lower (pad_d, pad_h, pad_w)
{options.pad_d, options.pad_h, options.pad_w}, // padding upper (pad_d, pad_h, pad_w)
{options.stride_d, options.stride_h, options.stride_w}, // stride (stride_d, stride_h, stride_w)
{options.dilation_d, options.dilation_h, options.dilation_w}, // dilation (dilation_d, dilation_h, dilation_w)
1 // group
);
typename Conv::Arguments arguments{
problem_shape,
{block_A.get(), block_B.get()},
{{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
};
return arguments;
}
bool verify(const Options &options) {
cutlass::TensorRef ref_A(block_A.get(), LayoutA::packed({options.n, options.z, options.p, options.q, options.k}));
cutlass::TensorRef ref_B(block_B.get(), LayoutB::packed({options.n, options.d, options.h, options.w, options.c}));
cutlass::TensorRef ref_C(block_C.get(), LayoutA::packed({options.k, options.t, options.r, options.s, options.c}));
cutlass::TensorRef ref_D(block_ref_D.get(), LayoutA::packed({options.k, options.t, options.r, options.s, options.c}));
//
// Compute reference output
//
// Construct Conv3dProblemSize with user defined inputs.
cutlass::conv::Conv3dProblemSize problem_size(
cutlass::Tensor5DCoord(options.n, options.d, options.h, options.w, options.c), // ndhwc
cutlass::Tensor5DCoord(options.k, options.t, options.r, options.s, options.c), // ktrsc
cutlass::make_Coord(options.pad_d, options.pad_h, options.pad_w), // padding
cutlass::make_Coord(options.stride_d, options.stride_h, options.stride_w), // stride (stride_d, stride_h, stride_w)
cutlass::make_Coord(options.dilation_d, options.dilation_h, options.dilation_w), // dilation (dilation_d, dilation_h, dilation_w)
cutlass::Tensor5DCoord(options.n, options.z, options.p, options.q, options.k) // nzpqk
);
// Launch device reference conv kernel
cutlass::reference::device::Conv3dWgrad(problem_size, ref_A, ref_B, ref_C, ref_D, options.alpha, options.beta);
// Wait for kernel to finish
CUDA_CHECK(cudaDeviceSynchronize());
// Check if output from CUTLASS kernel and reference kernel are equal or not
bool passed = cutlass::reference::device::BlockCompareEqual(block_ref_D.get(), block_D.get(), block_D.size());
return passed;
}
/// Execute a given example Conv computation
template <typename Gemm>
int run(Options &options)
{
initialize(options);
// Instantiate CUTLASS kernel depending on templates
Conv conv;
// Create a structure of conv kernel arguments suitable for invoking an instance of Conv
auto arguments = args_from_options(options);
// Using the arguments, query for extra workspace required for matrix multiplication computation
size_t workspace_size = Conv::get_workspace_size(arguments);
// Allocate workspace memory
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
// Check if the problem size is supported or not
CUTLASS_CHECK(conv.can_implement(arguments));
// Initialize CUTLASS kernel with arguments and workspace pointer
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
// Correctness / Warmup iteration
CUTLASS_CHECK(conv.run());
// Check if output from CUTLASS kernel and reference kernel are equal or not
Result result;
result.passed = verify(options);
std::cout << " Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
if (!result.passed) {
exit(-1);
}
// Run profiling loop
if (options.iterations > 0)
{
GpuTimer timer;
timer.start();
for (int iter = 0; iter < options.iterations; ++iter) {
CUTLASS_CHECK(conv.initialize(arguments, workspace.get()));
CUTLASS_CHECK(conv.run());
}
timer.stop();
// Compute average runtime and GFLOPs.
float elapsed_ms = timer.elapsed_millis();
result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
std::cout << " Problem Size:" << std::endl;
std::cout << " Activation(n,d,h,w,c) = (" << options.n << ',' << options.d << ',' << options.h << ',' << options.w << ',' << options.c << "), ";
std::cout << " Filter(k,t,r,s,c) = (" << options.k << ',' << options.t << ',' << options.r << ',' << options.s << ',' << options.c << "), ";
std::cout << " Xformed Activation(n,z,p,q,k) = (" << options.n << ',' << options.z << ',' << options.p << ',' << options.q << ',' << options.k << ")" << std::endl;
std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
std::cout << " GFLOPS: " << result.gflops << std::endl;
}
return 0;
}
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
// CUTLASS must be compiled with the CUDA 12.8 Toolkit to run this example
// and the device must have compute capability 100 or 101.
if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 8)) {
std::cerr << "This example requires CUDA 12.8 or newer." << std::endl;
// Returning zero so this test passes on older Toolkits, where this example is a no-op.
return 0;
}
cudaDeviceProp props;
int current_device_id;
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
if (props.major != 10 || (props.minor != 0 && props.minor != 1)) {
std::cerr << "This example requires a GPU of NVIDIA's Blackwell architecture (compute capability 100 or 101)." << std::endl;
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
//
// Evaluate CUTLASS kernels
//
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
run<Conv>(options);
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

View File

@ -0,0 +1,46 @@
# Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
76_blackwell_conv_fprop
76_blackwell_conv_fprop.cu
)
cutlass_example_add_executable(
76_blackwell_conv_dgrad
76_blackwell_conv_dgrad.cu
)
cutlass_example_add_executable(
76_blackwell_conv_wgrad
76_blackwell_conv_wgrad.cu
)
endif()

View File

@ -0,0 +1,990 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Example implementation of fused multi-head attention for the NVIDIA Blackwell SM100
architecture using CUTLASS 3.
MQA/GQA
-------
The head dimension can be represented as a tuple, where the K/V stride in the
first dimension is zero. This has the effect of MQA or GQA.
* MHA is (head_size:head_stride).
* MQA is (head_size:head_stride) in Q and (head_size:_0) in K and V.
* GQA is (grouped_heads,heads_kv):(head_stride,grouped_heads*head_stride) in Q
and (grouped_heads,heads_kv):(0,head_stride) in K and V
Output Scale
------------
The output scale gets passed to the collective mainloop, and is applied
using FP32 compute pre-quantization
Variable Sequence Length
------------------------
For variable sequence length, pass in VariableLength objects
(max_seqlen, cumulative_seqlen_ptr) in the problem shape for
seqlen Q and KV.
Support
---------
Right now, e4m3 with fp32 compute using a 256x256 tiling and a head dimension
of 128 is supported.
Example usage:
$ ./examples/77_blackwell_fmha/77_blackwell_fmha_fp8 \
--b=2048 --h=2048 --d=2048 --q=2048 --k=2048
*/
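// Editorial illustration with hypothetical sizes (not the example defaults): for GQA with
// head_size=128, h=8 and h_k=2 (so grouped_heads=4), the Q heads mode is (4,2):(128,512)
// while the K/V heads mode is (4,2):(0,128), so all four grouped Q heads read the same K/V head.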
#define DSHOW(x) print(#x ": "); print(x); print("\n");
#define DSHOWT(x) print(#x ": "); print_tensor(x); print("\n");
#include <iostream>
#include <random>
#include <regex>
#include "cute/tensor.hpp"
#include "cutlass/cutlass.h"
#include "cutlass/kernel_hardware_info.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "reference/fmha_fwd_reference.hpp"
#include "reference/reference_abs_error.hpp"
#include "device/fmha.hpp"
#include "collective/fmha_fusion.hpp"
#include "collective/sm100_fmha_fwd_mainloop_tma_warpspecialized.hpp"
#include "collective/sm100_fmha_fwd_epilogue_tma_warpspecialized.hpp"
#include "kernel/fmha_options.hpp"
#include "kernel/fmha_tile_scheduler.hpp"
#include "kernel/sm100_fmha_fwd_kernel_tma_warpspecialized.hpp"
///////////////////////////////////////////////////////////////////////////////////////////////////
using namespace cute;
using namespace cutlass::fmha::kernel;
using namespace cutlass::fmha::collective;
using namespace cutlass::fmha;
///////////////////////////////////////////////////////////////////////////////////////////////////
enum class InitStyle {
kOne, kLinearStride128, kLinearStride1, kRandom, kNone
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Command line options parsing
struct Options {
bool help = false;
bool error = false;
int b = 1;
int h = 1;
int h_k = 1;
int q = 256;
int k = 256;
int d = 128;
int iterations = 3;
bool verify = false;
bool verbose = false;
bool causal = false;
bool residual = false;
bool varlen = false;
int sm_count = 0;
std::string kernel_filter;
InitStyle init_style_q = InitStyle::kRandom;
InitStyle init_style_k = InitStyle::kRandom;
InitStyle init_style_v = InitStyle::kRandom;
static void get_init_style_argument(cutlass::CommandLine& cmd, const char* name, InitStyle& dst, InitStyle const& src) {
std::string s;
cmd.get_cmd_line_argument(name, s, s);
if (s.empty()) {
dst = src;
}
else {
if (s == "r") {
dst = InitStyle::kRandom;
}
else if (s == "1") {
dst = InitStyle::kOne;
}
else if (s == "d") {
dst = InitStyle::kLinearStride1;
}
else if (s == "s") {
dst = InitStyle::kLinearStride128;
}
else if (s == "n") {
dst = InitStyle::kNone;
}
else {
std::cout << "Error: " << s << " is not a valid input type.\n";
std::exit(-1);
}
}
}
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
Options defaults;
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("d", d, defaults.d);
cmd.get_cmd_line_argument("h", h, -1);
if (h == -1) h = 2048 / d;
cmd.get_cmd_line_argument("h_k", h_k, -1);
if (h_k == -1) h_k = h;
cmd.get_cmd_line_argument("q", q, -1);
cmd.get_cmd_line_argument("k", k, -1);
if (q == -1) q = k;
if (k == -1) k = q;
if (q == -1 && k == -1) q = k = defaults.q;
cmd.get_cmd_line_argument("b", b, -1);
if (b == -1) b = 16384 / k;
if (b == 0) b = 1;
cmd.get_cmd_line_argument("iterations", iterations, defaults.iterations);
verify = cmd.check_cmd_line_flag("verify");
verbose = cmd.check_cmd_line_flag("verbose");
varlen = cmd.check_cmd_line_flag("varlen");
std::string mask;
cmd.get_cmd_line_argument<std::string>("mask", mask, "");
if (mask == "no" || mask == "") {
causal = residual = false;
if (varlen) {
residual = true;
}
}
else if (mask == "causal") {
residual = false;
causal = true;
}
else if (mask == "residual") {
residual = true;
causal = false;
}
cmd.get_cmd_line_argument("sm-count", sm_count, defaults.sm_count);
get_init_style_argument(cmd, "init-style", init_style_q, defaults.init_style_q);
get_init_style_argument(cmd, "init-style", init_style_k, defaults.init_style_q);
get_init_style_argument(cmd, "init-style", init_style_v, defaults.init_style_q);
get_init_style_argument(cmd, "init-style-q", init_style_q, init_style_q);
get_init_style_argument(cmd, "init-style-k", init_style_k, init_style_k);
get_init_style_argument(cmd, "init-style-v", init_style_v, init_style_v);
cmd.get_cmd_line_argument("kernel-filter", kernel_filter, defaults.kernel_filter);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "77_blackwell_fmha\n\n"
<< " This example showcases the use of CUTLASS's collective operation builders to easily construct\n"
<< " fused multi-head attention forward-passkernels targeting NVIDIA's Blackwell architecture.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --b=<int> Sets the B extent\n"
<< " --h=<int> Sets the H extent\n"
<< " --h_k=<int> Sets the H_K/V extent (for GQA/MQA)\n"
<< " --q=<int> Sets the Q extent\n"
<< " --k=<int> Sets the K extent\n"
<< " --d=<int> Sets the D extentn"
<< " --iterations=<int> Benchmarking iterations\n"
<< " --verify Verify results\n"
<< " --verbose Print smem and execution time per kernel\n"
<< " --mask=<no|residual|causal> Enables masking\n"
<< " --varlen Enables variable sequence length\n"
<< " B*Q and B*K become the total sequence length\n"
<< " and are split B-ways, alternatingly +10% and -10%\n"
<< " with the last batch sized to make it fit\n"
<< " implies at least residual masking for correctness\n"
<< " --sm-count Sets SM count rather than querying it\n"
<< " --kernel-filter=<filter> Sets regexp to match kernel against\n"
<< "\n";
return out;
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
void initialize_block(
DeviceAllocation<Element>& block,
uint64_t seed=2023, InitStyle init_style = InitStyle::kRandom) {
switch (init_style) {
case InitStyle::kOne: {
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, (Element) 1, (Element) 1);
break;
}
case InitStyle::kRandom: {
cutlass::reference::device::BlockFillRandomGaussian(
block.get(), block.size(), seed, (Element) 0, (Element) 1);
break;
}
case InitStyle::kLinearStride1: {
std::vector<Element> data(block.size());
for (size_t i = 0; i < block.size() / 128; i ++) {
for (int j = 0; j < 128; j++) {
data[j + 128*i] = static_cast<Element>((double) (j % 4));
}
}
block.copy_from_host(data.data(), data.size());
break;
}
case InitStyle::kLinearStride128: {
std::vector<Element> data(block.size());
for (size_t i = 0; i < block.size() / 128; i ++) {
for (int j = 0; j < 128; j++) {
data[j + 128*i] = static_cast<Element>((double) (i % 4));
}
}
block.copy_from_host(data.data(), data.size());
break;
}
case InitStyle::kNone: {
break;
}
}
}
///////////////////////////////////////////////////////////////////////////////////////////////////
struct ExampleResult {
bool passed = false;
bool verified = false;
float runtime_ms = 0;
double tflops_tc_s = 0;
double tops_exp2_s = 0;
double tbytes_s = 0;
size_t smem_size = 0;
};
///////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
template<
bool kIsVarlen,
class TileShape,
class DispatchPolicy,
class ActiveMask,
class... KernelOptions
>
struct FwdRunner {
#ifdef FP8
using Element = cutlass::float_e4m3_t;
#else
using Element = cutlass::half_t;
#endif
using ElementAccumulatorQK = float;
using ElementAccumulatorPV = float;
using ElementOut = cutlass::half_t;
// Q K D (B H)
using ProblemShapeRegular = cute::tuple<int, int, int, cute::tuple<cute::tuple<int, int>, int>>;
using ProblemShapeVarlen = cute::tuple<VariableLength, VariableLength, int, cute::tuple<cute::tuple<int, int>, int>>;
using ProblemShapeType = std::conditional_t<kIsVarlen, ProblemShapeVarlen, ProblemShapeRegular>;
using StrideQ = cute::tuple<int, _1, cute::tuple<cute::tuple<int, int>, int>>; // Q D (H_G H_R B)
using StrideK = cute::tuple<int, _1, cute::tuple<cute::tuple<_0, int>, int>>; // K D (H_G H_R B)
using StrideV = StrideK;
using StrideO = StrideQ;
using StrideLSE = cute::tuple<_1, cute::tuple<cute::tuple<int, int>, int>>; // Q (H_G H_R B)
static constexpr bool kIsPersistent = find_option_t<Tag::kIsPersistent, true_type, KernelOptions...>::value;
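// Editorial note: kIsPersistent selects the tile scheduler below; the persistent scheduler keeps
// CTAs resident and loops them over work tiles, while the individual scheduler launches one CTA
// per output tile.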
using TileScheduler = std::conditional_t<kIsPersistent, cutlass::fmha::kernel::PersistentTileScheduler, cutlass::fmha::kernel::IndividualTileScheduler>;
using Mainloop =
cutlass::fmha::collective::Sm100FmhaFwdMainloopTmaWarpspecialized<
Element, ElementAccumulatorQK, ElementAccumulatorPV,
TileShape, StrideQ, StrideK, StrideV,
ActiveMask
>;
using Operation = cutlass::fmha::device::FMHA<
cutlass::fmha::kernel::Sm100FmhaFwdKernelTmaWarpspecialized<
ProblemShapeType,
Mainloop,
cutlass::fmha::collective::Sm100FmhaFwdEpilogueTmaWarpspecialized<
ElementOut, ElementAccumulatorPV,
typename Mainloop::TileShapePV,
StrideO, StrideLSE
>,
TileScheduler
>>;
//
// Data members
//
/// Initialization
StrideQ stride_Q;
StrideK stride_K;
StrideV stride_V;
StrideO stride_O;
StrideLSE stride_LSE;
uint64_t seed = 0;
DeviceAllocation<Element> block_Q;
DeviceAllocation<Element> block_K;
DeviceAllocation<Element> block_V;
DeviceAllocation<ElementOut> block_O;
DeviceAllocation<ElementAccumulatorPV> block_LSE;
DeviceAllocation<ElementOut> block_ref_O;
DeviceAllocation<ElementAccumulatorPV> block_ref_LSE;
std::vector<int> cumulative_seqlen_q;
std::vector<int> cumulative_seqlen_kv;
DeviceAllocation<int> device_cumulative_seqlen_q;
DeviceAllocation<int> device_cumulative_seqlen_kv;
//
// Methods
//
bool verify(const ProblemShapeType& problem_shape) {
Tensor mQ = make_tensor(make_gmem_ptr(block_Q.get()),
select<0,2,3>(problem_shape),
stride_Q);
Tensor mK = make_tensor(make_gmem_ptr(block_K.get()),
select<1,2,3>(problem_shape),
stride_K);
Tensor mV = make_tensor(make_gmem_ptr(block_V.get()),
select<1,2,3>(problem_shape),
stride_V);
Tensor mO = make_tensor(make_gmem_ptr(block_ref_O.get()),
select<0,2,3>(problem_shape),
stride_O);
Tensor mLSE = make_tensor(make_gmem_ptr(block_ref_LSE.get()),
select<0,3>(problem_shape),
stride_LSE);
fmha_reference(problem_shape, mQ, mK, mV, mO, mLSE, ActiveMask{});
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Reference kernel failed. Last CUDA error: "
<< cudaGetErrorString(result) << std::endl;
return false;
}
const double kMaxDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-2;
const double kMeanDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-3;
// Check if output from CUTLASS kernel and reference kernel are equal or not
double max_diff = 0;
double mean_diff = 0;
reference_abs_diff(block_O, block_ref_O, max_diff, mean_diff);
bool passed_O = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh);
if (! passed_O) {
std::cerr << "failed O: max diff " << max_diff
<< " mean " << mean_diff << std::endl;
}
// reference_abs_diff(block_LSE, block_ref_LSE, max_diff, mean_diff);
bool passed_LSE = true; // future work
// bool passed_LSE = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh);
// if ( ! passed_LSE) {
// std::cerr << "failed LSE: max diff " << max_diff
// << " mean " << mean_diff << std::endl;
// }
return passed_O && passed_LSE;
}
template<class ProblemShape>
auto initialize_varlen(const ProblemShape& problem_size, const bool kVarlenSame = true) {
int num_batches = get<3,1>(problem_size);
// generate Q as --b times
// gaussian (--Q, --Q / 2) sampled positive
// track cumulative
std::mt19937 rng(0x202305151552ull);
std::normal_distribution<double> dist_q(get<0>(problem_size), get<0>(problem_size) / 2);
std::normal_distribution<double> dist_kv(get<1>(problem_size), get<1>(problem_size) / 2);
std::cout << "N: " << num_batches << ", Q: " << get<0>(problem_size) << ", KV: " << get<1>(problem_size) << std::endl;
auto generate_positive_int = [](auto& dist, auto& gen) {
int result = 0;
do {
result = static_cast<int>(dist(gen));
} while (result <= 0);
return result;
};
cumulative_seqlen_q = {0};
cumulative_seqlen_kv = {0};
int total_seqlen_q = 0;
int total_seqlen_kv = 0;
int max_seqlen_q = 0;
int max_seqlen_kv = 0;
for (int i = 0; i < num_batches; i++) {
int seqlen_q = kVarlenSame ? get<0>(problem_size) : generate_positive_int(dist_q, rng);
int seqlen_kv = kVarlenSame ? get<1>(problem_size) : generate_positive_int(dist_kv, rng);
total_seqlen_q += seqlen_q;
total_seqlen_kv += seqlen_kv;
max_seqlen_q = std::max(max_seqlen_q, seqlen_q);
max_seqlen_kv = std::max(max_seqlen_kv, seqlen_kv);
cumulative_seqlen_q.push_back(cumulative_seqlen_q.back() + seqlen_q);
cumulative_seqlen_kv.push_back(cumulative_seqlen_kv.back() + seqlen_kv);
}
std::cout << "Q max: " << max_seqlen_q << " total: " << total_seqlen_q << " vs even " << num_batches * get<0>(problem_size) << std::endl;
std::cout << "KV max: " << max_seqlen_kv << " total: " << total_seqlen_kv << " vs even " << num_batches * get<1>(problem_size) << std::endl;
ProblemShape problem_size_for_init = problem_size;
get<3,1>(problem_size_for_init) = 1;
get<0>(problem_size_for_init) = total_seqlen_q;
get<1>(problem_size_for_init) = total_seqlen_kv;
ProblemShapeType problem_size_for_launch;
get<0>(problem_size_for_launch) = VariableLength{max_seqlen_q};
get<1>(problem_size_for_launch) = VariableLength{max_seqlen_kv};
get<2>(problem_size_for_launch) = get<2>(problem_size);
get<3>(problem_size_for_launch) = get<3>(problem_size);
return cute::make_tuple(problem_size_for_init, problem_size_for_launch);
}
/// Initialize operands to be used in the GEMM and reference GEMM
ProblemShapeType initialize(const Options& options) {
int h_r = options.h / options.h_k;
assert(options.h % options.h_k == 0);
auto problem_shape_in = cute::make_tuple(options.q, options.k, options.d, cute::make_tuple(cute::make_tuple(h_r, options.h_k), options.b));
ProblemShapeType problem_shape;
decltype(problem_shape_in) problem_size;
if constexpr (kIsVarlen) {
auto [problem_shape_init, problem_shape_launch] = initialize_varlen(problem_shape_in);
problem_shape = problem_shape_launch;
problem_size = problem_shape_init;
}
else {
problem_size = problem_shape_in;
problem_shape = problem_shape_in;
}
get<2>(problem_size) = cutlass::round_up(get<2>(problem_size), 8); // alignment
auto shape_QO = select<0,2,3>(problem_size);
auto shape_KV = select<1,2,3>(problem_size);
auto shape_LSE = select<0,3>(problem_size);
int SQ = size<0>(problem_size);
int SK = size<1>(problem_size);
int D = size<2>(problem_size);
int H = size<3,0>(problem_size);
int H_K = size<3,0,1>(problem_size);
int H_Q = size<3,0,0>(problem_size);
int B = size<3,1>(problem_size);
stride_Q = make_stride(H*D , _1{}, make_stride(make_stride(D, H_Q*D), H*D*SQ));
stride_O = stride_Q;
stride_K = make_stride(H_K*D , _1{}, make_stride(make_stride(_0{}, D), H_K*D*SK));
stride_V = stride_K;
stride_LSE = make_stride(_1{}, make_stride(make_stride(SQ, SQ*H_Q), SQ*H));
if (kIsVarlen) {
get<2,1>(stride_Q) = 0;
get<2,1>(stride_K) = 0;
get<2,1>(stride_V) = 0;
get<2,1>(stride_O) = 0;
get<1,1>(stride_LSE) = 0;
}
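// Editorial note: with variable sequence length the per-batch sequences are packed back to back
// and located via cumulative_length, so the batch strides above are zeroed out.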
block_Q.reset(size(shape_QO), kIsVarlen ? D*SQ*H : 0);
block_K.reset(size(shape_KV), kIsVarlen ? D*SK*H_K : 0);
block_V.reset(size(shape_KV), kIsVarlen ? D*SK*H_K : 0);
block_O.reset(size(shape_QO), kIsVarlen ? D*SQ*H : 0);
block_LSE.reset(size(shape_LSE));
block_ref_O.reset(size(shape_QO));
block_ref_LSE.reset(size(shape_LSE));
initialize_block(block_Q, seed + 2023, options.init_style_q);
initialize_block(block_K, seed + 2022, options.init_style_k);
initialize_block(block_V, seed + 2021, options.init_style_v);
if ( ! cumulative_seqlen_q.empty()) {
device_cumulative_seqlen_q.reset(cumulative_seqlen_q.size());
device_cumulative_seqlen_q.copy_from_host(
cumulative_seqlen_q.data(), cumulative_seqlen_q.size());
}
if ( ! cumulative_seqlen_kv.empty()) {
device_cumulative_seqlen_kv.reset(cumulative_seqlen_kv.size());
device_cumulative_seqlen_kv.copy_from_host(
cumulative_seqlen_kv.data(), cumulative_seqlen_kv.size());
}
if constexpr (kIsVarlen) {
get<0>(problem_shape).cumulative_length = device_cumulative_seqlen_q.get();
get<1>(problem_shape).cumulative_length = device_cumulative_seqlen_kv.get();
}
return problem_shape;
}
ExampleResult run(const Options& options, const cutlass::KernelHardwareInfo& hw_info) {
ProblemShapeType problem_shape = initialize(options);
typename Operation::Arguments arguments{
problem_shape,
{ block_Q.get(), stride_Q,
block_K.get(), stride_K,
block_V.get(), stride_V },
{ block_O.get(), stride_O,
block_LSE.get(), stride_LSE },
hw_info
};
Operation op;
ExampleResult example_result;
example_result.smem_size = Operation::Kernel::SharedStorageSize;
size_t workspace_size = 0;
workspace_size = Operation::get_workspace_size(arguments);
DeviceAllocation<uint8_t> workspace(workspace_size);
cutlass::Status status = cutlass::Status::kSuccess;
status = op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
std::cerr << "This kernel is not supported. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
status = op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to initialize the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
// Run
status = op.run();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(result) << std::endl;
return example_result;
}
//
// Construct events
//
cudaEvent_t events[2];
for (auto & event : events) {
result = cudaEventCreate(&event);
if (result != cudaSuccess) {
std::cerr << "cudaEventCreate() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
}
// Record an event at the start of a series of GEMMs
result = cudaEventRecord(events[0]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
for (int i = 0; i < options.iterations; i++) {
status = op.run();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
}
//
// Stop profiling loop
//
// Record an event when the GEMMs are complete
result = cudaEventRecord(events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
// Wait for work on the device to complete.
result = cudaEventSynchronize(events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventSynchronize() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
// Measure elapsed runtime
float runtime_ms = 0;
result = cudaEventElapsedTime(&runtime_ms, events[0], events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventElapsed() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
runtime_ms /= static_cast<float>(options.iterations);
double flops;
if (kIsVarlen) {
flops = 0.0;
for (int i = 0; i < size<3,1>(problem_shape); i++) {
flops += (cumulative_seqlen_q[i+1] - cumulative_seqlen_q[i])
* 1.0
* (cumulative_seqlen_kv[i+1] - cumulative_seqlen_kv[i]);
}
}
else {
flops = 1.0;
flops *= static_cast<double>(size<0>(problem_shape));
flops *= static_cast<double>(size<1>(problem_shape));
flops *= static_cast<double>(size<3,1>(problem_shape));
}
flops *= 4.0 * (std::is_same_v<ActiveMask, CausalMask> ? 0.5 : 1.0);
flops *= static_cast<double>(size<2>(problem_shape));
flops *= static_cast<double>(size<3,0>(problem_shape));
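// Editorial note: forward attention runs two GEMMs per head and batch (S = Q*K^T and O = P*V),
// each at 2 * seqlen_q * seqlen_kv * d flops, giving the 4 * Q * K * D * H * B total accumulated
// above; causal masking computes only about half of the score matrix, hence the 0.5 factor.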
double tflops_s = flops * 1e-12 /*tera*/ / (runtime_ms * 1e-3 /*ms*/);
example_result.tflops_tc_s = tflops_s;
example_result.runtime_ms = runtime_ms;
result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(result) << std::endl;
return example_result;
}
// Verify that the result is correct
bool passed = true;
if (options.verify) {
passed = verify(problem_shape);
if (passed) example_result.verified = true;
}
if (!passed) {
std::cerr << "Reference check failed" << std::endl;
return example_result;
}
example_result.passed = true;
return example_result;
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to print a description of the example run and its result
void print_result(const std::string& description, ExampleResult result, bool verbose) {
std::ios fmt(nullptr);
fmt.copyfmt(std::cout);
std::cout << (result.passed ? (result.verified ? " [OK] " : " [--] ") : "[FAIL] ");
std::cout << std::setw(32) << std::left << description;
std::cout.copyfmt(fmt);
std::cout << " : " << result.tflops_tc_s << " TFLOPS/s" << std::endl;
if (verbose) {
std::cout << " t=" << result.runtime_ms << "ms, "
"smem=" << result.smem_size << "b" << std::endl;
}
}
///////////////////////////////////////////////////////////////////////////////////////////////////
template<class Mask>
void run_fwd_128(Mask fusion, Options const & options, cutlass::KernelHardwareInfo const& hw_info) {
auto run = [&](auto shape, const char* name, auto... kernel_options) {
if ((! options.kernel_filter.empty()) && (! std::regex_search(name, std::basic_regex(options.kernel_filter)))) {
return;
}
if (options.varlen) {
FwdRunner<true, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
else
{
FwdRunner<false, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
};
using HeadDim = _128;
// Persistent Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 persistent", Option<Tag::kIsPersistent, true_type>{});
// Individual Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 individual", Option<Tag::kIsPersistent, false_type>{});
}
///////////////////////////////////////////////////////////////////////////////////////////////////
template<class Mask>
void run_fwd_64(Mask fusion, Options const & options, cutlass::KernelHardwareInfo const& hw_info) {
auto run = [&](auto shape, const char* name, auto... kernel_options) {
if ((! options.kernel_filter.empty()) && (! std::regex_search(name, std::basic_regex(options.kernel_filter)))) {
return;
}
if (options.varlen) {
FwdRunner<true, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
else
{
FwdRunner<false, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
};
using HeadDim = _64;
// Persistent Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 persistent", Option<Tag::kIsPersistent, true_type>{});
// Individual Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 individual", Option<Tag::kIsPersistent, false_type>{});
}
///////////////////////////////////////////////////////////////////////////////////////////////////
template<class Mask>
void run_fwd_32(Mask fusion, Options const & options, cutlass::KernelHardwareInfo const& hw_info) {
auto run = [&](auto shape, const char* name, auto... kernel_options) {
if (options.varlen) {
FwdRunner<true, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
else {
FwdRunner<false, decltype(shape), void, Mask, decltype(kernel_options)...> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
}
};
using HeadDim = _32;
#ifdef FP8
// Persistent Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 persistent", Option<Tag::kIsPersistent, true_type>{});
// Individual Tile Scheduler
run(Shape<_256, _128, HeadDim>{}, "tma ws 256x128 acc fp32 individual", Option<Tag::kIsPersistent, false_type>{});
#endif
}
///////////////////////////////////////////////////////////////////////////////////////////////////
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
int main_single(int argc, char const **args) {
cudaDeviceProp props;
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (error != cudaSuccess) {
std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl;
return -1;
}
if (__CUDACC_VER_MAJOR__ < 12 || props.major != 10) {
std::cout
<< "This example requires a GPU of NVIDIA's Blackwell Architecture "
<< "(compute capability major 10) and CUDA 12.8 or greater.\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
if (options.error) {
std::cerr << "Aborting execution." << std::endl;
return -1;
}
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
//
// Run examples
//
// The KernelHardwareInfo struct holds the number of SMs on the GPU with a given device ID. This
// information is used by the underlying kernel.
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
if (options.sm_count == 0) {
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
}
else {
hw_info.sm_count = options.sm_count;
}
std::cout << "###### B " << options.b << " H " << options.h << " H_K " << options.h_k << " Q " << options.q << " K " << options.k << " D " << options.d << " ";
std::cout << "Forward" << " " << (options.causal ? "Causal" : (options.residual ? "Residual" : "None")) << " ";
std::cout << "#SM " << hw_info.sm_count << std::endl;
auto with_mask = [&](auto fn) {
if (options.causal) {
fn(CausalMask{});
}
else if (options.residual) {
fn(ResidualMask{});
}
else {
fn(NoMask{});
}
};
with_mask([&](auto fusion) {
if (options.d <= 32) {
run_fwd_32(fusion, options, hw_info);
}
else if (options.d <= 64) {
run_fwd_64(fusion, options, hw_info);
}
else if (options.d <= 128) {
run_fwd_128(fusion, options, hw_info);
}
else {
std::cout << "No kernel instantiated for d=" << options.d << std::endl;
}
});
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
std::vector<std::string> full_arguments(args, args + argc);
int result = 0;
bool recursed = false;
for (size_t i = 1; i < full_arguments.size(); i++) {
if (full_arguments[i].find(',') != std::string::npos) {
auto arg = full_arguments[i];
size_t eq_pos = arg.find('=');
std::string prefix = eq_pos == std::string::npos ? "" : arg.substr(0, eq_pos+1);
std::string rest = eq_pos == std::string::npos ? arg : arg.substr(eq_pos+1);
for (;;) {
size_t comma_pos = rest.find(',');
std::string current = rest.substr(0, comma_pos);
full_arguments[i] = prefix + current;
std::vector<const char*> next_args;
for (auto& elem : full_arguments) { next_args.push_back(elem.data()); }
main(argc, next_args.data());
if (comma_pos == std::string::npos) break;
rest = rest.substr(comma_pos+1);
}
recursed = true;
break;
}
}
if (! recursed) {
main_single(argc, args);
}
return result;
}
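/////////////////////////////////////////////////////////////////////////////////////////////////
// Illustrative sketch (added for exposition; not part of the original example, never called):
// main() above fans out comma-separated option values into one run per value, so e.g.
// "--d=64,128" executes the example once with --d=64 and once with --d=128. The same
// expansion, isolated as a hypothetical helper (uses only std::string/std::vector, which
// this translation unit already pulls in):
inline std::vector<std::string> expand_comma_list(std::string const& arg) {
  size_t eq_pos = arg.find('=');
  std::string prefix = eq_pos == std::string::npos ? "" : arg.substr(0, eq_pos + 1);
  std::string rest   = eq_pos == std::string::npos ? arg : arg.substr(eq_pos + 1);
  std::vector<std::string> out;
  for (;;) {
    size_t comma_pos = rest.find(',');
    out.push_back(prefix + rest.substr(0, comma_pos));  // e.g. "--d=" + "64"
    if (comma_pos == std::string::npos) break;
    rest = rest.substr(comma_pos + 1);
  }
  return out;  // {"--d=64", "--d=128"} for the example above
}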
/////////////////////////////////////////////////////////////////////////////////////////////////

examples/77_blackwell_fmha/77_blackwell_fmha_gen.cu
@ -0,0 +1,831 @@
/***************************************************************************************************
* Copyright (c) 2024 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Example implementation of fused multi-head attention for the NVIDIA Blackwell SM100
architecture using CUTLASS 3.
MQA/GQA
-------
The head dimension can be represented as a tuple, where the K/V strides in the
first dimension is zero. This has the effect of MQA or GQA.
* MHA is (head_size:head_stride).
* MQA is (head_size:head_stride) in Q and (head_size:_0) in K and V.
* GQA is (grouped_heads,heads_kv):(head_stride,grouped_heads*head_stride) in Q
and (grouped_heads,heads_kv):(0,head_stride) in K and V
Example usage:
$ ./examples/77_blackwell_fmha/77_blackwell_fmha_gen_fp8 \
--b=2048 --h=2048 --d=2048 --k=2048
*/
#define DSHOW(x) print(#x ": "); print(x); print("\n");
#define DSHOWT(x) print(#x ": "); print_tensor(x); print("\n");
#include <iostream>
#include <random>
#include <regex>
#include "cute/tensor.hpp"
#include "cutlass/cutlass.h"
#include "cutlass/kernel_hardware_info.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "reference/fmha_fwd_gen_reference.hpp"
#include "reference/reference_abs_error.hpp"
#include "device/fmha.hpp"
#include "collective/fmha_fusion.hpp"
#include "collective/sm100_fmha_gen_mainloop_warpspecialized.hpp"
#include "collective/sm100_fmha_gen_epilogue_warpspecialized.hpp"
#include "kernel/sm100_fmha_gen_kernel_warpspecialized.hpp"
#include "kernel/fmha_tile_scheduler.hpp"
///////////////////////////////////////////////////////////////////////////////////////////////////
using namespace cute;
///////////////////////////////////////////////////////////////////////////////////////////////////
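// Illustrative sketch (added for exposition; not part of the original file, never called):
// the MQA/GQA head layouts described in the file header, written out as CuTe layouts with
// assumed sizes (grouped_heads = 4, heads_kv = 2, head_stride = 128).
inline void gqa_head_layout_sketch() {
  auto grouped_heads = Int<4>{};
  auto heads_kv      = Int<2>{};
  auto head_stride   = Int<128>{};  // e.g. head_size, for contiguously packed heads
  // Q: every (grouped, kv) head pair addresses distinct data.
  auto q_heads  = make_layout(make_shape (grouped_heads, heads_kv),
                              make_stride(head_stride, grouped_heads * head_stride));
  // K/V: zero stride in the grouped mode, so all grouped Q heads read the same KV head.
  auto kv_heads = make_layout(make_shape (grouped_heads, heads_kv),
                              make_stride(_0{}, head_stride));
  print(q_heads);  print("\n");  // (_4,_2):(_128,_512)
  print(kv_heads); print("\n");  // (_4,_2):(_0,_128)
}
///////////////////////////////////////////////////////////////////////////////////////////////////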
enum class InitStyle {
kZero, kOne, kLinearStride128, kLinearStride1, kRandom, kNone
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Command line options parsing
struct Options {
bool help = false;
bool error = false;
int b = 1;
int h = 1;
int h_k = 1;
int k = 512;
int d = 128;
int iterations = 3;
bool verify = false;
bool verbose = false;
bool remap = false;
bool varlen = false;
bool cache_only = false;
int sm_count = 0;
std::string kernel_filter;
bool clear_cache = false;
InitStyle init_style_q = InitStyle::kRandom;
InitStyle init_style_cache_k = InitStyle::kRandom;
InitStyle init_style_cache_v = InitStyle::kRandom;
InitStyle init_style_new_k = InitStyle::kRandom;
InitStyle init_style_new_v = InitStyle::kRandom;
static void get_init_style_argument(cutlass::CommandLine& cmd, const char* name, InitStyle& dst, InitStyle const& src) {
std::string s;
cmd.get_cmd_line_argument(name, s, s);
if (s.empty()) {
dst = src;
}
else {
if (s == "r") {
dst = InitStyle::kRandom;
}
else if (s == "0") {
dst = InitStyle::kZero;
}
else if (s == "1") {
dst = InitStyle::kOne;
}
else if (s == "d") {
dst = InitStyle::kLinearStride1;
}
else if (s == "s") {
dst = InitStyle::kLinearStride128;
}
else if (s == "n") {
dst = InitStyle::kNone;
}
else {
std::cout << "Error: " << s << " is not a valid input type.\n";
std::exit(-1);
}
}
}
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
Options defaults;
if (cmd.check_cmd_line_flag("help")) {
help = true;
return;
}
cmd.get_cmd_line_argument("d", d, defaults.d);
cmd.get_cmd_line_argument("h", h, -1);
if (h == -1) h = 2048 / d;
cmd.get_cmd_line_argument("h_k", h_k, -1);
if (h_k == -1) h_k = h;
cmd.get_cmd_line_argument("k", k, defaults.k);
cmd.get_cmd_line_argument("b", b, -1);
if (b == -1) b = 16384 / k;
if (b == 0) b = 1;
cmd.get_cmd_line_argument("iterations", iterations, defaults.iterations);
verify = cmd.check_cmd_line_flag("verify");
verbose = cmd.check_cmd_line_flag("verbose");
varlen = cmd.check_cmd_line_flag("varlen");
remap = cmd.check_cmd_line_flag("remap");
cache_only = cmd.check_cmd_line_flag("cache-only");
cmd.get_cmd_line_argument("sm-count", sm_count, defaults.sm_count);
get_init_style_argument(cmd, "init-style", init_style_q, defaults.init_style_q);
get_init_style_argument(cmd, "init-style", init_style_cache_k, defaults.init_style_cache_k);
get_init_style_argument(cmd, "init-style", init_style_cache_v, defaults.init_style_cache_v);
get_init_style_argument(cmd, "init-style", init_style_new_k, defaults.init_style_new_k);
get_init_style_argument(cmd, "init-style", init_style_new_v, defaults.init_style_new_v);
get_init_style_argument(cmd, "init-style-q", init_style_q, init_style_q);
get_init_style_argument(cmd, "init-style-cache-k", init_style_cache_k, init_style_cache_k);
get_init_style_argument(cmd, "init-style-cache-v", init_style_cache_v, init_style_cache_v);
get_init_style_argument(cmd, "init-style-new-k", init_style_new_k, init_style_new_k);
get_init_style_argument(cmd, "init-style-new-v", init_style_new_v, init_style_new_v);
clear_cache = cmd.check_cmd_line_flag("clear-cache");
cmd.get_cmd_line_argument("kernel-filter", kernel_filter, defaults.kernel_filter);
}
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {
out << "77_blackwell_fmha_gen\n\n"
<< " This example showcases the use of CUTLASS's collective operation builders to easily construct\n"
<< " fused multi-head attention forward-pass gen-phase kernels targeting NVIDIA's Blackwell architecture.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this usage statement\n\n"
<< " --b=<int> Sets the B extent\n"
<< " --h=<int> Sets the H extent\n"
<< " --h_k=<int> Sets the H_K/V extent (for GQA/MQA)\n"
<< " --k=<int> Sets the K extent (sampled around this length)\n"
<< " --d=<int> Sets the D extentn"
<< " --iterations=<int> Benchmarking iterations\n"
<< " --verify Verify results\n"
<< " --verbose Print smem and execution time per kernel\n"
<< " --remap Enables batch index remapping\n"
<< " --cache-only Only use data from KV cache, no reading or inserting new entry\n"
<< " --varlen Varies sequence length between cache entries\n"
<< " --sm-count Sets SM count rather than querying it\n"
<< " --clear-cache Clears the cache before benchmarking runs\n"
<< " --kernel-filter=<filter> Sets regexp to match kernel against\n"
<< "\n";
return out;
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to initialize a block of device data
template <class Element>
void initialize_block(
DeviceAllocation<Element>& block,
uint64_t seed=2023, InitStyle init_style = InitStyle::kRandom) {
switch (init_style) {
case InitStyle::kZero: {
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, (Element) 0, (Element) 0);
break;
}
case InitStyle::kOne: {
cutlass::reference::device::BlockFillRandomUniform(
block.get(), block.size(), seed, (Element) 1, (Element) 1);
break;
}
case InitStyle::kRandom: {
cutlass::reference::device::BlockFillRandomGaussian(
block.get(), block.size(), seed, (Element) 0, (Element) 1);
break;
}
case InitStyle::kLinearStride1: {
std::vector<Element> data(block.size());
for (size_t i = 0; i < block.size() / 128; i ++) {
for (int j = 0; j < 128; j++) {
data[j + 128*i] = static_cast<Element>((double) (j % 4));
}
}
block.copy_from_host(data.data(), data.size());
break;
}
case InitStyle::kLinearStride128: {
std::vector<Element> data(block.size());
for (size_t i = 0; i < block.size() / 128; i ++) {
for (int j = 0; j < 128; j++) {
data[j + 128*i] = static_cast<Element>((double) (i % 4));
}
}
block.copy_from_host(data.data(), data.size());
break;
}
case InitStyle::kNone: {
break;
}
}
}
///////////////////////////////////////////////////////////////////////////////////////////////////
struct ExampleResult {
bool supported = false;
bool passed = false;
bool verified = false;
float runtime_ms = 0;
double tflops_tc_s = 0;
double tops_exp2_s = 0;
double tbytes_s = 0;
size_t smem_size = 0;
};
///////////////////////////////////////////////////////////////////////////////////////////////////
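// With --clear-cache, re-randomizes a 1 GiB scratch buffer before each timed iteration so
// every run starts from a cold L2/DRAM state.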
struct ClearCache {
const int size = 1024 * 1024 * 1024 / 4;
DeviceAllocation<float> data;
bool active = false;
ClearCache() = default;
void set_active(bool the_active) {
active = the_active;
if (active) {
data.reset(size);
}
else {
data.reset(0);
}
}
void operator ()() {
if (active) {
initialize_block(data, 0x49314, InitStyle::kRandom);
}
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
enum class KernelType {
UMMA_P, UMMA_I
};
///////////////////////////////////////////////////////////////////////////////////////////////////
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
template<KernelType kKernelType, class TileShape, class ThreadShape>
struct ExampleRunner {
using Element = cutlass::float_e5m2_t;
using ElementAcc = float;
using ElementOut = cutlass::half_t;
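// Gen-phase problem shape: (_1, seqlen_kv+1, D, ((grouped_heads, heads_kv), batch)).
// Q/NewK/NewV carry a single token (stride _0 over the first mode); the K/V cache strides
// over the full sequence, and the _0 stride in the grouped-head mode implements MQA/GQA.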
using ProblemShape = Shape<_1, int, int, Shape<Shape<int, int>, int>>;
using StrideQ = Stride<_0, _1, Stride<Stride<int, int>, int>>;
using StrideNewK = Stride<_0, _1, Stride<Stride<_0, int>, int>>;
using StrideCacheK = Stride<int, _1, Stride<Stride<_0, int>, int>>;
using StrideNewV = StrideNewK;
using StrideCacheV = StrideCacheK;
using StrideO = StrideQ;
using Kernel =
cutlass::fmha::kernel::Sm100FmhaGenKernelWarpspecialized<
ProblemShape,
cutlass::fmha::collective::Sm100FmhaGenMainloopWarpspecialized<
Element, ElementAcc, ElementAcc, ElementOut,
TileShape,
StrideQ, StrideNewK, StrideNewV,
StrideCacheK, StrideCacheV, StrideO
>,
cutlass::fmha::collective::Sm100FmhaGenEpilogueWarpspecialized<ElementOut, StrideO>,
std::conditional_t<kKernelType == KernelType::UMMA_P,
cutlass::fmha::kernel::PersistentTileScheduler,
cutlass::fmha::kernel::IndividualTileScheduler
>
>;
using Operation = cutlass::fmha::device::FMHA<Kernel>;
StrideQ stride_q;
StrideNewK stride_new_k;
StrideNewV stride_new_v;
StrideCacheK stride_cache_k;
StrideCacheV stride_cache_v;
StrideO stride_o;
uint64_t seed = 0;
std::vector<int> seqlen_kv;
DeviceAllocation<int> block_seqlen_kv;
DeviceAllocation<int> block_cache_batch_idx;
DeviceAllocation<Element> block_q;
DeviceAllocation<Element> block_new_k;
DeviceAllocation<Element> block_new_v;
DeviceAllocation<Element> block_cache_k;
DeviceAllocation<Element> block_cache_v;
DeviceAllocation<ElementOut> block_o;
DeviceAllocation<Element> block_ref_cache_k;
DeviceAllocation<Element> block_ref_cache_v;
DeviceAllocation<ElementOut> block_ref_o;
ClearCache clear_cache;
bool verify(const ProblemShape& problem_shape) {
Tensor mQ = make_tensor(make_gmem_ptr(block_q.get()), select<0,2,3>(problem_shape), stride_q);
Tensor mNewK = make_tensor(make_gmem_ptr(block_new_k.get()), select<0,2,3>(problem_shape), stride_new_k);
Tensor mNewV = make_tensor(make_gmem_ptr(block_new_v.get()), select<0,2,3>(problem_shape), stride_new_v);
Tensor mCacheK = make_tensor(make_gmem_ptr(block_ref_cache_k.get()), select<1,2,3>(problem_shape), stride_cache_k);
Tensor mCacheV = make_tensor(make_gmem_ptr(block_ref_cache_v.get()), select<1,2,3>(problem_shape), stride_cache_v);
Tensor mO = make_tensor(make_gmem_ptr(block_ref_o.get()), select<0,2,3>(problem_shape), stride_o);
fmha_fwd_gen_reference<ElementAcc>(
problem_shape, block_seqlen_kv.get(), block_cache_batch_idx.get(),
mQ, mNewK, mNewV, mCacheK, mCacheV, mO);
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Reference kernel failed. Last CUDA error: "
<< cudaGetErrorString(result) << std::endl;
return false;
}
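// Looser tolerances for 1-byte (FP8) element types than for 16-bit element types.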
const double kMaxDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-2;
const double kMeanDiffThresh = sizeof(Element) == 1 ? 1e-1 : 1e-3;
// Check if output from CUTLASS kernel and reference kernel are equal or not
double max_diff = 0;
double mean_diff = 0;
reference_abs_diff(block_o, block_ref_o, max_diff, mean_diff);
bool passed_O = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh);
if (! passed_O) {
std::cerr << "failed O: max diff " << max_diff
<< " mean " << mean_diff << std::endl;
}
reference_abs_diff(block_cache_k, block_ref_cache_k, max_diff, mean_diff);
bool passed_K = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh);
if ( ! passed_K) {
std::cerr << "failed Cache K: max diff " << max_diff
<< " mean " << mean_diff << std::endl;
}
reference_abs_diff(block_cache_v, block_ref_cache_v, max_diff, mean_diff);
bool passed_V = (max_diff < kMaxDiffThresh) && (mean_diff < kMeanDiffThresh);
if ( ! passed_V) {
std::cerr << "failed Cache V: max diff " << max_diff
<< " mean " << mean_diff << std::endl;
}
return passed_O && passed_K && passed_V;
}
ProblemShape initialize(const Options& options) {
clear_cache.set_active(options.clear_cache);
std::vector<int> cache_batch_idx;
// set up strides and sizes
if (options.remap) {
for (int i = 0; i < options.b; i++) {
cache_batch_idx.push_back(i);
}
std::mt19937 rng(0x202305291305ull);
std::shuffle(cache_batch_idx.begin(), cache_batch_idx.end(), rng);
}
seqlen_kv = std::vector<int>(options.b, options.k);
if (options.varlen) {
std::mt19937 rng(0x202305151552ull);
std::normal_distribution<double> dist_kv(options.k, options.k / 2);
auto generate_positive_int = [](auto& dist, auto& gen) {
int result = 0;
do {
result = static_cast<int>(dist(gen));
} while (result <= 0);
return result;
};
for (int i = 0; i < options.b; i++) {
seqlen_kv[i] = generate_positive_int(dist_kv, rng);
}
}
int max_seqlen_kv = 0;
for (auto e : seqlen_kv) {
max_seqlen_kv = std::max(e, max_seqlen_kv);
}
ProblemShape result = make_shape(_1{}, max_seqlen_kv + 1, options.d, make_shape(make_shape(options.h / options.h_k, options.h_k), options.b));
stride_q = make_stride(_0{}, _1{}, make_stride(make_stride(options.d, options.d * size<3,0,0>(result)), options.d * size<3,0>(result)));
stride_new_k = make_stride(_0{}, _1{}, make_stride(make_stride(_0{}, options.d), options.d * size<3,0,1>(result)));
stride_cache_k = make_stride(options.d * size<3,0,1>(result), _1{}, make_stride(make_stride(_0{}, options.d), options.d * size<3,0,1>(result) * get<1>(result)));
stride_new_v = stride_new_k;
stride_cache_v = stride_cache_k;
stride_o = stride_q;
block_q.reset(options.b * get<2,1>(stride_q));
if (! options.cache_only) {
block_new_k.reset(options.b * get<2,1>(stride_new_k));
block_new_v.reset(options.b * get<2,1>(stride_new_v));
}
block_cache_k.reset(options.b * get<2,1>(stride_cache_k));
block_cache_v.reset(options.b * get<2,1>(stride_cache_v));
block_o.reset(options.b * get<2,1>(stride_o));
block_ref_cache_k.reset(options.b * get<2,1>(stride_cache_k));
block_ref_cache_v.reset(options.b * get<2,1>(stride_cache_v));
block_ref_o.reset(options.b * get<2,1>(stride_o));
initialize_block(block_q, seed + 2023, options.init_style_q);
if (! options.cache_only) {
initialize_block(block_new_k, seed + 2022, options.init_style_new_k);
initialize_block(block_new_v, seed + 2021, options.init_style_new_v);
}
initialize_block(block_cache_k, seed + 2024 - 2025, options.init_style_cache_k);
initialize_block(block_cache_v, seed + 2025, options.init_style_cache_v);
block_ref_cache_k.copy_from_device(block_cache_k.get(), block_cache_k.size());
block_ref_cache_v.copy_from_device(block_cache_v.get(), block_cache_v.size());
block_seqlen_kv.reset(seqlen_kv.size());
block_seqlen_kv.copy_from_host(seqlen_kv.data(), seqlen_kv.size());
if (! cache_batch_idx.empty()) {
block_cache_batch_idx.reset(cache_batch_idx.size());
block_cache_batch_idx.copy_from_host(cache_batch_idx.data(), cache_batch_idx.size());
}
return result;
}
ExampleResult run(const Options& options, const cutlass::KernelHardwareInfo& hw_info) {
auto problem_shape = initialize(options);
typename Operation::Arguments arguments{
problem_shape,
block_seqlen_kv.get(), block_cache_batch_idx.get(),
block_q.get(), stride_q,
block_new_k.get(), stride_new_k,
block_new_v.get(), stride_new_v,
block_cache_k.get(), stride_cache_k,
block_cache_v.get(), stride_cache_v,
block_o.get(), stride_o,
hw_info
};
Operation op;
ExampleResult example_result;
example_result.smem_size = Operation::Kernel::SharedStorageSize;
size_t workspace_size = 0;
workspace_size = Operation::get_workspace_size(arguments);
DeviceAllocation<uint8_t> workspace(workspace_size);
cutlass::Status status = cutlass::Status::kSuccess;
status = op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
// std::cerr << "This kernel is not supported. Last CUDA error is: "
// << cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
example_result.supported = true;
status = op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to initialize the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
// Run
status = op.run();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(result) << std::endl;
return example_result;
}
//
// Construct events
//
cudaEvent_t events[2];
for (auto & event : events) {
result = cudaEventCreate(&event);
if (result != cudaSuccess) {
std::cerr << "cudaEventCreate() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
}
float total_runtime_ms = 0;
for (int i = 0; i < options.iterations; i++) {
clear_cache();
// Record an event at the start of a series of GEMMs
result = cudaEventRecord(events[0]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
status = op.run();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Failed to launch the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(cudaGetLastError()) << std::endl;
return example_result;
}
// Record an event when the GEMMs are complete
result = cudaEventRecord(events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
//
// Stop profiling loop
//
// Wait for work on the device to complete.
result = cudaEventSynchronize(events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventSynchronize() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
// Measure elapsed runtime
float runtime_ms = 0;
result = cudaEventElapsedTime(&runtime_ms, events[0], events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventElapsed() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "cudaDeviceSynchronize() failed: " << cudaGetErrorString(result) << std::endl;
return example_result;
}
total_runtime_ms += runtime_ms;
}
float runtime_ms = total_runtime_ms / static_cast<float>(options.iterations);
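// Bandwidth model: read Q and write O once per head and batch, read NewK/NewV once per
// KV head and batch, and stream the K/V cache over its full (seqlen_kv+1) length per KV
// head and batch; the final scale by size<2>(problem_shape) accounts for the head dim D.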
double bytes;
bytes = 0.0;
bytes += double(sizeof(Element) * size<3>(problem_shape)); // Q
bytes += double(sizeof(ElementOut) * size<3>(problem_shape)); // O
bytes += 2.0 * double(sizeof(Element) * size<3>(problem_shape) / size<3,0,0>(problem_shape)); // NewK, NewV
double total_seqlen_kv = 0;
for (auto e : seqlen_kv) {
total_seqlen_kv += double(e + 1);
}
bytes += 2.0 * double(sizeof(Element) * size<3,0,1>(problem_shape) * total_seqlen_kv); // CacheK, CacheV
bytes *= static_cast<double>(size<2>(problem_shape));
double tbytes_s = bytes * 1e-12 /*tera*/ / (runtime_ms * 1e-3 /*ms*/);
example_result.tbytes_s = tbytes_s;
example_result.runtime_ms = runtime_ms;
result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Error running the CUTLASS kernel. Last CUDA error is: "
<< cudaGetErrorString(result) << std::endl;
return example_result;
}
// Verify that the result is correct
bool passed = true;
if (options.verify) {
passed = verify(problem_shape);
if (passed) example_result.verified = true;
}
if (!passed) {
std::cerr << "Reference check failed" << std::endl;
return example_result;
}
example_result.passed = true;
return example_result;
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
#endif // defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to print a description of the example run and its result
void print_result(const std::string& description, ExampleResult result, bool verbose) {
std::ios fmt(nullptr);
fmt.copyfmt(std::cout);
std::cout << (result.supported ? (result.passed ? (result.verified ? " [OK] " : " [--] ") : "[FAIL] ") : "[NSUP] ");
std::cout << std::setw(32) << std::left << description;
std::cout.copyfmt(fmt);
std::cout << " : " << result.tbytes_s << " TB/s" << std::endl;
if (verbose) {
std::cout << " t=" << result.runtime_ms << "ms, "
"smem=" << result.smem_size << "b" << std::endl;
}
}
///////////////////////////////////////////////////////////////////////////////////////////////////
int main_single(int argc, char const **args) {
cudaDeviceProp props;
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (error != cudaSuccess) {
std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl;
return -1;
}
if (__CUDACC_VER_MAJOR__ < 12 || props.major < 10) {
std::cout
<< "This example requires a GPU of NVIDIA's Blackwell Architecture or "
<< "later (compute capability 90 or greater) and CUDA 12.0 or greater.\n";
return 0;
}
//
// Parse options
//
Options options;
options.parse(argc, args);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
if (options.error) {
std::cerr << "Aborting execution." << std::endl;
return -1;
}
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
//
// Run examples
//
// The KernelHardwareInfo struct holds the number of SMs on the GPU with a given device ID. This
// information is used by the underlying kernel.
cutlass::KernelHardwareInfo hw_info;
// Change device_id to another value if you are running on a machine with multiple GPUs and wish
// to use a GPU other than that with device ID 0.
hw_info.device_id = 0;
if (options.sm_count == 0) {
hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
}
else {
hw_info.sm_count = options.sm_count;
}
std::cout << "###### B " << options.b << " H " << options.h << " H_K " << options.h_k << " K " << options.k << " D " << options.d << " ";
std::cout << "Gen" << " " << (options.varlen ? "Variable" : "Uniform") << " " << (options.remap ? "Remap" : "Linear") << " ";
std::cout << "#SM " << hw_info.sm_count << std::endl;
using UMMA = true_type;
using FFMA2 = false_type;
auto run = [&](const char* name, auto kernel_type, auto tile, auto thr) {
if ((! options.kernel_filter.empty()) && (! std::regex_search(name, std::basic_regex(options.kernel_filter)))) {
return;
}
ExampleRunner<decltype(kernel_type)::value, decltype(tile), decltype(thr)> runner;
auto result = runner.run(options, hw_info);
print_result(name, result, options.verbose);
};
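// RUN(mode, m, n, k, tm, tn, tk) instantiates an ExampleRunner with the given tile and
// thread shapes and tile scheduler (UMMA_P = persistent, UMMA_I = individual), and prints
// its result under a name of the form "UMMA_P 128x128x128 / 1x1x1".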
#define RUN(MODE, m, n, k, tm, tn, tk) \
run( \
#MODE " " #m "x" #n "x" #k " / " #tm "x" #tn "x" #tk, \
std::integral_constant<KernelType, KernelType::MODE>{}, Shape<_##m, _##n, _##k>{}, Shape<_##tm, _##tn, _##tk>{} \
)
RUN(UMMA_I, 128, 64, 128, 1, 1, 1);
RUN(UMMA_I, 128, 128, 128, 1, 1, 1);
RUN(UMMA_I, 128, 256, 128, 1, 1, 1);
RUN(UMMA_P, 128, 64, 128, 1, 1, 1);
RUN(UMMA_P, 128, 128, 128, 1, 1, 1);
RUN(UMMA_P, 128, 256, 128, 1, 1, 1);
#endif
return 0;
}
/////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, char const **args) {
std::vector<std::string> full_arguments(args, args + argc);
int result = 0;
bool recursed = false;
for (size_t i = 1; i < full_arguments.size(); i++) {
if (full_arguments[i].find(',') != std::string::npos) {
auto arg = full_arguments[i];
size_t eq_pos = arg.find('=');
std::string prefix = eq_pos == std::string::npos ? "" : arg.substr(0, eq_pos+1);
std::string rest = eq_pos == std::string::npos ? arg : arg.substr(eq_pos+1);
for (;;) {
size_t comma_pos = rest.find(',');
std::string current = rest.substr(0, comma_pos);
full_arguments[i] = prefix + current;
std::vector<const char*> next_args;
for (auto& elem : full_arguments) { next_args.push_back(elem.data()); }
main(argc, next_args.data());
if (comma_pos == std::string::npos) break;
rest = rest.substr(comma_pos+1);
}
recursed = true;
break;
}
}
if (! recursed) {
main_single(argc, args);
}
return result;
}
/////////////////////////////////////////////////////////////////////////////////////////////////

examples/77_blackwell_fmha/CMakeLists.txt
@ -0,0 +1,105 @@
# Copyright (c) 2014 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
set_property(
SOURCE 77_blackwell_fmha.cu
PROPERTY COMPILE_FLAGS "--use_fast_math -ftemplate-backtrace-limit=0")
set_property(
SOURCE 77_blackwell_fmha_gen.cu
PROPERTY COMPILE_FLAGS "--use_fast_math -ftemplate-backtrace-limit=0")
set(TEST_BASIC --b=1 --h=4 --q=512 --k=512 --d=128 --verify --mask=no)
set(TEST_CAUSAL --b=1 --h=4 --q=512 --k=512 --d=128 --verify --mask=causal)
set(TEST_VARLEN --b=1 --h=4 --q=512 --k=512 --d=128 --verify --mask=residual --varlen)
set(TEST_HDIM64 --b=2 --h=4 --q=512 --k=512 --d=64 --verify)
set(TEST_GQA --b=2 --h=4 --h_k=2 --q=512 --k=512 --d=64 --verify)
set(TEST_GEN_BASIC --b=1 --h=4 --k=512 --d=128 --verify)
set(TEST_GEN_VARLEN --b=1 --h=4 --k=512 --d=128 --verify --varlen)
set(TEST_GEN_HDIM64 --b=2 --h=4 --k=512 --d=64 --verify)
set(TEST_GEN_GQA --b=2 --h=4 --h_k=2 --k=512 --d=64 --verify)
set(TEST_GEN_REMAP --b=2 --h=4 --h_k=2 --k=512 --d=128 --verify --remap)
set(TEST_GEN_CACHEONLY --b=2 --h=4 --h_k=2 --k=512 --d=128 --verify --cache-only)
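# The targets below are only generated on non-Windows, non-Clang builds when CUTLASS is
# configured for the Blackwell architecture (CUTLASS_NVCC_ARCHS matching 100a).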
if(NOT WIN32 AND (NOT (CMAKE_CXX_COMPILER_ID MATCHES "Clang")))
if (CUTLASS_NVCC_ARCHS MATCHES 100a)
cutlass_example_add_executable(
77_blackwell_fmha_fp8
77_blackwell_fmha.cu
TEST_COMMAND_OPTIONS
TEST_BASIC
# TEST_CAUSAL
# TEST_VARLEN
# TEST_HDIM64
# TEST_GQA)
)
target_include_directories(77_blackwell_fmha_fp8 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_compile_definitions(77_blackwell_fmha_fp8 PRIVATE FP8)
cutlass_example_add_executable(
77_blackwell_fmha_gen_fp8
77_blackwell_fmha_gen.cu
TEST_COMMAND_OPTIONS
TEST_GEN_BASIC
# TEST_GEN_VARLEN
# TEST_GEN_HDIM64
# TEST_GEN_GQA
# TEST_GEN_REMAP
# TEST_GEN_CACHEONLY)
)
target_include_directories(77_blackwell_fmha_gen_fp8 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_compile_definitions(77_blackwell_fmha_gen_fp8 PRIVATE FP8)
cutlass_example_add_executable(
77_blackwell_fmha_fp16
77_blackwell_fmha.cu
TEST_COMMAND_OPTIONS
TEST_BASIC
# TEST_CAUSAL
# TEST_VARLEN
# TEST_HDIM64
# TEST_GQA)
)
target_include_directories(77_blackwell_fmha_fp16 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
cutlass_example_add_executable(
77_blackwell_fmha_gen_fp16
77_blackwell_fmha_gen.cu
TEST_COMMAND_OPTIONS
TEST_GEN_BASIC
# TEST_GEN_VARLEN
# TEST_GEN_HDIM64
# TEST_GEN_GQA
# TEST_GEN_REMAP
# TEST_GEN_CACHEONLY)
)
target_include_directories(77_blackwell_fmha_gen_fp16 PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
endif()
endif()

Some files were not shown because too many files have changed in this diff.