Compare commits

...

78 Commits

Author SHA1 Message Date
1ab1027954 Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100)
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
2020-06-15 10:47:01 -07:00
86931fef85 CUTLASS 2.2 (#96)
Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
2020-06-08 16:17:35 -07:00
e33d90b361 update tools/library/CMakeLists to require python 3.6 according to #70 (#82)
#70 only updates the documentation. This commit reflects the Python version bump in the CMake configuration as well.
2020-04-08 10:54:36 -07:00
96dab34ad9 CUTLASS 2.1 (#83)
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
2020-04-07 13:51:25 -07:00
7c0cd26d13 Need Python 3.6 to use enum.auto() (#70) 2019-11-22 09:39:12 -08:00
45ecbc885b Removed redundant conjugation operations from matrix_traits. (#65) 2019-11-20 11:27:13 -08:00
8aca98f9a7 Improved formatting, clarity, and content of several documents. (#64)
* Improved formatting, clarity, and content of several documents.
2019-11-20 10:42:15 -08:00
f4d9c8f755 Clang GPU compilation requires explicit CUDACC version flags (#63) 2019-11-20 09:52:11 -08:00
fb335f6a5f CUTLASS 2.0 (#62)
CUTLASS 2.0

Substantially refactored for

- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
2019-11-19 16:55:34 -08:00
b5cab177a9 Performance enhancement for Volta Tensor Cores TN layout (#53)
* Fixed performance defect with indirect access to pointer array for Volta TensorCores TN arrangement.

* Updated patch version and changelog.

* Updated patch version and changelog.

* Added link to changelog in readme.

* Fixed markdown link
2019-07-10 10:54:12 -07:00
eb41735933 Merge pull request #47 from Artem-B/cutlass-1.3-clang
Make CUTLASS compilable with Clang.
2019-05-13 10:52:45 -07:00
fb8b3a98b7 Addressed code review comments. 2019-05-10 10:24:52 -07:00
d9d357877f Added missing file (#48) 2019-05-09 14:07:52 -07:00
e18292db46 Make CUTLASS compilable with Clang.
Requires a recent clang build (r359248 or newer).

Enable compilation with clang with these options:
cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=/path/to/clang++
2019-05-02 11:00:22 -07:00
fe3438a3c1 cutlass 1.3.1 (#46)
The CUTLASS 1.3.1 patch resolves failing tests with NVRTC.
2019-04-19 16:54:52 -07:00
877bdcace6 Cutlass 1.3 Release (#42)
CUTLASS 1.3 Release
- Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.
2019-03-20 10:49:17 -07:00
19a9d64e3c Removed patch version from README.
Removed patch version from README.
2018-12-19 15:20:43 -08:00
80e6f7c860 Merge pull request #38 from NVIDIA/resolve_maxwell
Resolved issue for incorrect SGEMM on Maxwell architecture.
2018-12-19 15:17:41 -08:00
822b0952cd Resolved issue for incorrect SGEMM on Maxwell architecture. 2018-12-19 15:07:16 -08:00
ed2ed4d667 Merge pull request #33 from NVIDIA/cutlass_1.2
CUTLASS 1.2
2018-10-26 14:59:50 -07:00
4db423c40f Minor edit to CHANGELOG. 2018-10-26 14:58:31 -07:00
b2bc0d3b79 Updating Doxygen docs 2018-10-26 14:54:58 -07:00
74df0331f2 CUTLASS 1.2 2018-10-26 14:38:46 -07:00
2332df492e Merge pull request #30 from NVIDIA/fix_utilities_example
Fixed cutlass_utilities example.
2018-09-29 15:09:18 -07:00
cfe4b933ef CUDA 9 lacks host-side conversions from float=>half. Instead, we must reinterpret_cast<> from cutlass::half_t => half. 2018-09-29 15:04:20 -07:00
6877595a5e Merge pull request #28 from NVIDIA/cutlass_1.1
Fixed typo
2018-09-28 12:59:49 -07:00
69e3709da4 Fixed typo
Fixed typo
2018-09-28 12:59:20 -07:00
d419094c28 Merge pull request #26 from NVIDIA/cutlass_1.1
Clarification to README
2018-09-21 11:44:47 -07:00
1a7ac522f8 Clarification to README 2018-09-20 11:04:03 -07:00
bf6eec53eb Merge pull request #25 from NVIDIA/cutlass_1.1
Updated CUTLASS.md
2018-09-19 21:33:04 -07:00
206e38dac5 Updated copyright of CUTLASS.md 2018-09-19 21:31:12 -07:00
d85f6a1cec Merge pull request #24 from NVIDIA/cutlass_1.1
Cutlass 1.1
2018-09-19 21:16:53 -07:00
0826572c4c Reduced range of random values to avoid bit-level inconsistencies for large matrices. 2018-09-19 21:11:48 -07:00
77d1e0ca81 Updated README and CHANGELOG. 2018-09-19 20:42:51 -07:00
d7137f9c0a Updated doxygen 2018-09-19 14:02:08 -07:00
461f417b9d Checkpointing CUTLASS 1.1 release. 2018-09-18 16:58:03 -07:00
cf0301e00f Merge pull request #15 from NVIDIA/release_1.0.1_edits
Minor edits to README and changelog pursuant to the CUTLASS 1.0.1 patch.
2018-06-26 13:59:01 -07:00
b9bb0d1a49 Edits to README and changelog pursuant to the CUTLASS 1.0.1 patch. 2018-06-26 13:57:39 -07:00
e1c4ba501b Merge pull request #13 from NVIDIA/cutlass_v1.0.1
Cutlass v1.0.1
2018-06-12 08:25:56 -07:00
c566e83e6d Updated changelog. 2018-06-11 14:54:07 -07:00
374882be53 Replaced GoogleTest copy with submodule. Added updates to support intra-threadblock reductions. Added tests for same. 2018-06-11 11:47:15 -07:00
2c496c3e9e Replaced GoogleTest copy with Git submodule. 2018-06-11 11:32:41 -07:00
9fd55460c6 Merge pull request #10 from NVIDIA/cutlass_v1.0_rel
Minor updates to usage and README.
2018-05-18 12:27:31 -07:00
480732c2e8 Minor updates to usage and readme. 2018-05-17 15:10:55 -07:00
68aaee8773 Merge pull request #9 from NVIDIA/cutlass_v1.0_rel
Updated URL to Doxygen and modified usage statement
2018-05-17 11:12:37 -07:00
acb90e962a Updated url to Doxygen and modified usage statement in performance test program. 2018-05-17 11:11:05 -07:00
96bc3f227f Merge pull request #8 from NVIDIA/cutlass_v1.0_rel
Configured Github Pages
2018-05-16 15:26:55 -07:00
25ff282403 Moved Doxygen documents. 2018-05-16 15:25:24 -07:00
9d5726a568 Set theme jekyll-theme-minimal 2018-05-16 13:49:06 -07:00
6f0d271d8d CUTLASS v1.0
CUTLASS v1.0 released.
2018-05-16 13:47:13 -07:00
923dfb42ce Updated README.md 2018-05-16 12:50:10 -07:00
6f6f269a0a Updated README.md 2018-05-16 12:47:07 -07:00
2028ebe120 CUTLASS v1.0 release 2018-05-16 11:44:56 -07:00
84377249a1 Merge pull request #2 from Artem-B/clang-fixes
Merging "Clang fixes" into master.
2018-01-04 15:52:53 -08:00
901287175f Merge branch 'Artem-B-clang-fixes' 2018-01-04 15:46:08 -08:00
1c9b54df16 Whitespace fix. 2018-01-03 16:42:51 -08:00
39616514d0 Reworked CUDA_LOG macro to print location & the message with one printf.
This relies on the fact that clang allows using device-side features
in __host__ __device__ functions called from __host__ ones, as long as we
don't have to generate code for them. Wrapping threadIdx/blockIdx in a
__host__ __device__ function allows using CUDA_LOG everywhere during
both host and device compilation.
2018-01-03 16:36:50 -08:00
df4b4e4bb6 Added _cuda_ to the name of the executable to indicate that it's not clang's version. 2017-12-11 16:34:10 -08:00
81957b3a3d Force inlining of a few functions that rely on it for performance.
Clang is less aggressive than nvcc, so a number of functions did not get
inlined into the kernel by default. That prevented SROA from eliminating
loads/stores to temporary buffers and resulted in abysmal performance.

Replaced inline with __forceinline__ to ensure that we do inline the
functions necessary for optimal performance.
2017-12-11 14:52:30 -08:00
ce2b3f695d Fixed debug macros for clang.
Unlike nvcc, clang always sees both host and device-side code during
compilation. The CUDA_LOG macro is used in both host and device code, so when it
expanded to contain device-only code, errors resulted when it was used
from host-side functions.

In order to make CUDA_LOG work with clang, it was split into two parts -- a pair
of target-attribute-based overloaded functions that perform the host- or
device-specific parts of logging, and a printf which works on both sides.
2017-12-11 14:52:30 -08:00
e9e7cd4d44 Make cutlass compilable with clang.
E.g:
PATH=/nvcc/path/bin:/clang/path/bin:$PATH make sm=35,60 compiler=clang all
2017-12-11 14:52:30 -08:00
95b0578d34 Update license info 2017-12-06 10:00:59 -05:00
f4b48c7669 Update README.md 2017-12-05 22:58:46 -05:00
6cb88d53eb Update README.md 2017-12-05 22:58:12 -05:00
537a4bcedf Update README.md 2017-12-05 22:54:49 -05:00
5bd3f09312 Update README.md 2017-12-05 22:53:11 -05:00
6f091f5620 Update README.md 2017-12-05 22:44:01 -05:00
0428c89fd5 Updating readme with relative per chart 2017-12-05 22:40:47 -05:00
e2bf51c3fe Update README.md 2017-12-05 22:25:42 -05:00
57747e382e Update README.md 2017-12-05 21:32:06 -05:00
dd4dd4cebf Update README.md 2017-12-05 20:58:01 -05:00
6565b48747 Update README.md 2017-12-05 20:56:49 -05:00
73211bbb88 Update README.md 2017-12-05 20:55:54 -05:00
9dcb2b4c7d Update README.md 2017-12-05 20:55:03 -05:00
f30abfc00a Update README.md 2017-12-05 20:50:15 -05:00
8ebd6b06d0 Replace svg with png+text 2017-12-05 20:20:25 -05:00
04ffa156e8 Adding figure to readme.md 2017-12-05 20:15:33 -05:00
24d0ba65c5 Update code formatting 2017-12-05 15:51:01 -05:00
4308 changed files with 710588 additions and 10921 deletions

.gitmodules (vendored, new file)

CHANGELOG.md (new file, 127 lines)

@@ -0,0 +1,127 @@
# NVIDIA CUTLASS Changelog
# CUTLASS 2.x
## [2.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.2.0) (2020-06-08)
* [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
* Fast Tensor Core operations:
* Maximum performance via [`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends)
* Tensor Float 32, BFloat16, and double-precision data types
* Mixed integer data types (int8, int4, bin1)
* Asynchronous copy for deep software pipelines via [`cp.async`](https://docs.nvidia.com/cuda/parallel-thread-execution)
* Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745) (free registration required)
* Features:
* SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
* Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
* Gaussian complex GEMMs using 3m complex multiply algorithm
* Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
* Policy updates:
* [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) needed to enable NVIDIA Ampere Architecture features
* Disabled F16C by default for compatibility - enable on cmake command line with `-DCUTLASS_ENABLE_F16C=ON`
## [2.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.1.0) (2020-04-06)
* BLAS-style host-side API added to [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
* API to launch compiled kernel instances for GEMM and planar complex GEMM
* Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
* Computes complex matrix products on matrices stored as disjoint real and imaginary parts
* [SDK Examples of Planar Complex GEMMs](/examples/10_planar_complex/planar_complex.cu)
* Minor enhancements and bug fixes
## [2.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.0.0) (2019-11-19)
* Substantially refactored for
* Better performance, particularly for native Turing Tensor Cores
* Robust and durable templates spanning the design space
* Encapsulated functionality embodying modern C++11 programming techniques
* Optimized containers and data types for efficient, generic, portable device code
* Updates to:
* [Quick start guide](/media/docs/quickstart.md)
* [Documentation](/README.md#documentation)
* [Utilities](/media/docs/utilities.md)
* [CUTLASS Profiler](/media/docs/profiler.md)
* Native Turing Tensor Cores
* Efficient GEMM kernels targeting Turing Tensor Cores
* Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
* Coverage of existing CUTLASS functionality
* GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
* Volta Tensor Cores through native mma.sync and through WMMA API
* Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
* Batched GEMM operations
* Complex-valued GEMMs
* **Note: a host compiler supporting C++11 or greater is required.**
# CUTLASS 1.x
## [1.3.2](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.2) (2019-07-09)
* Performance improvement for Volta Tensor Cores TN and TT layouts.
## [1.3.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.1) (2019-04-09)
* Corrected NVRTC unit tests.
## [1.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.0) (2019-03-20)
* Efficient GEMM kernel targeting Volta Tensor Cores via `mma.sync` instruction added in CUDA 10.1.
## [1.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.2.0) (2018-10-26)
* Parallelized reductions across threadblocks ("Split-K")
* Improved IGEMM performance
* Batched strided WMMA GEMMs
## [1.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.1.0) (2018-09-19)
* Turing Features
* WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit
* Batched Strided GEMM
* Threadblock rasterization strategies
* Improved performance for adverse problem sizes and data layouts
* Extended CUTLASS Core components
* Tensor views support arbitrary matrix and tensor layouts
* Zip iterators for structuring multiple data streams
* Enhanced CUTLASS utilities
* Reference code for tensor operations in host and device code
* Added HostMatrix<> for simplified matrix creation
* Examples
* Basic GEMM, tensor views, CUTLASS utilities, batched GEMM, WMMA GEMM
## [1.0.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.0.1) (2018-06-11)
* Intra-threadblock reduction added for small threadblock tile sizes
* sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
* igemm_32x32x128
* GEMM _K_ residue handled during prologue prior to mainloop
* Replaced Google Test copy with submodule. Use `git submodule update --init --recursive`
## [1.0.0](https://github.com/NVIDIA/cutlass/commit/2028ebe120aab22bfd0b2baf8902d4c9627eb33f) (2018-05-16)
* Substantial rewrite to accommodate new architecture
* Kernels: SGEMM, DGEMM, IGEMM, HGEMM, WMMA GEMM
* Unit and performance tests
## [0.0.1](https://github.com/NVIDIA/cutlass/commit/d08ba8ac46e2fa3f745e070c390182edb56b2e91) (2017-12-04)
* Initial release
## Copyright
Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
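The 2.2 policy note above mentions that F16C is disabled by default and that NVIDIA Ampere features need the CUDA 11 Toolkit. Below is a hedged sketch of how a superproject embedding CUTLASS via `add_subdirectory` might opt in; the `cutlass` checkout path is a placeholder, and the cache options are the ones defined in the CMakeLists.txt shown below.

```
# Sketch only: set CUTLASS cache options before pulling in its build.
set(CUTLASS_ENABLE_F16C ON CACHE BOOL "Enable F16C x86 extensions in host code.")
set(CUTLASS_NVCC_ARCHS 80 CACHE STRING "The SM architectures requested.")  # e.g. NVIDIA Ampere
add_subdirectory(cutlass)  # assumed checkout location of the CUTLASS repository
```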

CMakeLists.txt (new executable file, 488 lines)

@@ -0,0 +1,488 @@
# Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without modification, are permitted
# provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright notice, this list of
# conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice, this list of
# conditions and the following disclaimer in the documentation and/or other materials
# provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
# to endorse or promote products derived from this software without specific prior written
# permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
# FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
cmake_minimum_required(VERSION 3.12.4 FATAL_ERROR)
if(cutlass_LOADED)
# If CUTLASS has been previously fetched and loaded, don't do it again.
return()
else()
set(cutlass_LOADED ON)
set(CUTLASS_DIR ${CMAKE_CURRENT_SOURCE_DIR} CACHE PATH "CUTLASS Repository Directory")
endif()
message(STATUS "CMake Version: ${CMAKE_VERSION}")
project(CUTLASS VERSION 2.2.0 LANGUAGES CXX)
include(${CMAKE_CURRENT_SOURCE_DIR}/CUDA.cmake)
find_package(Doxygen QUIET)
#
# CUTLASS 2.x requires C++11
#
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
if(CUTLASS_NATIVE_CUDA)
set(CMAKE_CUDA_STANDARD 11)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
else()
list(APPEND CUTLASS_CUDA_NVCC_FLAGS --std=c++11)
endif()
if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
set(CMAKE_INSTALL_PREFIX install CACHE PATH "Default installation location." FORCE)
endif()
message(STATUS "Default Install Location: ${CMAKE_INSTALL_PREFIX}")
set(CUTLASS_ENABLE_HEADERS_ONLY OFF CACHE BOOL "Enable only the header library")
if(CUTLASS_ENABLE_HEADERS_ONLY)
set(CUTLASS_ENABLE_EXAMPLES_INIT OFF)
set(CUTLASS_ENABLE_TOOLS_INIT OFF)
else()
set(CUTLASS_ENABLE_EXAMPLES_INIT ON)
set(CUTLASS_ENABLE_TOOLS_INIT ON)
endif()
set(CUTLASS_ENABLE_EXAMPLES ${CUTLASS_ENABLE_EXAMPLES_INIT} CACHE BOOL "Enable CUTLASS Examples")
set(CUTLASS_ENABLE_TOOLS ${CUTLASS_ENABLE_TOOLS_INIT} CACHE BOOL "Enable CUTLASS Tools")
if(${CMAKE_PROJECT_NAME} STREQUAL ${PROJECT_NAME})
set(CUTLASS_ENABLE_TESTS_INIT ${CUTLASS_ENABLE_TOOLS_INIT})
else()
set(CUTLASS_ENABLE_TESTS_INIT OFF)
endif()
set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests")
if (CUTLASS_ENABLE_TESTS)
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/googletest.cmake)
endif()
set(CUTLASS_NVCC_ARCHS_SUPPORTED "")
if (NOT CUDA_VERSION VERSION_LESS 7.5)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 53)
endif()
if (NOT CUDA_VERSION VERSION_LESS 8.0)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 60 61)
endif()
if (NOT CUDA_VERSION VERSION_LESS 9.0)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 70)
endif()
if (NOT CUDA_VERSION VERSION_LESS 9.2)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 72)
endif()
if (NOT CUDA_VERSION VERSION_LESS 10.0)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 75)
endif()
if (NOT CUDA_VERSION VERSION_LESS 11.0)
list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 80)
endif()
set(CUTLASS_NVCC_ARCHS ${CUTLASS_NVCC_ARCHS_SUPPORTED} CACHE STRING "The SM architectures requested.")
set(CUTLASS_NVCC_ARCHS_ENABLED ${CUTLASS_NVCC_ARCHS} CACHE STRING "The SM architectures to build code for.")
# Special policy introduced in CMake 3.13
if (POLICY CMP0076)
cmake_policy(SET CMP0076 NEW)
endif()
if( NOT CMAKE_SIZEOF_VOID_P EQUAL 8 )
message(FATAL_ERROR "CUTLASS requires a 64-bit compiler!")
endif()
include(GNUInstallDirs)
link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs)
###################################################################################################
#
# Configure CMake variables
#
###################################################################################################
message(STATUS "CUDA Compilation Architectures: ${CUTLASS_NVCC_ARCHS_ENABLED}")
if (NOT (CMAKE_BUILD_TYPE OR CONFIGURATION_TYPES))
# By default we want to build in Release mode to ensure that we're getting best performance.
set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose build level" FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "RelWithDebInfo" "Release")
endif()
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CUTLASS_LIBRARY_DEBUG_POSTFIX ".debug" CACHE STRING "Default postfix value for debug libraries")
if(WIN32)
# On Windows we link against the shared (DLL) runtime. Change gtest settings to match this.
set(gtest_force_shared_crt ON CACHE BOOL "Use shared (DLL) run-time lib even when Google Test is built as static lib" FORCE)
endif()
if (WIN32)
# Enable more warnings and treat as errors
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/W3 -Xcompiler=/WX)
# Disable warning on Unicode characters
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/wd4819)
# Disable excess x86 floating point precision that can lead to results being labeled incorrectly
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/fp:strict)
endif(WIN32)
if (${CUTLASS_NVCC_VERBOSE})
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -v)
endif()
set(CUTLASS_NVCC_EMBED_CUBIN ON CACHE BOOL "Embed compiled CUDA kernel binaries into executables.")
set(CUTLASS_NVCC_EMBED_PTX ON CACHE BOOL "Embed compiled PTX into executables.")
set(CUTLASS_NVCC_KEEP OFF CACHE BOOL "Keep intermediate files generated by NVCC.")
set(CUTLASS_ENABLE_F16C OFF CACHE BOOL "Enable F16C x86 extensions in host code.")
#
# CUTLASS generator cmake configuration
#
set(CUTLASS_LIBRARY_OPERATIONS "all" CACHE STRING "Comma delimited list of operation name filters. Default '' means all operations are enabled.")
set(CUTLASS_LIBRARY_KERNELS "" CACHE STRING "Comma delimited list of kernel name filters. If unspecified, only the largest tile size is enabled. If 'all' is specified, all kernels are enabled.")
# Test Levels L0, L1, L2
set(CUTLASS_TEST_LEVEL "0" CACHE STRING "Level of tests to compile.")
set_property(CACHE CUTLASS_TEST_LEVEL PROPERTY STRINGS 0 1 2)
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_TEST_LEVEL=${CUTLASS_TEST_LEVEL})
#
# CUDA 10.1 introduces "mma" in PTX performing collective matrix multiply operations.
#
if (CUDA_VERSION VERSION_LESS 10.1)
set(CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT OFF)
else()
set(CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT ON)
endif()
set(CUTLASS_ENABLE_TENSOR_CORE_MMA ${CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT} CACHE BOOL
"Enable PTX mma instruction for collective matrix multiply operations.")
#
# NOTE: running with asan and CUDA requires the following environment variable:
#
# ASAN_OPTIONS=protect_shadow_gap=0:replace_intrin=0:detect_leaks=0
#
# without the above environment setting, an error like the following may be generated:
#
# *** Error: Could not detect active GPU device ID [out of memory]
# ...
# ==9149==ERROR: LeakSanitizer: detected memory leaks
# ...
#
if(ENABLE_ASAN) # https://github.com/google/sanitizers/wiki/AddressSanitizer
list(APPEND CUTLASS_CUDA_NVCC_FLAGS --compiler-options=-fsanitize=address --compiler-options=-fno-omit-frame-pointer)
string(APPEND CMAKE_EXE_LINKER_FLAGS " -fsanitize=address")
endif()
###################################################################################################
#
# Configure CUDA build options
#
###################################################################################################
if(CUTLASS_NVCC_EMBED_PTX)
list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-include-ptx=all)
endif()
if (CUTLASS_ENABLE_TENSOR_CORE_MMA)
list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1)
endif()
if (NOT MSVC AND CUTLASS_NVCC_KEEP)
# MSVC flow handles caching already, but for other generators we handle it here.
set(CUTLASS_NVCC_KEEP_DIR ${CMAKE_CURRENT_BINARY_DIR}/tmp CACHE PATH "Location to store NVCC scratch files")
file(MAKE_DIRECTORY ${CUTLASS_NVCC_KEEP_DIR})
list(APPEND CUTLASS_CUDA_NVCC_FLAGS --keep) # --keep-dir may not work with nvcc for some directories.
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -save-temps=${CUTLASS_NVCC_KEEP_DIR})
endif()
if (CUTLASS_ENABLE_F16C AND NOT CMAKE_CROSSCOMPILING)
list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_F16C=1)
if ((CMAKE_CXX_COMPILER_ID MATCHES "GNU") OR (CMAKE_CXX_COMPILER_ID MATCHES "Clang"))
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=-mf16c)
elseif((CMAKE_CXX_COMPILER_ID MATCHES "MSVC"))
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/arch:AVX2)
endif()
endif()
list(APPEND CUTLASS_CUDA_NVCC_FLAGS $<$<BOOL:${UNIX}>:-Xcompiler=-Wconversion>)
list(APPEND CUTLASS_CUDA_NVCC_FLAGS $<$<BOOL:${UNIX}>:-Xcompiler=-fno-strict-aliasing>)
# Don't leak lineinfo in release builds
if (NOT CMAKE_BUILD_TYPE MATCHES "Release")
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -gmlt)
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -lineinfo)
endif()
if(CUDA_COMPILER MATCHES "[Cc]lang")
if( NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang" )
message(FATAL_ERROR "Clang CUDA compilation requires Clang CXX compilation. Currently CMAKE_CXX_COMPILER is ${CMAKE_CXX_COMPILER_ID}" )
endif()
if (CMAKE_CXX_COMPILER_VERSION VERSION_LESS 7.0)
message(FATAL_ERROR "Clang 7.0+ required for GPU compilation")
endif()
list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-path=${CUDA_TOOLKIT_ROOT_DIR})
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -mllvm -pragma-unroll-threshold=100000)
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -mllvm -unroll-threshold=5000)
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -Wno-unused-command-line-argument)
string(REPLACE "." ";" CUDA_VERSION_PARTS ${CMAKE_CUDA_COMPILER_VERSION})
list(GET CUDA_VERSION_PARTS 0 CUDA_VERSION_MAJOR)
list(GET CUDA_VERSION_PARTS 1 CUDA_VERSION_MINOR)
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -D__CUDACC_VER_MAJOR__=${CUDA_VERSION_MAJOR} -D__CUDACC_VER_MINOR__=${CUDA_VERSION_MINOR})
# needed for libcublasLt.so in case it's installed in the same location as libcudart.so
# dynamic linker can find it if linker sets RPATH (forced by --disable-new-dtags)
# Otherwise linker uses RUNPATH and that does not propagate to loaded libs.
list(APPEND CUTLASS_CUDA_CLANG_FLAGS -Wl,--disable-new-dtags)
link_libraries(nvidia::cudart)
endif()
function(cutlass_apply_cuda_gencode_flags TARGET)
set(NVCC_FLAGS)
set(CLANG_FLAGS)
foreach(ARCH ${CUTLASS_NVCC_ARCHS_ENABLED})
list(APPEND CLANG_FLAGS --cuda-gpu-arch=sm_${ARCH})
set(CODES)
if(CUTLASS_NVCC_EMBED_CUBIN)
list(APPEND CODES sm_${ARCH})
endif()
if(CUTLASS_NVCC_EMBED_PTX)
list(APPEND CODES compute_${ARCH})
endif()
list(JOIN CODES "," CODES_STR)
list(APPEND NVCC_FLAGS -gencode=arch=compute_${ARCH},code=[${CODES_STR}])
endforeach()
if (CUDA_COMPILER MATCHES "[Cc]lang")
target_compile_options(
${TARGET}
PRIVATE
$<$<COMPILE_LANGUAGE:CXX>:${CLANG_FLAGS}>
)
else()
target_compile_options(
${TARGET}
PRIVATE
$<$<COMPILE_LANGUAGE:CUDA>:${NVCC_FLAGS}>
)
endif()
endfunction()
function(cutlass_apply_standard_compile_options TARGET)
if(CUDA_COMPILER MATCHES "[Cc]lang")
set(CUDA_COMPILE_LANGUAGE CXX)
set(_FLAGS ${CUTLASS_CUDA_FLAGS} ${CUTLASS_CUDA_CLANG_FLAGS})
set(_FLAGS_RELEASE ${CUTLASS_CUDA_FLAGS_RELEASE} ${CUTLASS_CUDA_CLANG_FLAGS_RELEASE})
set(_FLAGS_RELWITHDEBINFO ${CUTLASS_CUDA_FLAGS_RELWITHDEBINFO} ${CUTLASS_CUDA_CLANG_FLAGS_RELWITHDEBINFO})
set(_FLAGS_DEBUG ${CUTLASS_CUDA_FLAGS_DEBUG} ${CUTLASS_CUDA_CLANG_FLAGS_DEBUG})
else()
set(CUDA_COMPILE_LANGUAGE CUDA)
set(_FLAGS ${CUTLASS_CUDA_FLAGS} ${CUTLASS_CUDA_NVCC_FLAGS})
set(_FLAGS_RELEASE ${CUTLASS_CUDA_FLAGS_RELEASE} ${CUTLASS_CUDA_NVCC_FLAGS_RELEASE})
set(_FLAGS_RELWITHDEBINFO ${CUTLASS_CUDA_FLAGS_RELWITHDEBINFO} ${CUTLASS_CUDA_NVCC_FLAGS_RELWITHDEBINFO})
set(_FLAGS_DEBUG ${CUTLASS_CUDA_FLAGS_DEBUG} ${CUTLASS_CUDA_NVCC_FLAGS_DEBUG})
endif()
target_compile_options(
${TARGET}
PRIVATE
$<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:${_FLAGS}>
$<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:RELEASE>:${_FLAGS_RELEASE}>>
$<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:RELWITHDEBINFO>:${_FLAGS_RELWITHDEBINFO}>>
$<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:DEBUG>:${_FLAGS_DEBUG}>>
)
endfunction()
#
# The following items should eventually be pushed into cutlass/CMakeLists.txt
#
# GLOB for CUTLASS header files. Should we use a static list instead?
file(GLOB_RECURSE CUTLASS_INCLUDE RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} include/cutlass/*.h)
file(GLOB_RECURSE CUTLASS_CUTLASS RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}/include include/cutlass/*.h)
file(GLOB_RECURSE CUTLASS_NVRTC RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}/test test/unit/nvrtc/kernel/*.h)
###################################################################################################
#
# Define build targets
#
###################################################################################################
source_group(TREE ${CMAKE_CURRENT_SOURCE_DIR}/include REGULAR_EXPRESSION ".*\.h")
add_library(CUTLASS INTERFACE)
add_library(nvidia::cutlass::cutlass ALIAS CUTLASS)
set_target_properties(CUTLASS PROPERTIES EXPORT_NAME cutlass)
set(CUTLASS_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/include CACHE PATH "CUTLASS Header Library")
set(CUTLASS_GENERATOR_DIR ${CMAKE_CURRENT_SOURCE_DIR}/tools/library/)
# The following utility directory is needed even if the tools build is disabled, so it exists here.
set(CUTLASS_TOOLS_UTIL_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/tools/util/include CACHE INTERNAL "")
include_directories(${CUTLASS_INCLUDE_DIR})
target_compile_features(CUTLASS INTERFACE cxx_std_11)
if (NOT DEFINED CUTLASS_REVISION)
find_package(Git QUIET)
execute_process(
COMMAND ${GIT_EXECUTABLE} rev-parse --short HEAD
RESULT_VARIABLE CUTLASS_REVISION_RESULT
OUTPUT_VARIABLE CUTLASS_REVISION
OUTPUT_STRIP_TRAILING_WHITESPACE
)
if (CUTLASS_REVISION_RESULT)
message(STATUS "CUTLASS Revision: Unable to detect, Git returned code ${CUTLASS_REVISION_RESULT}.")
else()
message(STATUS "CUTLASS Revision: ${CUTLASS_REVISION}")
endif()
endif()
configure_file(
${CMAKE_CURRENT_SOURCE_DIR}/cmake/version.h.in
${CMAKE_CURRENT_BINARY_DIR}/include/cutlass/version.h
@ONLY)
target_include_directories(
CUTLASS
INTERFACE
$<INSTALL_INTERFACE:include>
$<BUILD_INTERFACE:${CUTLASS_INCLUDE_DIR}>
$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/include>
$<BUILD_INTERFACE:${CUDA_TOOLKIT_ROOT_DIR}/include>
)
install(
DIRECTORY
${CUTLASS_INCLUDE_DIR}/
${CMAKE_CURRENT_BINARY_DIR}/include/
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
install(
TARGETS CUTLASS
EXPORT NvidiaCutlass
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
################################################################################
# Doxygen is available. Generate documentation
if (DOXYGEN_FOUND)
# DOT is available. Enable graph generation in the documentation
if (DOXYGEN_DOT_EXECUTABLE)
set(CUTLASS_ENABLE_DOXYGEN_DOT ON CACHE BOOL "Use dot to generate graphs in the doxygen documentation.")
else()
set(CUTLASS_ENABLE_DOXYGEN_DOT OFF CACHE BOOL "Use dot to generate graphs in the doxygen documentation." FORCE)
endif()
if (CUTLASS_ENABLE_DOXYGEN_DOT)
set(HAVE_DOT "YES")
else()
set(HAVE_DOT "NO")
endif()
# Add custom target for Doxygen.
add_custom_target(cutlass_docs ${CMAKE_COMMAND} -E env
"DOT_PATH=${DOXYGEN_DOT_EXECUTABLE}"
"HAVE_DOT=${HAVE_DOT}"
${DOXYGEN_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/Doxyfile
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
VERBATIM
)
endif()
if(NOT WIN32)
# Add common library search paths so executables and libraries can load and run
# without LD_LIBRARY_PATH being set.
link_libraries(
"-Wl,-rpath,'$ORIGIN'"
"-Wl,-rpath,'$ORIGIN/../lib64'"
"-Wl,-rpath,'$ORIGIN/../lib'"
"-Wl,-rpath,'${CUDA_TOOLKIT_ROOT_DIR}/lib64'"
"-Wl,-rpath,'${CUDA_TOOLKIT_ROOT_DIR}/lib'"
)
endif()
################################################################################
include(${CMAKE_CURRENT_SOURCE_DIR}/cuBLAS.cmake)
if (CUTLASS_ENABLE_CUBLAS)
target_compile_definitions(CUTLASS INTERFACE CUTLASS_ENABLE_CUBLAS=1)
endif()
################################################################################
if(CUTLASS_ENABLE_TOOLS)
add_subdirectory(tools)
endif()
if(CUTLASS_ENABLE_EXAMPLES)
add_subdirectory(examples)
endif()
if(CUTLASS_ENABLE_TESTS)
include(CTest)
enable_testing()
add_subdirectory(test)
endif()
################################################################################
install(
FILES ${CMAKE_CURRENT_SOURCE_DIR}/cmake/NvidiaCutlassConfig.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/
)
install(
EXPORT NvidiaCutlass
NAMESPACE nvidia::cutlass::
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/
FILE NvidiaCutlassTargets.cmake
)
################################################################################
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/NvidiaCutlassPackageConfig.cmake)
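The install and export rules above publish a CMake package named `NvidiaCutlass` whose targets live in the `nvidia::cutlass::` namespace. A hedged sketch of a downstream project consuming the installed package; the project name, source file, and prefix hint are placeholders, not taken from the repository.

```
# Sketch only: consume the installed NvidiaCutlass package.
cmake_minimum_required(VERSION 3.12.4)
project(my_app LANGUAGES CXX CUDA)

# May require CMAKE_PREFIX_PATH or NvidiaCutlass_DIR to point at the install's
# lib/cmake directory, depending on where CUTLASS was installed.
find_package(NvidiaCutlass REQUIRED)

add_executable(my_app main.cu)
target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
```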

CONTRIBUTORS.md (new file, 57 lines)

@ -0,0 +1,57 @@
![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS")
[README](/README.md#documentation) > **Contributors**
# CUTLASS Developers and Contributors
This is the official list of CUTLASS developers and contributors.
## DEVELOPERS
Andrew Kerr
Haicheng Wu
Manish Gupta
Dustyn Blasig
Pradeep Ramani
Naila Farooqui
Piotr Majcher
Paul Springer
Jin Wang
Scott Yokim
Markus Hohnerbach
Aditya Atluri
David Tanner
## CONTRIBUTORS
Timothy Costa
Julien Demouth
Brian Fahs
Michael Goldfarb
Mostafa Hagog
Fei Hu
Alan Kaatz
Tina Li
Timmy Liu
Duane Merrill
Kevin Siu
Markus Tavenrath
John Tran
Vicki Wang
Junkai Wu
Fung Xie
Albert Xu
Jack Yang
Xiuxia Zhang
Nick Zhao
## ACKNOWLEDGEMENTS
Girish Bharambe
Cris Cecka
Luke Durant
Olivier Giroux
Stephen Jones
Rishkul Kulkarni
Bryce Lelbach
Joel McCormack
Kyrylo Perelygin

CUDA.cmake (new file, 349 lines)

@@ -0,0 +1,349 @@
# Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without modification, are permitted
# provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright notice, this list of
# conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice, this list of
# conditions and the following disclaimer in the documentation and/or other materials
# provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
# to endorse or promote products derived from this software without specific prior written
# permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
# FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if(CUDA_COMPILER MATCHES "[Cc]lang")
set(CUTLASS_NATIVE_CUDA_INIT ON)
elseif(CMAKE_VERSION VERSION_LESS 3.12.4)
set(CUTLASS_NATIVE_CUDA_INIT OFF)
else()
set(CUTLASS_NATIVE_CUDA_INIT ON)
endif()
set(CUTLASS_NATIVE_CUDA ${CUTLASS_NATIVE_CUDA_INIT} CACHE BOOL "Utilize the CMake native CUDA flow")
if(NOT DEFINED ENV{CUDACXX} AND NOT DEFINED ENV{CUDA_BIN_PATH} AND DEFINED ENV{CUDA_PATH})
# For backward compatibility, allow use of CUDA_PATH.
set(ENV{CUDACXX} $ENV{CUDA_PATH}/bin/nvcc)
endif()
if(CUTLASS_NATIVE_CUDA)
enable_language(CUDA)
if(NOT CUDA_VERSION)
set(CUDA_VERSION ${CMAKE_CUDA_COMPILER_VERSION})
endif()
if(NOT CUDA_TOOLKIT_ROOT_DIR)
get_filename_component(CUDA_TOOLKIT_ROOT_DIR "${CMAKE_CUDA_COMPILER}/../.." ABSOLUTE)
endif()
else()
find_package(CUDA REQUIRED)
# We workaround missing variables with the native flow by also finding the CUDA toolkit the old way.
if(NOT CMAKE_CUDA_COMPILER_VERSION)
set(CMAKE_CUDA_COMPILER_VERSION ${CUDA_VERSION})
endif()
endif()
if (CUDA_VERSION VERSION_LESS 9.2)
message(FATAL_ERROR "CUDA 9.2+ Required, Found ${CUDA_VERSION}.")
endif()
if(NOT CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "[Cc]lang")
set(CMAKE_CUDA_COMPILER ${CUDA_TOOLKIT_ROOT_DIR}/bin/nvcc)
message(STATUS "CUDA Compiler: ${CMAKE_CUDA_COMPILER}")
endif()
find_library(
CUDART_LIBRARY cudart
PATHS
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib/x64
lib64
lib
NO_DEFAULT_PATH
# We aren't going to search any system paths. We want to find the runtime
# in the CUDA toolkit we're building against.
)
if(NOT TARGET cudart AND CUDART_LIBRARY)
message(STATUS "CUDART: ${CUDART_LIBRARY}")
if(WIN32)
add_library(cudart STATIC IMPORTED GLOBAL)
# Even though we're linking against a .dll, in Windows you statically link against
# the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
# from the PATH search.
else()
add_library(cudart SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::cudart ALIAS cudart)
set_property(
TARGET cudart
PROPERTY IMPORTED_LOCATION
${CUDART_LIBRARY}
)
elseif(TARGET cudart)
message(STATUS "CUDART: Already Found")
else()
message(STATUS "CUDART: Not Found")
endif()
find_library(
CUDA_DRIVER_LIBRARY cuda
PATHS
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib/x64
lib64
lib
lib64/stubs
lib/stubs
NO_DEFAULT_PATH
# We aren't going to search any system paths. We want to find the runtime
# in the CUDA toolkit we're building against.
)
if(NOT TARGET cuda_driver AND CUDA_DRIVER_LIBRARY)
message(STATUS "CUDA Driver: ${CUDA_DRIVER_LIBRARY}")
if(WIN32)
add_library(cuda_driver STATIC IMPORTED GLOBAL)
# Even though we're linking against a .dll, in Windows you statically link against
# the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
# from the PATH search.
else()
add_library(cuda_driver SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::cuda_driver ALIAS cuda_driver)
set_property(
TARGET cuda_driver
PROPERTY IMPORTED_LOCATION
${CUDA_DRIVER_LIBRARY}
)
elseif(TARGET cuda_driver)
message(STATUS "CUDA Driver: Already Found")
else()
message(STATUS "CUDA Driver: Not Found")
endif()
find_library(
NVRTC_LIBRARY nvrtc
PATHS
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib/x64
lib64
lib
NO_DEFAULT_PATH
# We aren't going to search any system paths. We want to find the runtime
# in the CUDA toolkit we're building against.
)
if(NOT TARGET nvrtc AND NVRTC_LIBRARY)
message(STATUS "NVRTC: ${NVRTC_LIBRARY}")
if(WIN32)
add_library(nvrtc STATIC IMPORTED GLOBAL)
# Even though we're linking against a .dll, in Windows you statically link against
# the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
# from the PATH search.
else()
add_library(nvrtc SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::nvrtc ALIAS nvrtc)
set_property(
TARGET nvrtc
PROPERTY IMPORTED_LOCATION
${NVRTC_LIBRARY}
)
elseif(TARGET nvrtc)
message(STATUS "NVRTC: Already Found")
else()
message(STATUS "NVRTC: Not Found")
endif()
include_directories(SYSTEM ${CUDA_INCLUDE_DIRS})
# Some platforms (e.g. Visual Studio) don't add the CUDA include directories to the system include
# paths by default, so we add it explicitly here.
function(cutlass_correct_source_file_language_property)
if(CUDA_COMPILER MATCHES "clang")
foreach(File ${ARGN})
if(File MATCHES ".*\.cu$")
set_source_files_properties(${File} PROPERTIES LANGUAGE CXX)
endif()
endforeach()
endif()
endfunction()
set(CUTLASS_UNITY_BUILD_ENABLED OFF CACHE BOOL "Enable combined source compilation")
set(CUTLASS_UNITY_BUILD_BATCH_SIZE 16 CACHE STRING "Batch size for unified source files")
function(cutlass_unify_source_files TARGET_ARGS_VAR)
set(options)
set(oneValueArgs BATCH_SOURCES BATCH_SIZE)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
if (NOT DEFINED TARGET_ARGS_VAR)
message(FATAL_ERROR "TARGET_ARGS_VAR parameter is required")
endif()
if (__BATCH_SOURCES AND NOT DEFINED __BATCH_SIZE)
set(__BATCH_SIZE ${CUTLASS_UNITY_BUILD_BATCH_SIZE})
endif()
if (CUTLASS_UNITY_BUILD_ENABLED AND DEFINED __BATCH_SIZE AND __BATCH_SIZE GREATER 1)
set(CUDA_FILE_ARGS)
set(TARGET_SOURCE_ARGS)
foreach(ARG ${__UNPARSED_ARGUMENTS})
if(${ARG} MATCHES ".*\.cu$")
list(APPEND CUDA_FILE_ARGS ${ARG})
else()
list(APPEND TARGET_SOURCE_ARGS ${ARG})
endif()
endforeach()
list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
while(NUM_CUDA_FILE_ARGS GREATER 0)
list(SUBLIST CUDA_FILE_ARGS 0 ${__BATCH_SIZE} CUDA_FILE_BATCH)
string(SHA256 CUDA_FILE_BATCH_HASH "${CUDA_FILE_BATCH}")
string(SUBSTRING ${CUDA_FILE_BATCH_HASH} 0 12 CUDA_FILE_BATCH_HASH)
set(BATCH_FILE ${CMAKE_CURRENT_BINARY_DIR}/${NAME}.unity.${CUDA_FILE_BATCH_HASH}.cu)
message(STATUS "Generating ${BATCH_FILE}")
file(WRITE ${BATCH_FILE} "// Unity File - Auto Generated!\n")
foreach(CUDA_FILE ${CUDA_FILE_BATCH})
get_filename_component(CUDA_FILE_ABS_PATH ${CUDA_FILE} ABSOLUTE)
file(APPEND ${BATCH_FILE} "#include \"${CUDA_FILE_ABS_PATH}\"\n")
endforeach()
list(APPEND TARGET_SOURCE_ARGS ${BATCH_FILE})
if (NUM_CUDA_FILE_ARGS LESS_EQUAL __BATCH_SIZE)
break()
endif()
list(SUBLIST CUDA_FILE_ARGS ${__BATCH_SIZE} -1 CUDA_FILE_ARGS)
list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
endwhile()
else()
set(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
endif()
set(${TARGET_ARGS_VAR} ${TARGET_SOURCE_ARGS} PARENT_SCOPE)
endfunction()
function(cutlass_add_library NAME)
set(options)
set(oneValueArgs EXPORT_NAME)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
if(CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "clang")
cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
add_library(${NAME} ${TARGET_SOURCE_ARGS})
else()
set(CUDA_LINK_LIBRARIES_KEYWORD PRIVATE)
cuda_add_library(${NAME} ${TARGET_SOURCE_ARGS})
endif()
cutlass_apply_standard_compile_options(${NAME})
cutlass_apply_cuda_gencode_flags(${NAME})
target_compile_features(
${NAME}
INTERFACE
cxx_std_11
)
if(__EXPORT_NAME)
add_library(nvidia::cutlass::${__EXPORT_NAME} ALIAS ${NAME})
set_target_properties(${NAME} PROPERTIES EXPORT_NAME ${__EXPORT_NAME})
endif()
endfunction()
function(cutlass_add_executable NAME)
set(options)
set(oneValueArgs)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
if(CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "clang")
cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
add_executable(${NAME} ${TARGET_SOURCE_ARGS})
else()
set(CUDA_LINK_LIBRARIES_KEYWORD PRIVATE)
cuda_add_executable(${NAME} ${TARGET_SOURCE_ARGS})
endif()
cutlass_apply_standard_compile_options(${NAME})
cutlass_apply_cuda_gencode_flags(${NAME})
target_compile_features(
${NAME}
INTERFACE
cxx_std_11
)
endfunction()
function(cutlass_target_sources NAME)
set(options)
set(oneValueArgs)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
target_sources(${NAME} ${TARGET_SOURCE_ARGS})
endfunction()
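The helpers above (`cutlass_add_library`, `cutlass_add_executable`, `cutlass_target_sources`) route sources through `cutlass_unify_source_files` and then apply the standard compile options and gencode flags. A hedged sketch of how a target inside the build tree could use them; the target and source names are made up for illustration.

```
# Sketch only: create an executable with the helper defined above.
cutlass_add_executable(
  perf_sketch        # hypothetical target name
  gemm_runner.cu     # hypothetical sources
  main.cpp
)
target_link_libraries(perf_sketch PRIVATE nvidia::cutlass::cutlass)
```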

Doxyfile (modified)

@@ -32,7 +32,7 @@ DOXYFILE_ENCODING = UTF-8
# title of most generated pages and in a few other places.
# The default value is: My Project.
PROJECT_NAME = "Cutlass"
PROJECT_NAME = "CUTLASS"
# The PROJECT_NUMBER tag can be used to enter a project or revision number. This
# could be handy for archiving the generated documentation or if some version
@@ -51,7 +51,7 @@ PROJECT_BRIEF = "CUDA Templates for Linear Algebra Subroutines and Solv
# and the maximum width should not exceed 200 pixels. Doxygen will copy the logo
# to the output directory.
PROJECT_LOGO =
PROJECT_LOGO = media/images/cutlass-logo-small.png
# The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path
# into which the generated documentation will be written. If a relative path is
@@ -206,7 +206,7 @@ SEPARATE_MEMBER_PAGES = NO
# uses this value to replace tabs by spaces in code fragments.
# Minimum value: 1, maximum value: 16, default value: 4.
TAB_SIZE = 4
TAB_SIZE = 2
# This tag can be used to specify a number of aliases that act as commands in
# the documentation. An alias has the form:
@@ -218,7 +218,8 @@ TAB_SIZE = 4
# "Side Effects:". You can put \n's in the value part of an alias to insert
# newlines.
ALIASES =
#ALIASES += "concept{1}=@ingroup \1\n@par Implemented concepts:\n@ref \1"
ALIASES += "concept{1}=@ingroup \1"
# This tag can be used to specify a number of word-keyword mappings (TCL only).
# A mapping has the form "name=value". For example adding "class=itcl::class"
@@ -296,7 +297,7 @@ AUTOLINK_SUPPORT = YES
# diagrams that involve STL classes more complete and accurate.
# The default value is: NO.
BUILTIN_STL_SUPPORT = NO
BUILTIN_STL_SUPPORT = YES
# If you use Microsoft's C++/CLI language, you should set this option to YES to
# enable parsing support.
@@ -396,7 +397,7 @@ LOOKUP_CACHE_SIZE = 0
# normally produced when WARNINGS is set to YES.
# The default value is: NO.
EXTRACT_ALL = NO
EXTRACT_ALL = YES
# If the EXTRACT_PRIVATE tag is set to YES all private members of a class will
# be included in the documentation.
@@ -733,7 +734,9 @@ WARN_LOGFILE =
# spaces.
# Note: If this tag is empty the current directory is searched.
INPUT = cutlass cutlass/gemm cutlass/util
INPUT = include/cutlass tools/util/include/cutlass/ tools/library/include/cutlass/
INPUT += media/docs/doxygen_mainpage.md
# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
@@ -759,7 +762,7 @@ FILE_PATTERNS =
# be searched for input files as well.
# The default value is: NO.
RECURSIVE = NO
RECURSIVE = YES
# The EXCLUDE tag can be used to specify files and/or directories that should be
# excluded from the INPUT source files. This way you can easily exclude a
@@ -869,7 +872,7 @@ FILTER_SOURCE_PATTERNS =
# (index.html). This can be useful if you have a project on for instance GitHub
# and want to reuse the introduction page also for the doxygen output.
USE_MDFILE_AS_MAINPAGE =
USE_MDFILE_AS_MAINPAGE = media/docs/doxygen_mainpage.md
#---------------------------------------------------------------------------
# Configuration options related to source browsing
@@ -998,7 +1001,7 @@ GENERATE_HTML = YES
# The default directory is: html.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_OUTPUT = generated-html
HTML_OUTPUT =
# The HTML_FILE_EXTENSION tag can be used to specify the file extension for each
# generated HTML page (for example: .htm, .php, .asp).
@@ -1079,7 +1082,7 @@ HTML_EXTRA_FILES =
# Minimum value: 0, maximum value: 359, default value: 220.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_COLORSTYLE_HUE = 82
HTML_COLORSTYLE_HUE = 100
# The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors
# in the HTML output. For a value of 0 the output will use grayscales only. A
@@ -1087,7 +1090,7 @@ HTML_COLORSTYLE_HUE = 82
# Minimum value: 0, maximum value: 255, default value: 100.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_COLORSTYLE_SAT = 100
HTML_COLORSTYLE_SAT = 50
# The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the
# luminance component of the colors in the HTML output. Values below 100
@@ -1106,7 +1109,7 @@ HTML_COLORSTYLE_GAMMA = 80
# The default value is: YES.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_TIMESTAMP = YES
HTML_TIMESTAMP = NO
# If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML
# documentation will contain sections that can be hidden and shown after the
@@ -2032,7 +2035,7 @@ HIDE_UNDOC_RELATIONS = YES
# set to NO
# The default value is: NO.
HAVE_DOT = NO
HAVE_DOT = $(HAVE_DOT)
# The DOT_NUM_THREADS specifies the number of dot invocations doxygen is allowed
# to run in parallel. When set to 0 doxygen will base this on the number of
@@ -2204,7 +2207,7 @@ INTERACTIVE_SVG = NO
# found. If left blank, it is assumed the dot tool can be found in the path.
# This tag requires that the tag HAVE_DOT is set to YES.
DOT_PATH =
DOT_PATH = $(DOT_PATH)
# The DOTFILE_DIRS tag can be used to specify one or more directories that
# contain dot files that are included in the documentation (see the \dotfile

LICENSE.txt (new file, 23 lines)

@@ -0,0 +1,23 @@
Copyright (c) 2017 - 2020, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md (362 changed lines)

@@ -1,75 +1,339 @@
# Introduction
![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# CUTLASS 2.2
_CUTLASS 2.2 - June 2020_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
It incorporates strategies for hierarchical decomposition and data movement similar
to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
reusable, modular software components abstracted by C++ template classes. These
thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
and tuned via custom tiling sizes, data types, and other algorithmic policy. The
resulting flexibility simplifies their use as building blocks within custom kernels
and applications.
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32), double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
Furthermore, CUTLASS demonstrates warp-synchronous matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, and Ampere architectures.
See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
See the [functionality listing](media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.
# What's New in CUTLASS 2.2
CUTLASS 2.2 is a significant update to CUTLASS adding:
- Coverage of [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
- Tensor Core-accelerated GEMMs targeting Tensor Float 32, BFloat16, and double-precision data types
- Deep software pipelines using asynchronous copy
- Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745)
- Intended to be compiled with [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit)
# What's New in CUTLASS 2.1
CUTLASS 2.1 is a minor update to CUTLASS 2.0 adding:
- [Planar complex GEMM kernels](/examples/10_planar_complex/planar_complex.cu) targeting Volta and Turing Tensor Cores
- BLAS-style API to launch kernels compiled into the [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
# What's New in CUTLASS 2.0
CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
- Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
- Robust and durable templates that reliably span the design space
- Encapsulated functionality that may be reusable in other contexts
**See the [CHANGELOG](CHANGELOG.md) for more details.**
# Performance
<p align="center"><img src=/media/images/cutlass-performance-plot.png></p>
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows CUTLASS performance relative to cuBLAS
for large matrix dimensions on an NVIDIA GeForce 2080 Ti, an NVIDIA A100, and an NVIDIA TitanV
using CUDA 11.0 Toolkit. Tensor Core operations are implemented using CUDA's
[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
# Compatibility
CUTLASS requires a C++11 host compiler and
performs best when built with the [CUDA 11.0 Toolkit](https://developer.nvidia.com/cuda-toolkit).
It is compatible with CUDA 9.2, CUDA 10.0, CUDA 10.1, and CUDA 10.2.
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Windows 10 | Microsoft Visual Studio 2015|
| | Microsoft Visual Studio 2017|
| Ubuntu 16.04 | GCC 5.4.0 |
| Ubuntu 18.04 | GCC 7.5.0 |
Additionally, CUTLASS may be built with clang.
See [these instructions](media/docs/quickstart.md#clang) for more details.
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on
any Maxwell-, Pascal-, Volta-, Turing-, or NVIDIA Ampere- architecture NVIDIA GPU.
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit**|**CUDA Toolkit Enabling Native Tensor Cores**|
|---|---|---|---|
|NVIDIA Tesla P100|6.0|9.2| |
|NVIDIA GeForce 1080|6.1|9.2| |
|NVIDIA TitanXP|6.1|9.2| |
|NVIDIA Tesla V100|7.0|9.2|10.1|
|NVIDIA TitanV|7.0|9.2|10.1|
|NVIDIA GeForce RTX 2080 TI, 2080, 2070|7.5|10.0|10.2|
|NVIDIA Tesla T4|7.5|10.0|10.2|
|NVIDIA A100|8.0|11.0|11.0|
# Documentation
CUTLASS 2.2 is described in the following documents and the accompanying
[Doxygen documentation](https://nvidia.github.io/cutlass).
- [Quick Start Guide](/media/docs/quickstart.md) - build and run CUTLASS
- [Functionality](/media/docs/functionality.md) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](media/docs/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
- [GEMM API](media/docs/gemm_api.md) - describes the CUTLASS GEMM model and C++ template concepts
- [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
- [Terminology](media/docs/terminology.md) - describes terms used in the code
- [Programming Guidelines](media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
- [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
- [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Profiler](media/docs/profiler.md) - command-line driven profiling application
- [CUTLASS Utilities](media/docs/utilities.md) - additional templates used to facilitate rapid development
We have also described the structure of an efficient GEMM in our talk at the
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
# Building CUTLASS
CUTLASS is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS's `include/` directory in their include
paths.
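For instance, once `include/` is on the include path, a client can instantiate and launch a single-precision GEMM entirely from device-level templates. The following is a minimal sketch along the lines of the `examples/00_basic_gemm` sample (all pointers are device pointers; error handling omitted):
```
#include "cutlass/gemm/device/gemm.h"

// Column-major SGEMM: C = alpha * A * B + beta * C, instantiated from CUTLASS device-level templates.
cutlass::Status cutlass_sgemm_nn(int M, int N, int K,
                                 float alpha, float const *A, int lda,
                                 float const *B, int ldb,
                                 float beta, float *C, int ldc) {
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using Gemm = cutlass::gemm::device::Gemm<float, ColumnMajor,    // ElementA, LayoutA
                                            float, ColumnMajor,    // ElementB, LayoutB
                                            float, ColumnMajor>;   // ElementC, LayoutC

  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},       // GEMM problem dimensions
                       {A, lda},        // TensorRef to A
                       {B, ldb},        // TensorRef to B
                       {C, ldc},        // TensorRef to C (source)
                       {C, ldc},        // TensorRef to D (destination)
                       {alpha, beta});  // epilogue scalars

  return gemm_op(args);                 // launches the kernel on the default stream
}
```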
CUTLASS unit tests, examples, and utilities can be built with CMake starting with version 3.12.
Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
on your system.
```
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
```
Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, and 8.0. To reduce compile time you can specify
the architectures to build CUTLASS for by changing the CMake configuration setting
`CUTLASS_NVCC_ARCHS`.
```
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 # compiles for NVIDIA's Turing GPU architecture
```
From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
and they may be executed in parallel via make's `-j` command line argument.
```
$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[ PASSED ] 946 tests.
```
All tests should pass on supported platforms, though the exact number of tests may vary over time.
# Project Structure
CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes,
and template concepts defined in the CUTLASS project.
A detailed explanation of the source code organization may be found in the
[CUTLASS documentation](media/docs/code_organization.md), but several main components are summarized below.
## CUTLASS Template Library
```
include/                     # client applications should target this directory in their build's include paths
  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
    gemm/                    # code specialized for general matrix product computations
    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
    platform/                # CUDA-capable Standard Library components
    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" model
    transform/               # code specialized for layout, type, and domain transformations
    *                        # core vocabulary types, containers, and basic numeric operations
```
### CUTLASS SDK Examples
[CUTLASS SDK examples](/examples) apply CUTLASS templates to implement basic computations.
```
examples/
00_basic_gemm/ # launches a basic GEMM with single precision inputs and outputs
01_cutlass_utilities/ # demonstrates CUTLASS Utilities for allocating and initializing tensors
02_dump_reg_smem/ # debugging utilities for printing register and shared memory contents
03_visualize_layout/ # utility for visualizing all layout functions in CUTLASS
04_tile_iterator/ # example demonstrating an iterator over tiles in memory
05_batched_gemm/ # example demonstrating CUTLASS's batched strided GEMM operation
06_splitK_gemm/ # example demonstrating CUTLASS's Split-K parallel reduction kernel
07_volta_tensorop_gemm/ # example demonstrating mixed precision GEMM using Volta Tensor Cores
08_turing_tensorop_gemm/ # example demonstrating integer GEMM using Turing Tensor Cores
10_planar_complex/ # example demonstrating planar complex GEMM kernels
11_planar_complex_array/ # example demonstrating planar complex kernels with batch-specific problem sizes
12_gemm_bias_relu/ # example demonstrating GEMM fused with bias and relu
13_fused_two_gemms/ # example demonstrating two GEMMs fused in one kernel
```
### Tools
```
tools/
library/ # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
include/
cutlass/
library/
profiler/ # CUTLASS Profiler - command-line utility for executing operations in the
# CUTLASS Library
util/ # CUTLASS Utilities - contains numerous helper classes for
include/ # managing tensors in device memory, reference
cutlass/ # implementations for GEMM, random initialization
util/ # of tensors, and I/O.
```
### Test
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
Instructions for building and running the unit tests are described in the [Quickstart guide](media/docs/quickstart.md).
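For flavor, a CUTLASS-style unit test is an ordinary Google Test case built against the headers. The snippet below is a sketch, not one of the shipped tests; it assumes `cutlass::half_t` from `cutlass/numeric_types.h` behaves as described in the Fundamental Types document:
```
#include <gtest/gtest.h>
#include "cutlass/numeric_types.h"

// Sketch of a host-side unit test: round-trip an exactly representable value
// through CUTLASS's half-precision type.
TEST(SampleCutlass, HalfRoundTrip) {
  cutlass::half_t x(2.25f);      // 2.25 is exactly representable in FP16
  EXPECT_EQ(float(x), 2.25f);
}
```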
# Performance Profiling
The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
It can be built as follows:
```
$ make cutlass_profiler -j
```
To limit compilation time, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all of them, set the following CMake option when running CMake from an empty `build/` directory.
```
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
...
$ make cutlass_profiler -j
```
An example command line for profiling SGEMM kernels is as follows:
```
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
Status: Success
Verification: ON
Disposition: Passed
cuBLAS: Passed
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 180355072 bytes
FLOPs: 115992428544 flops
Runtime: 6.73655 ms
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
```
[Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)
# About
CUTLASS is released by NVIDIA Corporation as Open Source software under the
[3-clause "New" BSD license](LICENSE.txt).
# Contributors
The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).
# Copyright
Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

bin2hex.cmake Normal file
@ -0,0 +1,26 @@
# A small utility function which generates a C-header from an input file
function(FILE_TO_C_STRING FILENAME VARIABLE_NAME OUTPUT_STRING ZERO_TERMINATED)
FILE(READ "${FILENAME}" HEX_INPUT HEX)
if (${ZERO_TERMINATED})
string(APPEND HEX_INPUT "00")
endif()
string(REGEX REPLACE "(....)" "\\1\n" HEX_OUTPUT ${HEX_INPUT})
string(REGEX REPLACE "([0-9a-f][0-9a-f])" "0x\\1," HEX_OUTPUT ${HEX_OUTPUT})
set(HEX_OUTPUT "static char const ${VARIABLE_NAME}[] = {\n ${HEX_OUTPUT}\n};\n")
set(${OUTPUT_STRING} "${HEX_OUTPUT}" PARENT_SCOPE)
endfunction()
# message("Create header file for ${FILE_IN}")
# message("Create header file for ${FILE_OUT}")
file_to_c_string(${FILE_IN} ${VARIABLE_NAME} OUTPUT_STRING ZERO_TERMINATED)
set(RESULT "#pragma once\n")
string(APPEND RESULT "namespace cutlass {\n")
string(APPEND RESULT "namespace nvrtc {\n")
string(APPEND RESULT "${OUTPUT_STRING}")
string(APPEND RESULT "} // namespace nvrtc\n")
string(APPEND RESULT "} // namespace cutlass\n")
file(WRITE "${FILE_OUT}" "${RESULT}")
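For reference, the header emitted by this script has roughly the following shape (the variable name and byte values below are placeholders; the actual bytes come from `FILE_IN`):
```
#pragma once
namespace cutlass {
namespace nvrtc {
static char const kKernelSource[] = {
  0x2f, 0x2a, 0x21, 0x20,
  // ... one 0xNN entry per byte of the input file, optionally terminated by 0x00 ...
};
} // namespace nvrtc
} // namespace cutlass
```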

@ -0,0 +1,7 @@
get_filename_component(NvidiaCutlass_CMAKE_DIR "${CMAKE_CURRENT_LIST_FILE}" PATH)
include(CMakeFindDependencyMacro)
if(NOT TARGET nvidia::cutlass::CUTLASS)
include("${NvidiaCutlass_CMAKE_DIR}/NvidiaCutlassTargets.cmake")
endif()

@ -0,0 +1,14 @@
set(CPACK_PACKAGE_NAME NvidiaCutlass)
set(CPACK_PACKAGE_VENDOR NVIDIA)
set(CPACK_PACKAGE_CONTACT info@nvidia.com)
set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "CUTLASS CUDA C++ Template Linear Algebra Library")
set(CPACK_PACKAGE_INSTALL_DIRECTORY ${CPACK_PACKAGE_NAME})
set(CPACK_PACKAGE_VERSION_MAJOR ${PROJECT_VERSION_MAJOR})
set(CPACK_PACKAGE_VERSION_MINOR ${PROJECT_VERSION_MINOR})
set(CPACK_PACKAGE_VERSION_PATCH ${PROJECT_VERSION_PATCH})
set(CPACK_VERBATIM_VARIABLES YES)
# set(CPACK_PACKAGE_DESCRIPTION_FILE ${CMAKE_CURRENT_LIST_DIR}/Description.txt)
# set(CPACK_RESOURCE_FILE_WELCOME ${CMAKE_CURRENT_LIST_DIR}/Welcome.txt)
# set(CPACK_RESOURCE_FILE_LICENSE ${CMAKE_CURRENT_LIST_DIR}/License.txt)
# set(CPACK_RESOURCE_FILE_README ${CMAKE_CURRENT_LIST_DIR}/Readme.txt)
include(CPack)

cmake/googletest.cmake Normal file
@ -0,0 +1,23 @@
include(FetchContent)
set(GOOGLETEST_DIR "" CACHE STRING "Location of local GoogleTest repo to build against")
if(GOOGLETEST_DIR)
set(FETCHCONTENT_SOURCE_DIR_GOOGLETEST ${GOOGLETEST_DIR} CACHE STRING "GoogleTest source directory override")
endif()
FetchContent_Declare(
googletest
GIT_REPOSITORY https://github.com/google/googletest.git
GIT_TAG 0fe9660
)
FetchContent_GetProperties(googletest)
if(NOT googletest_POPULATED)
FetchContent_Populate(googletest)
if (MSVC)
set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
endif()
add_subdirectory(${googletest_SOURCE_DIR} ${googletest_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()

cmake/nop.cu Normal file
@ -0,0 +1,43 @@
/***************************************************************************************************
* Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Basic CUDA file for testing compiler flags.
*/
__device__ int inner()
{
return -1;
}
__global__ void test()
{
inner();
}
int main()
{
test<<<1,1>>>();
return 0;
}

cmake/version.h.in Normal file
@ -0,0 +1,38 @@
#include <cstdint>
#include <string>
#define CUTLASS_MAJOR @CUTLASS_VERSION_MAJOR@
#define CUTLASS_MINOR @CUTLASS_VERSION_MINOR@
#define CUTLASS_PATCH @CUTLASS_VERSION_PATCH@
#define CUTLASS_BUILD @CUTLASS_VERSION_BUILD@
#define CUTLASS_VERSION ((CUTLASS_MAJOR)*100 + (CUTLASS_MINOR)*10 + CUTLASS_PATCH)
namespace cutlass {
inline uint32_t getVersion() {
return CUTLASS_VERSION;
}
inline uint32_t getVersionMajor() {
return CUTLASS_MAJOR;
}
inline uint32_t getVersionMinor() {
return CUTLASS_MINOR;
}
inline uint32_t getVersionPatch() {
return CUTLASS_PATCH;
}
inline uint32_t getVersionBuild() {
return CUTLASS_BUILD + 0;
}
inline std::string getVersionString() {
std::string version = "@CUTLASS_VERSION@";
if (getVersionBuild()) {
version += "." + std::to_string(getVersionBuild());
}
return version;
}
inline std::string getGitRevision() {
return "@CUTLASS_REVISION@";
}
} // namespace cutlass
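A consumer of the configured header might query the version like this (a small sketch; the `cutlass/version.h` include path is an assumption about where the generated file is placed):
```
#include <iostream>
#include "cutlass/version.h"   // assumed location of the configured header

int main() {
  // Print the string form, the packed numeric form, and the git revision.
  std::cout << "CUTLASS " << cutlass::getVersionString()
            << " (" << cutlass::getVersion() << "), revision "
            << cutlass::getGitRevision() << std::endl;
  return 0;
}
```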

common.mk
@ -1,154 +0,0 @@
#/******************************************************************************
# * Copyright (c) 2011, Duane Merrill. All rights reserved.
# * Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
# *
# * Redistribution and use in source and binary forms, with or without
# * modification, are permitted provided that the following conditions are met:
# * * Redistributions of source code must retain the above copyright
# * notice, this list of conditions and the following disclaimer.
# * * Redistributions in binary form must reproduce the above copyright
# * notice, this list of conditions and the following disclaimer in the
# * documentation and/or other materials provided with the distribution.
# * * Neither the name of the NVIDIA CORPORATION nor the
# * names of its contributors may be used to endorse or promote products
# * derived from this software without specific prior written permission.
# *
# * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# *
#******************************************************************************/
#-------------------------------------------------------------------------------
# Commandline Options
#-------------------------------------------------------------------------------
# sm=<XX,...> Compute-capability to compile for, e.g., "sm=200,300,350" (SM2.0 by default).
COMMA := ,
ifdef sm
SM_ARCH := $(subst $(COMMA),-,$(sm))
else
$(error Please specify SM architecture makefile argument: "sm=XX")
endif
ifeq (70, $(findstring 70, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_70,code=\"sm_70,compute_70\"
endif
ifeq (62, $(findstring 62, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_62,code=\"sm_62,compute_62\"
endif
ifeq (61, $(findstring 61, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_61,code=\"sm_61,compute_61\"
endif
ifeq (60, $(findstring 60, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_60,code=\"sm_60,compute_60\"
endif
ifeq (52, $(findstring 52, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_52,code=\"sm_52,compute_52\"
endif
ifeq (37, $(findstring 37, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_37,code=\"sm_37,compute_37\"
endif
ifeq (35, $(findstring 35, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_35,code=\"sm_35,compute_35\"
endif
ifeq (30, $(findstring 30, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_30,code=\"sm_30,compute_30\"
endif
ifeq (21, $(findstring 21, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_20,code=\"sm_21,compute_20\"
endif
ifeq (20, $(findstring 20, $(SM_ARCH)))
SM_TARGETS += -gencode=arch=compute_20,code=\"sm_20,compute_20\"
endif
# [verbose=<0|1>] Verbose toolchain output from nvcc option
ifeq ($(verbose), 1)
NVCCFLAGS += -v
endif
# [keep=<0|1>] Keep intermediate compilation artifacts option
ifeq ($(keep), 1)
NVCCFLAGS += -keep
endif
# [debug=<0|1>] Generate debug mode code
ifeq ($(debug), 1)
NVCCFLAGS += -G
endif
#-------------------------------------------------------------------------------
# Compiler and compilation platform
#-------------------------------------------------------------------------------
BASE_DIR := $(dir $(lastword $(MAKEFILE_LIST)))
NVCC := "$(shell which nvcc)"
ifdef nvccver
NVCC_VERSION := $(nvccver)
else
NVCC_VERSION := $(strip $(shell nvcc --version | grep release | sed 's/.*release //' | sed 's/,.*//'))
endif
# Detect OS
OSUPPER := $(shell uname -s 2>/dev/null | tr [:lower:] [:upper:])
# Default flags: verbose kernel properties (regs, smem, cmem, etc.); runtimes for compilation phases
NVCCFLAGS += -O3 -Xptxas -v
ifeq (WIN_NT, $(findstring WIN_NT, $(OSUPPER)))
# For MSVC
# Enable more warnings and treat as errors
NVCCFLAGS += -Xcompiler /W3 -Xcompiler /WX
# Disable excess x86 floating point precision that can lead to results being labeled incorrectly
NVCCFLAGS += -Xcompiler /fp:strict
# Compiler
CC := cl
# Multithreaded runtime
NVCCFLAGS += -Xcompiler /MT
CUDART_CYG := "$(shell dirname $(NVCC))/../lib/x64/cudart.lib"
CUDART := "$(shell cygpath -w $(CUDART_CYG))"
else
# For g++
# Disable excess x86 floating point precision that can lead to results being labeled incorrectly
#NVCCFLAGS += -Xcompiler -ffloat-store
# Compiler
CC := g++
CUDART := "$(shell dirname $(NVCC))/../lib64/libcudart_static.a"
endif
# Suffix to append to each binary
BIN_SUFFIX := sm$(SM_ARCH)_$(NVCC_VERSION)
#-------------------------------------------------------------------------------
# Function for computing dependency Lists
#-------------------------------------------------------------------------------
rwildcard=$(foreach d,$(wildcard $1*),$(call rwildcard,$d/,$2) $(filter $(subst *,%,$2),$d))

cuBLAS.cmake Normal file
@ -0,0 +1,125 @@
message(STATUS "Configuring cublas ...")
if((DEFINED CUTLASS_ENABLE_CUBLAS AND NOT CUTLASS_ENABLE_CUBLAS) OR
(DEFINED CUBLAS_ENABLED AND NOT CUBLAS_ENABLED))
# Don't add cuBLAS if it's defined and false, assume it's not found.
set(CUBLAS_FOUND OFF)
message(STATUS "cuBLAS Disabled.")
elseif(NOT TARGET cublas)
find_path(
_CUBLAS_INCLUDE_DIR
NAMES cublas.h
HINTS
${CUBLAS_INCLUDE_PATH}
ENV CUBLAS_INCLUDE_PATH
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
include
)
find_library(
_CUBLAS_LIBRARY
NAMES cublas
HINTS
${CUBLAS_LIBRARY_PATH}
ENV CUBLAS_LIBRARY_PATH
${_CUBLAS_INCLUDE_DIR}/..
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib64
lib/x64
lib
)
if(_CUBLAS_INCLUDE_DIR AND _CUBLAS_LIBRARY)
message(STATUS "cuBLAS: ${_CUBLAS_LIBRARY}")
message(STATUS "cuBLAS: ${_CUBLAS_INCLUDE_DIR}")
set(CUBLAS_FOUND ON CACHE INTERNAL "cublas Library Found")
set(CUBLAS_LIBRARY ${_CUBLAS_LIBRARY})
set(CUBLAS_INCLUDE_DIR ${_CUBLAS_INCLUDE_DIR})
else()
message(STATUS "cublas not found.")
set(CUBLAS_FOUND OFF CACHE INTERNAL "cublas Library Found")
endif()
endif()
set(CUTLASS_ENABLE_CUBLAS ${CUBLAS_FOUND} CACHE BOOL "Enable CUTLASS to build with cuBLAS library.")
if(CUTLASS_ENABLE_CUBLAS AND NOT CUBLAS_FOUND)
message(FATAL_ERROR "CUTLASS_ENABLE_CUBLAS enabled but cuBLAS library could not be found.")
endif()
if(CUTLASS_ENABLE_CUBLAS AND NOT TARGET cublas)
if(WIN32)
add_library(cublas STATIC IMPORTED GLOBAL)
else()
add_library(cublas SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::cublas ALIAS cublas)
set_property(
TARGET cublas
PROPERTY IMPORTED_LOCATION
${CUBLAS_LIBRARY})
target_include_directories(
cublas
INTERFACE
$<INSTALL_INTERFACE:include>
$<BUILD_INTERFACE:${CUBLAS_INCLUDE_DIR}>)
find_library(
_CUBLASLT_LIBRARY
NAMES cublasLt
HINTS
${CUBLAS_LIBRARY_PATH}
ENV CUBLAS_LIBRARY_PATH
${_CUBLAS_INCLUDE_DIR}/..
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib64
lib/x64
lib
)
if(_CUBLASLT_LIBRARY AND NOT TARGET cublasLt)
if(WIN32)
add_library(cublasLt STATIC IMPORTED GLOBAL)
else()
add_library(cublasLt SHARED IMPORTED GLOBAL)
endif()
set_property(
TARGET cublasLt
PROPERTY IMPORTED_LOCATION
${_CUBLASLT_LIBRARY})
add_library(nvidia::cublasLt ALIAS cublasLt)
target_link_libraries(cublas INTERFACE cublasLt)
endif()
endif()
message(STATUS "Configuring cuBLAS ... done.")

@ -1,154 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* block-wide tile-loading abstractions
*/
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* load_algorithm
******************************************************************************/
/**
* \brief Enumeration of matrix loading algorithms
*/
struct load_algorithm
{
/// \brief Enumerants. See corresponding tag types.
enum kind_t
{
CongruousCopy = 0,
CrosswiseCopy = 1,
};
/**
* \brief Generic tag
*/
template <kind_t Kind>
struct any_tag : nv_std::integral_constant<kind_t, Kind> {};
/**
* \brief Copy from a global matrix that is row-major in relation
* to the local row-major tile
*/
typedef any_tag<CongruousCopy> contiguous_tag_t;
/**
* \brief Copy from a global matrix that is column-major in relation
* to the local row-major tile
*/
typedef any_tag<CrosswiseCopy> crosswise_tag_t;
};
/******************************************************************************
* block_loader
******************************************************************************/
/**
* \brief A three-phase data loading abstraction (prefetch, commit, and
* advance) for iterating over ranges of block-wide matrix tiles.
*
* Each iteration sequence produces a KxL (height-by-width) block-wide tile of
* value_t in shared memory. The layout of the shared
* block-wide tile is a row-major (L-major) tiling of dp_vector_t items, which are
* themselves column-major (K-major) vectors of value_t. Its dimensions are:
* K = BlockDpVectorsK * (sizeof(dp_vector_t) / sizeof(value_t)
* L = BlockDpVectorsL
*
* NB: This generic class is not directly constructible. Architecture- and
* algorithm-specific template specializations will provide the API
* functionality prescribed here.
*
*/
template <
int BlockThreads, ///< Number of threads in each thread block (blockDim.x)
int BlockDpVectorsK, ///< Extent of block-wide tile in dp_vector_t along the K-axis (height)
int BlockDpVectorsL, ///< Extent of block-wide tile in dp_vector_t along the L-axis (width)
typename value_t, ///< Input matrix value type
int LeadingDimAlignBytes, ///< Byte alignment of input matrix leading dimension
bool AllowRaggedTiles, ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
typename dp_vector_t, ///< Dot-product vector type along the K-axis
load_algorithm::kind_t LoadAlgorithm> ///< Algorithm for loading a shared tile of KxL matrix data
struct block_loader
{
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
block_loader(
value_t *d_matrix, ///< Pointer to input matrix
int matrix_values_l, ///< Extent of the input matrix in value_t along the L-axis
int matrix_values_stride_k, ///< Distance in value_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_values_stride_l, ///< Distance in value_t within pitched-linear memory between successive coordinates along the L-axis
int2 block_begin_item_coords, ///< Thread block's starting value_t coordinates (l, k) within the input matrix
int block_end_item_k); ///< Thread block's ending coordinate (k) within the input matrix (one-past)
//-------------------------------------------------------------------------
// Loader API
//-------------------------------------------------------------------------
/**
* Request the current block-wide tile
*/
void request();
/**
* Advance the loader to the next block-wide tile in the K-axis
*/
void next();
/**
* Commit the previously-requested block-wide tile to shared memory
*
* NB: To facilitate padding for avoiding shared memory bank conflicts, we
* allow the row stride _BlockDpVectorsL to be arbitrarily bigger than the
* tile width BlockDpVectorsL.
*/
template <int _BlockDpVectorsL>
void commit(
dp_vector_t (&scratch_tile)[BlockDpVectorsK][_BlockDpVectorsL]);
};
} // namespace gemm
} // namespace cutlass
/******************************************************************************
* Tail-include specializations that adhere to the block_loader API
******************************************************************************/
#include "block_loader_crosswise.h"
#include "block_loader_congruous_dp1.h"
#include "block_loader_congruous_idp4.h"

@ -1,398 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Tile-loading abstraction for thread blocks
*/
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* block_loader (CongruousCopy + dp1 specialization)
******************************************************************************/
/**
* \brief A three-phase data loading abstraction (prefetch, commit, and
* advance) for iterating over ranges of block-wide matrix tiles.
* (CongruousCopy + dp1 specialization)
*
* Each iteration sequence produces a KxL (height-by-width) block-wide tile of
* value_t in shared memory. The layout of the shared block-wide tile is
* a row-major (L-major) tiling of singleton "dp1" dp_vector_t items, where
* dp_vector_t == value_t. Its dimensions are:
* K = BlockDpVectorsK
* L = BlockDpVectorsL
*
* The data is copied from a corresponding tile of global matrix data whose
* layout of value_t is also L-major. This constitutes a CongruousCopy
* between the L-major global tile and the L-major shared tile.
*
* NB: Because they are "dp1" singletons, the K-major orientation of
* dp_vector_t in shared memory is irrelevant, and the L-major global and
* shared tile layouts are perfectly congruous. As a result, we can increase
* the granularity of data transfer via vectorization of loads and stores
* without any intermediate {dis|re}assembly.
*
* NB: Consecutive threads within a block are mapped in L-major
* fashion across a first-set of LDG-vectors of dp_vector_t (value_t) within
* their global tile. Successive sets of LDG-vectors are then strip-mined
* as necessary down the K-axis. These discontiguous LDG-vectors comprise the
* thread's "slice" of the block-wide tile.
*/
template <
int BlockThreads, ///< Number of threads in each thread block (blockDim.x)
int BlockDpVectorsK, ///< Extent of block-wide tile in dp_vector_t along the K-axis (height)
int BlockDpVectorsL, ///< Extent of block-wide tile in dp_vector_t along the L-axis (width)
typename value_t, ///< Input matrix value type
int LeadingDimAlignBytes, ///< Byte alignment of input matrix leading dimension
bool AllowRaggedTiles ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
>
struct block_loader<
BlockThreads,
BlockDpVectorsK,
BlockDpVectorsL,
value_t,
LeadingDimAlignBytes,
AllowRaggedTiles,
value_t, ///< Dot-product vector type along the K-axis (dp1 specialization)
load_algorithm::CongruousCopy> ///< Algorithm for loading a shared tile of KxL matrix data (CongruousCopy specialization)
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
/// Dot-product vector type along the K-axis
typedef value_t dp_vector_t;
enum
{
/// Number of value_t in a dp_vector_t
DpVectorItems = divide_assert<sizeof(dp_vector_t), sizeof(value_t)>::value,
/// Number of dp_vector_t in a block-wide tile
BlockDpVectors = BlockDpVectorsK * BlockDpVectorsL,
/// Number of dp_vector_t in a thread-tile
ThreadDpVectors = divide_assert<BlockDpVectors, BlockThreads>::value,
};
/// Data movement type, coarsened by LeadingDimAlignBytes, capped by the
/// smaller of either ThreadDpVectors or BlockDpVectorsL
typedef io_vector<
dp_vector_t,
__NV_STD_MIN(ThreadDpVectors, BlockDpVectorsL),
LeadingDimAlignBytes>
ldg_vector_t;
enum
{
/// Number of dp_vector_t per ldg_vector_t
LdgVectorDpVectors = ldg_vector_t::VectorItems,
/// Number of value_t per ldg_vector_t
LdgVectorItems = LdgVectorDpVectors * DpVectorItems,
/// Total number of ldg_vector_t within each block-wide tile
BlockLdgVectors = divide_assert<BlockDpVectors, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along L-axis
BlockLdgVectorsL = divide_assert<BlockDpVectorsL, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along K-axis
BlockLdgVectorsK = BlockDpVectorsK,
/// Number of ldg_vector_t within each thread-tile
ThreadLdgVectors = divide_assert<BlockLdgVectors, BlockThreads>::value,
/// Extent of the thread tile in ldg_vector_t along L-axis
ThreadLdgVectorsL = __NV_STD_MAX(1, (BlockLdgVectorsL / BlockThreads)),
/// Extent of the thread tile in ldg_vector_t along K-axis
ThreadLdgVectorsK = divide_assert<ThreadLdgVectors, ThreadLdgVectorsL>::value,
/// Number of ldg_vector_t within each stripmine-tile
StripmineLdgVectors = BlockThreads,
/// Extent of the stripmine tile in ldg_vector_t along L-axis
StripmineLdgVectorsL = __NV_STD_MIN(BlockLdgVectorsL, StripmineLdgVectors),
/// Extent of the stripmine tile in ldg_vector_t along K-axis
StripmineLdgVectorsK = divide_assert<StripmineLdgVectors, StripmineLdgVectorsL>::value,
/// Alignment in dp_vector_t along L needed for committing prefetch
AlignmentDpVectorsL = LdgVectorDpVectors,
};
/// Predicate bit vector
typedef uint64_t predicate_mask_t;
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
static_assert(
(ThreadLdgVectors <= sizeof(predicate_mask_t) * 8),
"Predicate mask type does not contain enough bits for encoding load predicates");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Input pointer to matrix in ldg_vector_t
ldg_vector_t *d_matrix_ldgvecs;
/// Extent of the input matrix in ldg_vector_t along the L-axis
int matrix_ldgvecs_l;
/// Thread block's ending ldg_vector_t coordinate (k) within the input matrix (one-past)
int block_end_ldgvec_k;
/// Predicate bits for guarding ldg_vector_t loads within "whole-k" block-wide tiles
predicate_mask_t guard;
/// Predicate bits for guarding ldg_vector_t loads within the final block-wide "residue" tile
predicate_mask_t residue_guard;
/// Iteration span in "whole-k" block-wide tiles
int wholek_tiles_remaining;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_ldgvec_stride_k;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the L-axis
int matrix_ldgvec_stride_l;
/// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
int2 block_thread_ldgvec_coords;
/// Thread-wide tile of prefetch data
ldg_vector_t thread_tile[ThreadLdgVectorsK][ThreadLdgVectorsL];
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_loader(
value_t *d_matrix_items, ///< Input pointer to matrix in value_t
int matrix_items_l, ///< Extent of the input matrix in value_t along the L-axis
int matrix_items_stride_k, ///< Distance in value_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_items_stride_l, ///< Distance in value_t within pitched-linear memory between successive coordinates along the L-axis
int2 matrix_block_item_coords, ///< value_t coordinates (l, k) of first block-wide tile within the input matrix
int block_end_item_k) ///< Thread block's ending coordinate (k) within the input matrix (one-past)
:
block_end_ldgvec_k(block_end_item_k),
guard(0),
residue_guard(0)
{
matrix_ldgvecs_l = matrix_items_l / LdgVectorItems;
matrix_ldgvec_stride_k = matrix_items_stride_k / LdgVectorItems,
matrix_ldgvec_stride_l = matrix_items_stride_l;
// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
block_thread_ldgvec_coords = make_int2(
threadIdx.x % BlockLdgVectorsL, // l-coordinate
threadIdx.x / BlockLdgVectorsL); // k-coordinate
// ldg_vector_t coordinates (l, k) of first block-wide tile within the input matrix
int2 matrix_block_ldgvec_coords = make_int2(
matrix_block_item_coords.x / LdgVectorItems, // l-coordinate
matrix_block_item_coords.y); // k-coordinate
// Iteration span in ldg_vector_t
int span_ldgvec_k = (block_end_item_k - matrix_block_item_coords.y);
// ldg_vector_t coordinates (l, k) of first thread-tile tile within the input matrix
int2 matrix_thread_ldgvec_coords = make_int2(
block_thread_ldgvec_coords.x + matrix_block_ldgvec_coords.x,
block_thread_ldgvec_coords.y + matrix_block_ldgvec_coords.y);
// Iteration range in "whole-k" block-wide tiles
wholek_tiles_remaining = span_ldgvec_k / BlockLdgVectorsK;
// Extent of final residue-tile in ldg_vector_t along K-axis
int residue_ldgvecs_k = span_ldgvec_k % BlockLdgVectorsK;
// Initialize I/O predicates
if (AllowRaggedTiles)
{
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
int block_ldgvec_k = block_thread_ldgvec_coords.y + (thread_ldgvec_k * StripmineLdgVectorsK);
// Whether block_ldgvec_coords.y is valid in the final residue tile
predicate_mask_t valid_k = (block_ldgvec_k < residue_ldgvecs_k);
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
// Whether block_ldgvec_coords.x is valid any block-wide tile
predicate_mask_t valid_l = (matrix_block_ldgvec_coords.x + block_ldgvec_l < matrix_ldgvecs_l);
// Linear index of ldg_vector_t load
int ldgvec_idx = thread_ldgvec_l + (thread_ldgvec_k * ThreadLdgVectorsL);
// Set predicate guard bits
guard |= (valid_l << ldgvec_idx);
residue_guard |= ((valid_l & valid_k) << ldgvec_idx);
}
}
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
// Update the input pointer to be matrix_thread_ldgvec_coords
this->d_matrix_ldgvecs =
reinterpret_cast<ldg_vector_t*>(d_matrix_items) +
(matrix_thread_ldgvec_coords.y * matrix_ldgvec_stride_k) +
(matrix_thread_ldgvec_coords.x * matrix_ldgvec_stride_l);
}
//-------------------------------------------------------------------------
// Loader API
//-------------------------------------------------------------------------
/**
* Request the current block-wide tile
*/
inline __device__
void request()
{
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
// Linear index of ldg_vector_t load
int ldgvec_idx = (thread_ldgvec_k * ThreadLdgVectorsL) + thread_ldgvec_l;
// Unpack predicate guard
predicate_mask_t valid = ((guard >> ldgvec_idx) & 1);
if (!AllowRaggedTiles || valid)
{
// Perform load
thread_tile[thread_ldgvec_k][thread_ldgvec_l].load(
d_matrix_ldgvecs +
(thread_ldgvec_k * StripmineLdgVectorsK * matrix_ldgvec_stride_k) +
(thread_ldgvec_l * StripmineLdgVectorsL * matrix_ldgvec_stride_l));
}
else
{
// Zero-initialize
#pragma unroll
for (int dpvec = 0; dpvec < LdgVectorDpVectors; ++dpvec)
thread_tile[thread_ldgvec_k][thread_ldgvec_l].buff[dpvec] = 0;
}
}
}
}
/**
* Advance the loader to the next block-wide tile in the K-axis
*/
inline __device__
void next()
{
d_matrix_ldgvecs += (matrix_ldgvec_stride_k * BlockLdgVectorsK);
if (AllowRaggedTiles)
{
--wholek_tiles_remaining;
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
}
/**
* Commit the previously-requested block-wide tile to shared memory
*
* NB: To facilitate padding for avoiding shared memory bank conflicts, we
* allow the row stride SmemDpVectorsL to be arbitrarily bigger than the
* tile width BlockDpVectorsL.
*/
template <int SmemDpVectorsL>
inline __device__
void commit(
dp_vector_t (&scratch_tile)[BlockDpVectorsK][SmemDpVectorsL])
{
static_assert(SmemDpVectorsL >= BlockDpVectorsL, "Row stride must be >= tile width.");
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
int block_ldgvec_k = block_thread_ldgvec_coords.y + (thread_ldgvec_k * StripmineLdgVectorsK);
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
thread_tile[thread_ldgvec_k][thread_ldgvec_l].store(
&scratch_tile[block_ldgvec_k][block_ldgvec_l * LdgVectorDpVectors]);
}
}
}
};
} // namespace gemm
} // namespace cutlass

@ -1,536 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Tile-loading abstraction for thread blocks
*/
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* block_loader (CongruousCopy + idp4 specialization)
******************************************************************************/
/**
* \brief A three-phase data loading abstraction (prefetch, commit, and
* advance) for iterating over ranges of block-wide matrix tiles.
* (CongruousCopy + idp4 specialization)
*
* Each iteration sequence produces a KxL (height-by-width) block-wide tile of
* value_t in shared memory. The layout of the shared block-wide tile is
* a row-major (L-major) tiling of int32_t dp_vector_t, which are themselves
* column-major (K-major) vectors of int8_t value_t. Its dimensions are:
* K = BlockDpVectorsK * (sizeof(dp_vector_t) / sizeof(value_t)
* L = BlockDpVectorsL
*
* The data is copied from a corresponding tile of global matrix data whose
* layout of value_t is also L-major. This constitutes a CongruousCopy between
* the L-major global tile and the L-major shared tile.
*
* NB: The K-major value_t in shared dp_vector_t are imperfectly congruous
* with the L-major value_t in global memory. As a result, the granularity
* of data transfer is a "dp-square" of (DpVectorItems * DpVectorItems) values
* that must be transposed from L-oriented dp_vector_t to K-oriented
* dp_vector_t prior to commitment.
*
* NB: Consecutive threads within a block are mapped in L-major
* fashion across a first-set of squares within their global tile. Successive
* sets of squares are then strip-mined as necessary down the K-axis. These
* discontiguous squares comprise the thread's "slice" of the block-wide tile.
*/
template <
int BlockThreads, ///< Number of threads in each thread block (blockDim.x)
int _BlockDpVectorsK, ///< Extent of block-wide tile in dp_vector_t along the K-axis (height)
int _BlockDpVectorsL, ///< Extent of block-wide tile in dp_vector_t along the L-axis (width)
int LeadingDimAlignBytes, ///< Byte alignment of input matrix leading dimension
bool AllowRaggedTiles ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
>
struct block_loader<
BlockThreads,
_BlockDpVectorsK,
_BlockDpVectorsL,
int8_t, ///< Input matrix value type (idp4 specialization)
LeadingDimAlignBytes,
AllowRaggedTiles,
int32_t, ///< Dot-product vector type along the K-axis (idp4 specialization)
load_algorithm::CongruousCopy> ///< Algorithm for loading a shared tile of KxL matrix data (CrosswiseCopy specialization)
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
/// Input matrix value type
typedef int8_t value_t;
/// Dot-product vector type along the K-axis
typedef int32_t dp_vector_t;
enum
{
/// Number of value_t in a dp_vector_t
DpVectorItems = divide_assert<sizeof(dp_vector_t), sizeof(value_t)>::value,
/// Number of dp_vector_t in a block-wide tile
BlockDpVectors = _BlockDpVectorsK * _BlockDpVectorsL,
/// Number of dp_vector_t in a thread-tile
ThreadDpVectors = divide_assert<BlockDpVectors, BlockThreads>::value,
/// Number of dp_vector_t in a dp-square
SquareDpVectors = DpVectorItems,
/// Number of dp-square tiles in a thread-tile
ThreadSquares = divide_assert<ThreadDpVectors, SquareDpVectors>::value,
/// Extent of block-wide tile in transposed dp_vector_t along the K-axis (height)
BlockTransDpVectorsK = _BlockDpVectorsK * DpVectorItems,
/// Extent of block-wide tile in transposed dp_vector_t along the L-axis (height)
BlockTransDpVectorsL = divide_assert<_BlockDpVectorsL, DpVectorItems>::value,
};
/// Load-from-global data movement type, coarsened by LeadingDimAlignBytes, capped by the
/// smaller of either ThreadSquares or BlockTransDpVectorsL
typedef io_vector<
dp_vector_t,
__NV_STD_MIN(ThreadSquares, BlockTransDpVectorsL),
LeadingDimAlignBytes>
ldg_vector_t;
/// Store-to-shared data movement type equivalent to a dp-square
typedef io_vector<
dp_vector_t,
SquareDpVectors>
sts_vector_t;
enum
{
/// Number of dp_vector_t per ldg_vector_t
LdgVectorDpVectors = ldg_vector_t::VectorItems,
/// Number of value_t per ldg_vector_t
LdgVectorItems = LdgVectorDpVectors * DpVectorItems,
/// Total number of ldg_vector_t within each block-wide tile
BlockLdgVectors = divide_assert<BlockDpVectors, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along L-axis
BlockLdgVectorsL = divide_assert<BlockTransDpVectorsL, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along K-axis
BlockLdgVectorsK = BlockTransDpVectorsK,
/// Number of ldg_vector_t within each thread-tile
ThreadLdgVectors = divide_assert<BlockLdgVectors, BlockThreads>::value,
/// Extent of the thread tile in ldg_vector_t along L-axis
ThreadLdgVectorsL = __NV_STD_MAX(1, (BlockLdgVectorsL / BlockThreads)),
/// Extent of the thread tile in ldg_vector_t along K-axis
ThreadLdgVectorsK = divide_assert<ThreadLdgVectors, ThreadLdgVectorsL>::value,
/// Extent of the thread tile in dp-square tiles along K-axis
ThreadSquaresK = divide_assert<ThreadLdgVectorsK, SquareDpVectors>::value,
/// Number of ldg_vector_t within each stripmine-tile
StripmineLdgVectors = BlockThreads * SquareDpVectors,
/// Extent of the stripmine tile in ldg_vector_t along L-axis
StripmineLdgVectorsL = __NV_STD_MIN(BlockLdgVectorsL, BlockThreads),
/// Extent of the stripmine tile in ldg_vector_t along K-axis
StripmineLdgVectorsK = divide_assert<StripmineLdgVectors, StripmineLdgVectorsL>::value,
/// Extent of the stripmine tile in dp-square tiles along K-axis
StripmineSquaresK = divide_assert<StripmineLdgVectorsK, SquareDpVectors>::value,
/// Alignment in dp_vector_t along L needed for committing prefetch
AlignmentDpVectorsL = LdgVectorDpVectors,
};
/// Predicate mask type
typedef uint32_t predicate_mask_t;
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
static_assert((LeadingDimAlignBytes >= 4) && (LeadingDimAlignBytes % 4 == 0),
"Alignment for matrix operands to IGEMM must be a multiple of 4 bytes.");
static_assert(
(ThreadLdgVectors <= sizeof(predicate_mask_t) * 8),
"Predicate mask type does not contain enough bits for encoding load predicates");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Input pointer to matrix in ldg_vector_t
ldg_vector_t *d_matrix_ldgvecs;
/// Extent of the input matrix in ldg_vector_t along the L-axis
int matrix_ldgvecs_l;
/// Thread block's ending ldg_vector_t coordinate (k) within the input matrix (one-past)
int block_end_ldgvec_k;
/// Predicate bits for guarding ldg_vector_t loads within "whole-k" block-wide tiles
predicate_mask_t guard;
/// Predicate bits for guarding ldg_vector_t loads within the final block-wide "residue" tile
predicate_mask_t residue_guard;
/// Iteration span in "whole-k" block-wide tiles
int wholek_tiles_remaining;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_ldgvec_stride_k;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the L-axis
int matrix_ldgvec_stride_l;
/// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
int2 block_thread_ldgvec_coords;
/// Thread-wide tile of prefetch data
ldg_vector_t thread_tile[ThreadSquaresK][SquareDpVectors][ThreadLdgVectorsL];
//-------------------------------------------------------------------------
// Utility methods
//-------------------------------------------------------------------------
/**
* \brief Byte-permute. Pick four arbitrary bytes from two 32-bit registers, and reassemble them into a 32-bit destination register. For SM2.0 or later.
*
* \par
* The bytes in the two source registers \p a and \p b are numbered from 0 to 7:
* {\p b, \p a} = {{b7, b6, b5, b4}, {b3, b2, b1, b0}}. For each of the four bytes
* {b3, b2, b1, b0} selected in the return value, a 4-bit selector is defined within
* the four lower "nibbles" of \p index: {\p index } = {n7, n6, n5, n4, n3, n2, n1, n0}
*
* \par Snippet
* The code snippet below illustrates byte-permute.
* \par
* \code
* #include <cub/cub.cuh>
*
* __global__ void ExampleKernel(...)
* {
* int a = 0x03020100;
* int b = 0x07060504;
* int index = 0x00007531;
*
* int selected = prmt(a, b, index); // 0x07050301
*
* \endcode
*
*/
inline __device__
int32_t prmt(int32_t a, int32_t b, unsigned int index)
{
int ret;
asm volatile("prmt.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(a), "r"(b), "r"(index));
return ret;
}
/**
* Convert a "dp-square" from L-major to K-major
*/
inline __device__
void transpose_dp_square(dp_vector_t (&dp_square)[SquareDpVectors])
{
// Transpose dp_vector_t squares
int32_t y = prmt(dp_square[0], dp_square[1], 0x00007362);
int32_t w = prmt(dp_square[2], dp_square[3], 0x00007362);
int32_t x = prmt(dp_square[0], dp_square[1], 0x00005140);
int32_t z = prmt(dp_square[2], dp_square[3], 0x00005140);
dp_square[0] = prmt(x, z, 0x00005410);
dp_square[1] = prmt(x, z, 0x00007632);
dp_square[2] = prmt(y, w, 0x00005410);
dp_square[3] = prmt(y, w, 0x00007632);
}
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_loader(
value_t *d_matrix_items, ///< Input pointer to matrix in value_t
int matrix_items_l, ///< Extent of the input matrix in value_t along the L-axis
int matrix_items_stride_k, ///< Distance in value_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_items_stride_l, ///< Distance in value_t within pitched-linear memory between successive coordinates along the L-axis
int2 matrix_block_item_coords, ///< value_t coordinates (l, k) of first block-wide tile within the input matrix
int block_end_item_k) ///< Thread block's ending coordinate (k) within the input matrix (one-past)
:
block_end_ldgvec_k(block_end_item_k),
guard(0),
residue_guard(0)
{
matrix_ldgvecs_l = matrix_items_l / LdgVectorItems;
matrix_ldgvec_stride_k = matrix_items_stride_k / LdgVectorItems,
matrix_ldgvec_stride_l = matrix_items_stride_l;
// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
block_thread_ldgvec_coords = make_int2(
threadIdx.x % BlockLdgVectorsL, // l-coordinate
(threadIdx.x / BlockLdgVectorsL) * SquareDpVectors); // k-coordinate
// ldg_vector_t coordinates (l, k) of first block-wide tile within the input matrix
int2 matrix_block_ldgvec_coords = make_int2(
matrix_block_item_coords.x / LdgVectorItems, // l-coordinate
matrix_block_item_coords.y); // k-coordinate
// Iteration span in ldg_vector_t
int span_ldgvec_k = (block_end_item_k - matrix_block_item_coords.y);
// ldg_vector_t coordinates (l, k) of first thread-tile tile within the input matrix
int2 matrix_thread_ldgvec_coords = make_int2(
block_thread_ldgvec_coords.x + matrix_block_ldgvec_coords.x,
block_thread_ldgvec_coords.y + matrix_block_ldgvec_coords.y);
// Iteration range in "whole-k" block-wide tiles
wholek_tiles_remaining = span_ldgvec_k / BlockLdgVectorsK;
// Extent of final residue-tile in ldg_vector_t along K-axis
int residue_ldgvecs_k = span_ldgvec_k % BlockLdgVectorsK;
// Initialize I/O predicates
if (AllowRaggedTiles)
{
// Iterate through rows of squares in thread tile
#pragma unroll
for (int thread_square_k = 0; thread_square_k < ThreadSquaresK; ++thread_square_k)
{
// Iterate through rows of dp_vector_t in each square
#pragma unroll
for (int square_dpvec = 0; square_dpvec < SquareDpVectors; ++square_dpvec)
{
// ldg_vector_t K-coordinate in block-wide tile (K-axis strip-mining of ldg_vector_t within block-tile)
int block_ldgvec_k =
block_thread_ldgvec_coords.y +
(thread_square_k * StripmineLdgVectorsK) +
square_dpvec;
// Whether block_ldgvec_coords.y is valid in the final residue tile
predicate_mask_t valid_k = (block_ldgvec_k < residue_ldgvecs_k);
// L-axis strip-mining of block-tile
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
// ldg_vector_t L-coordinate in block-wide tile (L-axis strip-mining of ldg_vector_t within block-tile)
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
// Whether block_ldgvec_coords.x is valid in any block-wide tile
predicate_mask_t valid_l = (matrix_block_ldgvec_coords.x + block_ldgvec_l < matrix_ldgvecs_l);
// Linear index of ldg_vector_t load
int ldgvec_idx =
(thread_square_k * SquareDpVectors * ThreadLdgVectorsL) +
(square_dpvec * ThreadLdgVectorsL) +
thread_ldgvec_l;
// Set predicate guard bits
guard |= (valid_l << ldgvec_idx);
residue_guard |= ((valid_l & valid_k) << ldgvec_idx);
}
}
}
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
// Update the input pointer to be matrix_thread_ldgvec_coords
this->d_matrix_ldgvecs =
reinterpret_cast<ldg_vector_t*>(d_matrix_items) +
(matrix_thread_ldgvec_coords.y * matrix_ldgvec_stride_k) +
(matrix_thread_ldgvec_coords.x * matrix_ldgvec_stride_l);
}
//-------------------------------------------------------------------------
// Loader API
//-------------------------------------------------------------------------
/**
* Request the current block-wide tile
*/
inline __device__
void request()
{
// Each thread iterates through the ldg_vector_t in its thread tile
// Iterate through rows of squares in thread tile
#pragma unroll
for (int thread_square_k = 0; thread_square_k < ThreadSquaresK; ++thread_square_k)
{
// Iterate through rows of dp_vector_t in each square
#pragma unroll
for (int square_dpvec = 0; square_dpvec < SquareDpVectors; ++square_dpvec)
{
// Iterate through ldg_vector_t in each row
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
// Linear index of ldg_vector_t load
int ldgvec_idx =
(thread_square_k * SquareDpVectors * ThreadLdgVectorsL) +
(square_dpvec * ThreadLdgVectorsL) +
thread_ldgvec_l;
// Unpack predicate guard
predicate_mask_t valid = ((guard >> ldgvec_idx) & 1);
if (!AllowRaggedTiles || valid)
{
// Perform load
thread_tile[thread_square_k][square_dpvec][thread_ldgvec_l].load(
d_matrix_ldgvecs +
(((thread_square_k * StripmineLdgVectorsK) + square_dpvec) * matrix_ldgvec_stride_k) +
(thread_ldgvec_l * StripmineLdgVectorsL * matrix_ldgvec_stride_l));
}
else
{
// Zero-initialize
#pragma unroll
for (int dpvec = 0; dpvec < LdgVectorDpVectors; ++dpvec)
thread_tile[thread_square_k][square_dpvec][thread_ldgvec_l].buff[dpvec] = 0;
}
}
}
}
}
/**
* Advance the loader to the next block-wide tile in the K-axis
*/
inline __device__
void next()
{
d_matrix_ldgvecs += (matrix_ldgvec_stride_k * BlockLdgVectorsK);
if (AllowRaggedTiles)
{
--wholek_tiles_remaining;
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
}
/**
* Commit the previously-requested block-wide tile to shared memory
*
* NB: To facilitate padding for avoiding shared memory bank conflicts, we
* allow the row stride SmemDpVectorsL to be arbitrarily bigger than the
* tile width BlockDpVectorsL.
*/
template <int SmemDpVectorsL>
inline __device__
void commit(
dp_vector_t (&scratch_tile)[_BlockDpVectorsK][SmemDpVectorsL])
{
static_assert(SmemDpVectorsL >= _BlockDpVectorsL, "Row stride must be >= tile width.");
// Square K-coordinate of thread tile in block-wide tile
int block_thread_square_k = block_thread_ldgvec_coords.y / SquareDpVectors;
// Iterate through rows of squares in thread tile
#pragma unroll
for (int thread_square_k = 0; thread_square_k < ThreadSquaresK; ++thread_square_k)
{
// Square K-coordinate in block-wide tile (K-axis strip-mining of squares within block-tile)
int block_square_k = block_thread_square_k + (thread_square_k * StripmineSquaresK);
// Iterate through ldg_vector_t in each row
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
// ldg_vector_t L-coordinate in block-wide tile (L-axis strip-mining of ldg_vector_t within block-tile)
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
// Iterate through squares in each ldg_vector_t
#pragma unroll
for (int ldgvec_dpvec_l = 0; ldgvec_dpvec_l < LdgVectorDpVectors; ++ldgvec_dpvec_l)
{
// Square L-coordinate in block-wide tile (L-axis raking of square-slices within ldg_vector_t)
int block_square_l = (block_ldgvec_l * LdgVectorDpVectors) + ldgvec_dpvec_l;
// Assemble square of L-major dp_vector_t from stack of slices
sts_vector_t square;
// Iterate through rows of dp_vector_t in each square
#pragma unroll
for (int square_dpvec = 0; square_dpvec < SquareDpVectors; ++square_dpvec)
{
square.buff[square_dpvec] = thread_tile[thread_square_k][square_dpvec][thread_ldgvec_l].buff[ldgvec_dpvec_l];
}
// Transpose square from L-major to K-major
transpose_dp_square(square.buff);
// Store dp-square
square.store(&scratch_tile[block_square_k][block_square_l * SquareDpVectors]);
}
}
}
}
};
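/**
 * Minimal sketch of the three-phase loading protocol exposed above
 * (illustrative only; block_task chooses the concrete loader, scratch layout,
 * and synchronization points). Each iteration requests a block-wide tile from
 * global memory into registers, advances the global pointer one tile along
 * the K-axis, and commits the registers to the shared scratch tile once the
 * previous tile is no longer in use.
 */
template <typename loader_t, typename scratch_tile_t>
inline __device__
void stream_tiles_sketch(loader_t &loader, scratch_tile_t &scratch_tile, int num_tiles)
{
    for (int tile = 0; tile < num_tiles; ++tile)
    {
        loader.request();               // Issue (guarded) vector loads into registers
        loader.next();                  // Advance the global pointer by one K-tile
        __syncthreads();                // Protect the previous shared tile from overwrite
        loader.commit(scratch_tile);    // Store the fetched tile to shared memory
        __syncthreads();                // Make the new tile visible to all threads
        // ... consume the shared tile (e.g., accumulate outer products) ...
    }
}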
} // namespace gemm
} // namespace cutlass

View File

@ -1,403 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Tile-loading abstraction for thread blocks
*/
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* block_loader (CrosswiseCopy specialization)
******************************************************************************/
/**
* \brief A three-phase data loading abstraction (prefetch, commit, and
* advance) for iterating over ranges of block-wide matrix tiles.
* (CrosswiseCopy specialization)
*
* Each iteration sequence produces a KxL (height-by-width) block-wide tile of
* value_t in shared memory. The layout of the shared block-wide tile is
* a row-major (L-major) tiling of dp_vector_t items, which are themselves
* column-major (K-major) vectors of value_t. Its dimensions are:
* K = BlockDpVectorsK * (sizeof(dp_vector_t) / sizeof(value_t))
* L = BlockDpVectorsL
*
* The data is copied from a corresponding tile of global matrix data whose
* layout of value_t is K-major. This constitutes a CrosswiseCopy between
* the K-major global tile and the L-major shared tile.
*
* NB: The orientation of dp_vector_t components in shared memory is congruous
* with the global matrix data, so we can use dp_vector_t as the minimum
* granularity of data transfer without any intermediate {dis|re}assembly
* of its value_t components. However, the global and shared memory layouts
* of dp_vector_t items are cross-wise with respect to each other, so any
* further LDG-vectorization of dp_vector_t data requires intermediate
* disassembly into dp_vector_t components to be stored individually into
* the shared tile.
*
* NB: Consecutive threads within a block are mapped in K-major
* fashion down a first set of LDG-vectors of dp_vector_t within their global
* tile. Successive sets of LDG-vectors are then strip-mined as necessary
* across the L-axis. These discontiguous LDG-vectors comprise the thread's
* "slice" of the block-wide tile.
*/
template <
int BlockThreads, ///< Number of threads in each thread block (blockDim.x)
int BlockDpVectorsK, ///< Extent of block-wide tile in dp_vector_t along the K-axis (height)
int BlockDpVectorsL, ///< Extent of block-wide tile in dp_vector_t along the L-axis (width)
typename value_t, ///< Input matrix value type
int LeadingDimAlignBytes, ///< Byte alignment of input matrix leading dimension
bool AllowRaggedTiles, ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
typename dp_vector_t> ///< Dot-product vector type along the K-axis
struct block_loader<
BlockThreads,
BlockDpVectorsK,
BlockDpVectorsL,
value_t,
LeadingDimAlignBytes,
AllowRaggedTiles,
dp_vector_t,
load_algorithm::CrosswiseCopy> ///< Algorithm for loading a shared tile of KxL matrix data (CrosswiseCopy specialization)
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
enum
{
/// Number of value_t in a dp_vector_t
DpVectorItems = divide_assert<sizeof(dp_vector_t), sizeof(value_t)>::value,
/// Number of dp_vector_t in a block-wide tile
BlockDpVectors = BlockDpVectorsK * BlockDpVectorsL,
/// Number of dp_vector_t in a thread-tile
ThreadDpVectors = divide_assert<BlockDpVectors, BlockThreads>::value,
};
/// Data movement type, coarsened by LeadingDimAlignBytes and capped at the
/// smaller of ThreadDpVectors and BlockDpVectorsK
typedef io_vector<
dp_vector_t,
__NV_STD_MIN(ThreadDpVectors, BlockDpVectorsK),
LeadingDimAlignBytes>
ldg_vector_t;
enum
{
/// Number of dp_vector_t per ldg_vector_t
LdgVectorDpVectors = ldg_vector_t::VectorItems,
/// Number of value_t per ldg_vector_t
LdgVectorItems = LdgVectorDpVectors * DpVectorItems,
/// Total number of ldg_vector_t within each block-wide tile
BlockLdgVectors = divide_assert<BlockDpVectors, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along K-axis
BlockLdgVectorsK = divide_assert<BlockDpVectorsK, LdgVectorDpVectors>::value,
/// Extent of the block-wide tile in ldg_vector_t along L-axis
BlockLdgVectorsL = BlockDpVectorsL,
/// Number of ldg_vector_t within each thread-tile
ThreadLdgVectors = divide_assert<BlockLdgVectors, BlockThreads>::value,
/// Extent of the thread tile in ldg_vector_t along K-axis
ThreadLdgVectorsK = __NV_STD_MAX(1, (BlockLdgVectorsK / BlockThreads)),
/// Extent of the thread tile in ldg_vector_t along L-axis
ThreadLdgVectorsL = divide_assert<ThreadLdgVectors, ThreadLdgVectorsK>::value,
/// Number of ldg_vector_t within each stripmine-tile
StripmineLdgVectors = BlockThreads,
/// Extent of the stripmine tile in ldg_vector_t along K-axis
StripmineLdgVectorsK = __NV_STD_MIN(BlockLdgVectorsK, StripmineLdgVectors),
/// Extent of the stripmine tile in ldg_vector_t along L-axis
StripmineLdgVectorsL = divide_assert<StripmineLdgVectors, StripmineLdgVectorsK>::value,
/// Alignment in dp_vector_t along L needed for committing prefetch
AlignmentDpVectorsL = 1,
};
/// Predicate bit vector
typedef uint64_t predicate_mask_t;
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
static_assert(
(ThreadLdgVectors <= sizeof(predicate_mask_t) * 8),
"Predicate mask type does not contain enough bits for encoding load predicates");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Input pointer to matrix in ldg_vector_t
ldg_vector_t *d_matrix_ldgvecs;
/// Extent of the input matrix in ldg_vector_t along the L-axis
int matrix_ldgvecs_l;
/// Thread block's ending ldg_vector_t coordinate (k) within the input matrix (one-past)
int block_end_ldgvec_k;
/// Predicate bits for guarding ldg_vector_t loads within "whole-k" block-wide tiles
predicate_mask_t guard;
/// Predicate bits for guarding ldg_vector_t loads within the final block-wide "residue" tile
predicate_mask_t residue_guard;
/// Iteration span in "whole-k" block-wide tiles
int wholek_tiles_remaining;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_ldgvec_stride_k;
/// Distance in ldg_vector_t within pitched-linear memory between successive coordinates along the L-axis
int matrix_ldgvec_stride_l;
/// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
int2 block_thread_ldgvec_coords;
/// Thread-wide tile of prefetch data
ldg_vector_t thread_tile[ThreadLdgVectorsK][ThreadLdgVectorsL];
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_loader(
value_t *d_matrix_items, ///< Input pointer to matrix in value_t
int matrix_items_l, ///< Extent of the input matrix in value_t along the L-axis
int matrix_items_stride_k, ///< Distance in value_t within pitched-linear memory between successive coordinates along the K-axis
int matrix_items_stride_l, ///< Distance in value_t within pitched-linear memory between successive coordinates along the L-axis
int2 matrix_block_item_coords, ///< value_t coordinates (l, k) of first block-wide tile within the input matrix
int block_end_item_k) ///< Thread block's ending coordinate (k) within the input matrix (one-past)
:
block_end_ldgvec_k(block_end_item_k),
guard(0),
residue_guard(0)
{
matrix_ldgvecs_l = matrix_items_l;
matrix_ldgvec_stride_k = matrix_items_stride_k;
matrix_ldgvec_stride_l = (matrix_items_stride_l / LdgVectorItems);
// ldg_vector_t coordinates (l, k) of thread-tile within the block-wide tile
block_thread_ldgvec_coords = make_int2(
(threadIdx.x / BlockLdgVectorsK), // l-coordinate
(threadIdx.x % BlockLdgVectorsK)); // k-coordinate
// ldg_vector_t coordinates (l, k) of first block-wide tile within the input matrix
int2 matrix_block_ldgvec_coords = make_int2(
matrix_block_item_coords.x, // l-coordinate
matrix_block_item_coords.y / LdgVectorItems); // k-coordinate
// Iteration span in ldg_vector_t
int span_ldgvec_k = (block_end_item_k - matrix_block_item_coords.y) / LdgVectorItems;
// ldg_vector_t coordinates (l, k) of first thread-tile within the input matrix
int2 matrix_thread_ldgvec_coords = make_int2(
block_thread_ldgvec_coords.x + matrix_block_ldgvec_coords.x,
block_thread_ldgvec_coords.y + matrix_block_ldgvec_coords.y);
// Iteration range in "whole-k" block-wide tiles
wholek_tiles_remaining = span_ldgvec_k / BlockLdgVectorsK;
// Extent of final residue-tile in ldg_vector_t along K-axis
int residue_ldgvecs_k = span_ldgvec_k % BlockLdgVectorsK;
// Initialize I/O predicates
if (AllowRaggedTiles)
{
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
int block_ldgvec_k = block_thread_ldgvec_coords.y + (thread_ldgvec_k * StripmineLdgVectorsK);
// Whether block_ldgvec_coords.y is valid in the final residue tile
predicate_mask_t valid_k = (block_ldgvec_k < residue_ldgvecs_k);
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
// Whether block_ldgvec_coords.x is valid in any block-wide tile
predicate_mask_t valid_l = (matrix_block_ldgvec_coords.x + block_ldgvec_l < matrix_ldgvecs_l);
// Linear index of ldg_vector_t load
int ldgvec_idx = thread_ldgvec_l + (thread_ldgvec_k * ThreadLdgVectorsL);
// Set predicate guard bits
guard |= (valid_l << ldgvec_idx);
residue_guard |= ((valid_l & valid_k) << ldgvec_idx);
}
}
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
// Update the input pointer to be matrix_thread_ldgvec_coords
this->d_matrix_ldgvecs =
reinterpret_cast<ldg_vector_t*>(d_matrix_items) +
(matrix_thread_ldgvec_coords.y * matrix_ldgvec_stride_k) +
(matrix_thread_ldgvec_coords.x * matrix_ldgvec_stride_l);
}
//-------------------------------------------------------------------------
// Loader API
//-------------------------------------------------------------------------
/**
* Request the current block-wide tile
*/
inline __device__
void request()
{
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
// Linear index of ldg_vector_t load
int ldgvec_idx = (thread_ldgvec_k * ThreadLdgVectorsL) + thread_ldgvec_l;
// Unpack predicate guard
predicate_mask_t valid = ((guard >> ldgvec_idx) & 1);
if (!AllowRaggedTiles || valid)
{
// Perform load
thread_tile[thread_ldgvec_k][thread_ldgvec_l].load(
d_matrix_ldgvecs +
(thread_ldgvec_k * StripmineLdgVectorsK * matrix_ldgvec_stride_k) +
(thread_ldgvec_l * StripmineLdgVectorsL * matrix_ldgvec_stride_l));
}
else
{
// Zero-initialize
#pragma unroll
for (int dpvec = 0; dpvec < LdgVectorDpVectors; ++dpvec)
thread_tile[thread_ldgvec_k][thread_ldgvec_l].buff[dpvec] = 0;
}
}
}
}
/**
* Advance the loader to the next block-wide tile in the K-axis
*/
inline __device__
void next()
{
d_matrix_ldgvecs += (matrix_ldgvec_stride_k * BlockLdgVectorsK);
if (AllowRaggedTiles)
{
--wholek_tiles_remaining;
// Promote residue-guard to primary-guard if no full tiles remain
if (!wholek_tiles_remaining)
{
guard = residue_guard;
}
}
}
/**
* Commit the previously-requested block-wide tile to shared memory
*
* NB: To facilitate padding for avoiding shared memory bank conflicts, we
* allow the row stride SmemDpVectorsL to be arbitrarily bigger than the
* tile width BlockDpVectorsL.
*/
template <int SmemDpVectorsL>
inline __device__
void commit(
dp_vector_t (&scratch_tile)[BlockDpVectorsK][SmemDpVectorsL])
{
static_assert(SmemDpVectorsL >= BlockDpVectorsL, "Row stride must be >= tile width.");
// Outer thread-tile ldg_vector_t iteration (K-axis)
#pragma unroll
for (int thread_ldgvec_k = 0; thread_ldgvec_k < ThreadLdgVectorsK; ++thread_ldgvec_k)
{
int block_ldgvec_k = block_thread_ldgvec_coords.y + (thread_ldgvec_k * StripmineLdgVectorsK);
// Inner thread-tile ldg_vector_t iteration (L-axis)
#pragma unroll
for (int thread_ldgvec_l = 0; thread_ldgvec_l < ThreadLdgVectorsL; ++thread_ldgvec_l)
{
int block_ldgvec_l = block_thread_ldgvec_coords.x + (thread_ldgvec_l * StripmineLdgVectorsL);
// Write column of dp_vector_t
#pragma unroll
for (int dpvec = 0; dpvec < LdgVectorDpVectors; ++dpvec)
{
scratch_tile[(block_ldgvec_k * LdgVectorDpVectors) + dpvec][block_ldgvec_l] =
thread_tile[thread_ldgvec_k][thread_ldgvec_l].buff[dpvec];
}
}
}
}
};
} // namespace gemm
} // namespace cutlass

View File

@ -1,314 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Tile-loading abstraction for thread blocks
*/
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/**
* block-wide tile loader supporting congruous mapping of data from source and
* destination addressable storage. Typically, this will be used to load a
* block-wide tile from global memory into shared memory.
*
* The caller specifies a MatrixAlignBytes alignment guarantee for the input
* pointer so that memory operations can be performed on whole vectors. This
* increases the efficiency of memory operations and reduces the number of
* guard predicates needed.
*
*/
template <
bool congruous, ///< Indicates whether the "GEMM K" dimension refers to the strided matrix dimension
int BlockThreads, ///< Number of threads participating in the streaming operation
int BlockItemsL, ///< Extent of block-wide tile in value_t along the L-axis (width)
int BlockItemsK, ///< Extent of block-wide tile in value_t along the K-axis (height)
typename value_t, ///< Input matrix value type
int MatrixAlignBytes, ///< Byte alignment of input matrix
bool AllowRaggedTiles ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
>
struct block_loader_wmma
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
/// Predicate bit vector
typedef uint64_t predicate_mask_t;
/// Data movement type, coarsened by MatrixAlignBytes
typedef io_vector<
value_t,
divide_assert<MatrixAlignBytes, sizeof(value_t)>::value,
MatrixAlignBytes>
ldg_vector_t;
enum
{
/// Number of items per ldg_vector_t
LdgVectorItems = ldg_vector_t::VectorItems,
/// Total number of ldg_vector_t within the block-wide tile
BlockLdgVectors = divide_assert<(BlockItemsL * BlockItemsK), LdgVectorItems>::value,
/// Extent of the block-wide tile in ldg_vector_t along K-axis
BlockLdgVectorsK = BlockItemsK,
/// Extent of the block-wide tile in ldg_vector_t along L-axis
BlockLdgVectorsL = divide_assert<BlockItemsL, LdgVectorItems>::value,
/// Number of ldg_vector_t within each thread tile
ThreadLdgVectors = divide_assert<BlockLdgVectors, BlockThreads>::value,
/// Extent of the thread tile in ldg_vector_t along the L-axis
ThreadLdgVectorsL = __NV_STD_MAX(1, BlockLdgVectorsL / BlockThreads),
/// Block-wide strip-mining distance between ldg_vector_t along the K-axis
BlockLdgVectorStrideK = __NV_STD_MAX(1, BlockThreads / BlockLdgVectorsL),
/// Extent of the thread tile in ldg_vector_t along the K-axis
ThreadLdgVectorsK = divide_assert<BlockLdgVectorsK, BlockLdgVectorStrideK>::value,
};
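// Worked example (illustrative values only): with BlockThreads = 256, a
// 128 x 32 (L x K) block-wide tile, a 16-bit value_t, and MatrixAlignBytes = 16,
// each ldg_vector_t carries LdgVectorItems = 16 / 2 = 8 values. Then
// BlockLdgVectorsL = 128 / 8 = 16, BlockLdgVectorsK = 32, and
// BlockLdgVectors = 512, so each thread owns ThreadLdgVectors = 512 / 256 = 2
// vectors arranged as ThreadLdgVectorsL = 1 by ThreadLdgVectorsK = 2, with a
// block-wide K-axis strip-mining stride of BlockLdgVectorStrideK = 256 / 16 = 16.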
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
/// Define assertions
static_assert(ThreadLdgVectorsL * ThreadLdgVectorsK == ThreadLdgVectors,
"Number of vectors must be fully covered by the thread's 2D vector tile.");
/// Predicate masks must be large enough to guard every vector load
static_assert(sizeof(predicate_mask_t) * 8 >= ThreadLdgVectorsL * ThreadLdgVectorsK,
"Predicate bit vector must be large enough to guard every vector load.");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// pointer to tile in global memory
const ldg_vector_t *ptr;
/// stride of the matrix in the K-axis
int matrix_values_stride_k;
/// Guard predicate
predicate_mask_t guard;
/// Guard for the last request iteration
predicate_mask_t residue_guard;
/// Number of 'whole' request iterations before encountering the residue
int request_iterations;
/// fetch registers
ldg_vector_t fetch[ThreadLdgVectors];
/// Thread's base L-axis offset from the start of a block-wide tile
int thread_offset_l;
/// Thread's base K-axis offset from the start of a block-wide tile
int thread_offset_k;
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_loader_wmma(
const value_t *d_matrix, ///< Pointer to input matrix
int matrix_values_l, ///< Extent of the input matrix in value_t along the L-axis
int start_l, ///< Starting location in tile
int dim_k, ///< Inner dimension of tile, used for computing guard predicates
int _matrix_values_stride_k, ///< Stride of the matrix along the K-axis
int start_k, ///< Tile's starting location
int2 block_begin_item_coords) ///< Thread block's starting value_t coordinates (l, k) within the input matrix
:
ptr(reinterpret_cast<const ldg_vector_t *>(d_matrix)),
matrix_values_stride_k(_matrix_values_stride_k / LdgVectorItems),
guard(0),
residue_guard(0)
{
// Compute block's starting coordinates in units of vectors
int block_base_l = block_begin_item_coords.x / LdgVectorItems;
int block_base_k = block_begin_item_coords.y;
// Compute a thread tiling of the block-wide tile
int tid = threadIdx.x;
thread_offset_l = tid % BlockLdgVectorsL;
thread_offset_k = tid / BlockLdgVectorsL;
// Add the block and thread offsets to the source pointer
ptr += (block_base_l + thread_offset_l) +
(block_base_k + thread_offset_k) * matrix_values_stride_k;
// When AllowRaggedTiles support is enabled, compute a bit vector of guard
// predicates
if (AllowRaggedTiles)
{
if (congruous)
{
request_iterations = (dim_k - start_k) / BlockItemsK;
}
else
{
request_iterations = (matrix_values_l - start_l) / BlockItemsL;
}
#pragma unroll
for (int k_idx = 0; k_idx < ThreadLdgVectorsK; ++k_idx)
{
#pragma unroll
for (int l_idx = 0; l_idx < ThreadLdgVectorsL; ++l_idx)
{
int item = l_idx + k_idx * ThreadLdgVectorsL;
// Global vector L and K indices
int vec_l = l_idx * BlockThreads;
int vec_k = k_idx * BlockLdgVectorStrideK;
predicate_mask_t pred;
predicate_mask_t residue_pred;
if (congruous)
{
pred = (((block_base_l + thread_offset_l + vec_l) * LdgVectorItems < matrix_values_l) ? 1 : 0);
residue_pred = ((block_base_k + thread_offset_k + vec_k < (dim_k % BlockItemsK)) ? 1 : 0);
}
else
{
pred = ((block_base_k + thread_offset_k + vec_k < dim_k) ? 1 : 0);
residue_pred = (((block_base_l + thread_offset_l + vec_l) * LdgVectorItems < (matrix_values_l % BlockItemsL)) ? 1 : 0);
}
// Update the guard and residue_guard word with predicate bits
guard |= (pred << item);
residue_guard |= (residue_pred << item);
}
}
// If there are zero full request iterations, compute the intersection
// with the residue guard.
if (!request_iterations)
{
guard &= residue_guard;
}
}
}
/**
* Request the current block-wide tile from source memory
*/
inline __device__
void request()
{
#pragma unroll
for (int k_idx = 0; k_idx < ThreadLdgVectorsK; ++k_idx)
{
#pragma unroll
for (int l_idx = 0; l_idx < ThreadLdgVectorsL; ++l_idx)
{
int load_idx = l_idx + (k_idx * ThreadLdgVectorsL);
bool pred = !AllowRaggedTiles || (guard & (predicate_mask_t(1) << load_idx));
if (pred)
{
fetch[load_idx].load(
ptr +
(k_idx * BlockLdgVectorStrideK * matrix_values_stride_k) + (l_idx * BlockThreads));
}
else
{
#pragma unroll
for (int elem_idx = 0; elem_idx < LdgVectorItems; ++elem_idx)
{
fetch[load_idx].buff[elem_idx] = 0;
}
}
}
}
}
/// Advance to the next block-wide tile
inline __device__
void next()
{
if (congruous)
{
ptr += BlockItemsK * matrix_values_stride_k;
}
else
{
ptr += BlockLdgVectorsL;
}
// Track number of full iterations to intersect with the residue guard predicates.
if (AllowRaggedTiles)
{
--request_iterations;
if (!request_iterations)
{
guard &= residue_guard;
}
}
}
/// Commit the values to the scratch tile to destination memory.
template <int SmemStride>
inline __device__
void commit(value_t *scratch_tile)
{
static_assert(SmemStride % LdgVectorItems == 0,
"SMEM stride must be divisible by the size of vector loads");
ldg_vector_t *smem_ptr = reinterpret_cast<ldg_vector_t *>(scratch_tile);
smem_ptr += thread_offset_l + thread_offset_k * SmemStride / LdgVectorItems;
#pragma unroll
for (int k_idx = 0; k_idx < ThreadLdgVectorsK; ++k_idx)
{
#pragma unroll
for (int l_idx = 0; l_idx < ThreadLdgVectorsL; ++l_idx)
{
int load_idx = l_idx + (k_idx * ThreadLdgVectorsL);
fetch[load_idx].store(smem_ptr +
(k_idx * BlockLdgVectorStrideK * SmemStride / LdgVectorItems) +
(l_idx * BlockThreads));
}
}
}
};
} // namespace gemm
} // namespace cutlass

View File

@ -1,669 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* A block-wide task abstraction for computing device-wide GEMM
*/
#include <stdint.h>
#include "../util/util.h"
#include "grid_raster.h"
#include "block_loader.h"
#include "k_split_control.h"
#include "thread_accumulator.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* block_task_policy
******************************************************************************/
/**
* \brief Parameterizable tuning policy for \p block_task
*
* Once parameterized, \p block_task_policy provides the member constant
* \p BlockThreads indicating the required thread block size
*/
template <
int _BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int _BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
int _BlockItemsK, ///< Extent of block-wide A|B tiles in value_t along the K-axis
int _ThreadItemsY, ///< Height in rows of a thread tile in C
int _ThreadItemsX, ///< Width in columns of a thread tile in C
bool _UseDoubleScratchTiles, ///< Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
grid_raster_strategy::kind_t _RasterStrategy> ///< Strategy for enumerating \p block_task within an input matrix
struct block_task_policy
{
enum
{
/// Height in rows of a block-wide tile in matrix C
BlockItemsY = _BlockItemsY,
/// Width in columns of a block-wide tile in matrix C
BlockItemsX = _BlockItemsX,
/// Height in rows of a thread tile in C
ThreadItemsY = _ThreadItemsY,
/// Width in columns of a thread tile in C
ThreadItemsX = _ThreadItemsX,
/// Extent of block-wide A|B tiles in value_t along the K-axis
BlockItemsK = _BlockItemsK,
/// Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
UseDoubleScratchTiles = _UseDoubleScratchTiles,
/// Number of threads in each thread block (blockDim.x)
BlockThreads = divide_assert<
(BlockItemsY * BlockItemsX),
(ThreadItemsY * ThreadItemsX)>::value,
};
/// Strategy for enumerating \p block_task within an input matrix
static const grid_raster_strategy::kind_t RasterStrategy = _RasterStrategy;
};
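// Example (illustrative values only): a policy describing a 128x128 block-wide
// C-tile built from 8x8 thread tiles yields
//   BlockThreads = (128 * 128) / (8 * 8) = 256,
// so any kernel specialized on that policy must be launched with
// blockDim.x == 256. divide_assert additionally guarantees at compile time
// that the thread tiles evenly cover the block-wide tile.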
/******************************************************************************
* block_task
******************************************************************************/
/**
* \brief A block-wide task abstraction for computing device-wide GEMM
*
* Each thread_block is assigned a unique tile of output matrix C to compute by
* consuming the corresponding stripes of the input matrices A and B.
*/
template <
typename block_task_policy_t, ///< Parameterization of block_task_policy
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
int LdgAlignA, ///< Alignment (in bytes) for A operand
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
int LdgAlignB, ///< Alignment (in bytes) for B operand
typename epilogue_op_t, ///< Epilogue operation applied to GEMM
int LdgAlignC, ///< Alignment (in bytes) for C operand
bool AllowRaggedTiles ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
>
struct block_task
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
enum
{
/// Number of threads in each thread block (blockDim.x)
BlockThreads = block_task_policy_t::BlockThreads,
/// Extent of thread tile in value_t along M-axis
ThreadItemsY = block_task_policy_t::ThreadItemsY,
/// Extent of thread tile in value_t along N-axis
ThreadItemsX = block_task_policy_t::ThreadItemsX,
};
/// Accumulator type
typedef thread_accumulator<
ThreadItemsY,
ThreadItemsX,
value_t,
accum_t>
thread_accumulator_t;
/// Dot-product vector type along the K-axis (e.g., uchar4 when using IDP4A)
typedef typename thread_accumulator_t::dp_vector_t dp_vector_t;
enum
{
/// Whether this is a small, latency-bound tile
IsSmallTile = (ThreadItemsY < 4) && (ThreadItemsX < 4),
/// Number of value_t in dp_vector_t
DpVectorItems = divide_assert<sizeof(dp_vector_t), sizeof(value_t)>::value,
/// Extent of block-wide C-tile in accum_t (and A-tiles in value_t) along M-axis (height)
BlockItemsY = block_task_policy_t::BlockItemsY,
/// Extent of block-wide C-tile in accum_t (and B-tiles in value_t) along N-axis (width)
BlockItemsX = block_task_policy_t::BlockItemsX,
/// Extent of block-wide A|B tiles in value_t along the K-axis
BlockItemsK = block_task_policy_t::BlockItemsK,
/// Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
UseDoubleScratchTiles = block_task_policy_t::UseDoubleScratchTiles,
/// Extent of block-wide A|B tiles in dp_vector_t along the K-axis
BlockDpVectorsK = divide_assert<BlockItemsK, DpVectorItems>::value,
/// Number of dp_vector_t along M-axis that can be read in a single LDS from the shared A-tile (up to 128b if more than one value_t)
LdsVectorDpVectorsA = __NV_STD_MIN(
ThreadItemsY,
__NV_STD_MAX(1, (128 / (__NV_STD_MAX(sizeof(dp_vector_t), sizeof(accum_t)) * 8)))),
/// Number of dp_vector_t along N-axis that can be read in a single LDS from the shared B-tile (up to 128b if more than one value_t)
LdsVectorDpVectorsB = __NV_STD_MIN(
ThreadItemsX,
__NV_STD_MAX(1, (128 / (__NV_STD_MAX(sizeof(dp_vector_t), sizeof(accum_t)) * 8)))),
/// Number of strip-mined LDS vector reads from shared A-tile
ThreadLdsVectorsA = divide_assert<ThreadItemsY, LdsVectorDpVectorsA>::value,
/// Number of strip-mined LDS vector reads from shared B-tile
ThreadLdsVectorsB = divide_assert<ThreadItemsX, LdsVectorDpVectorsB>::value,
/// Number of elements in one LDG/STG vector of C-tile
ThreadLdgVectorSizeC = __NV_STD_MIN(LdgAlignC, 16) / (sizeof(accum_t)),
/// Number of threads in warp
WarpThreads = 32,
/// Extent of warp in threads along the M-axis
WarpThreadsY = (BlockItemsY > BlockItemsX) ? 8 : 4,
/// Extent of warp in threads along the N-axis
WarpThreadsX = divide_assert<WarpThreads, WarpThreadsY>::value,
/// Extent of warp-wide tile in items along the M-axis
WarpItemsY = WarpThreadsY * ThreadItemsY,
/// Extent of warp-wide tile in items along the N-axis
WarpItemsX = WarpThreadsX * ThreadItemsX,
/// Extent of block in warps along M-axis
BlockWarpsY = divide_assert<BlockItemsY, WarpItemsY>::value,
/// Extent of block in warps along N-axis
BlockWarpsX = divide_assert<BlockItemsX, WarpItemsX>::value,
};
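// Worked example (continuing the illustrative 128x128 block-wide tile with
// 8x8 thread tiles): the tile is square, so WarpThreadsY = 4 and
// WarpThreadsX = 32 / 4 = 8, giving WarpItemsY = 4 * 8 = 32 and
// WarpItemsX = 8 * 8 = 64. The block-wide tile therefore decomposes into
// BlockWarpsY x BlockWarpsX = 4 x 2 = 8 warps, matching the 256 threads
// required by the policy.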
/// Load-from-shared data movement type for A-tile, coarsened by LdsVectorDpVectorsA
typedef io_vector<dp_vector_t, LdsVectorDpVectorsA> lds_vector_a_t;
/// Load-from-shared data movement type for B-tile, coarsened by LdsVectorDpVectorsB
typedef io_vector<dp_vector_t, LdsVectorDpVectorsB> lds_vector_b_t;
/// Thread block rasterization helper type
typedef grid_raster<
BlockItemsY,
BlockItemsX,
TransformA,
TransformB,
block_task_policy_t::RasterStrategy>
grid_raster_t;
/// Tile loader type for matrix A
typedef block_loader<
BlockThreads, // BlockThreads
BlockDpVectorsK, // BlockDpVectorsK
BlockItemsY, // BlockItemsL
value_t, // value_t
LdgAlignA, // MatrixAlignBytes
AllowRaggedTiles, // AllowRaggedTiles
dp_vector_t, // dp_vector_t
(TransformA == matrix_transform_t::NonTranspose) ? // LoadAlgorithm
load_algorithm::CongruousCopy :
load_algorithm::CrosswiseCopy>
block_loader_a_t;
/// Tile loader type for matrix B
typedef block_loader<
BlockThreads, // BlockThreads
BlockDpVectorsK, // BlockDpVectorsK
BlockItemsX, // BlockItemsL
value_t, // value_t
LdgAlignB, // MatrixAlignBytes
AllowRaggedTiles, // AllowRaggedTiles
dp_vector_t, // dp_vector_t
(TransformB == matrix_transform_t::NonTranspose) ? // LoadAlgorithm
load_algorithm::CrosswiseCopy :
load_algorithm::CongruousCopy>
block_loader_b_t;
enum
{
/// Number of value_t to pad the end of each row of the shared A-tile
PadItemsA = (TransformA == matrix_transform_t::NonTranspose) ?
__NV_STD_MAX(LdsVectorDpVectorsA, block_loader_a_t::AlignmentDpVectorsL) :
LdsVectorDpVectorsA,
/// Number of value_t to pad the end of each row of the shared B-tile
PadItemsB = (TransformB == matrix_transform_t::NonTranspose) ?
LdsVectorDpVectorsB :
__NV_STD_MAX(LdsVectorDpVectorsB, block_loader_b_t::AlignmentDpVectorsL),
};
/// Shared memory layout for a prefetch page
struct page_storage_t
{
/// Tile of A
dp_vector_t __align__(16) block_a[BlockDpVectorsK][BlockItemsY + PadItemsA];
/// Tile of B
dp_vector_t __align__(16) block_b[BlockDpVectorsK][BlockItemsX + PadItemsB];
};
/// Shared memory layout for scratch storage
struct scratch_storage_t
{
/// Prefetch pages
page_storage_t pages[UseDoubleScratchTiles ? 2 : 1];
/// Accumulator shared scratch
typename thread_accumulator_t::scratch_storage_t accum_scratch;
};
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
// Ensure we have at least two unrolled innermost loop iterations (one to prefetch
// the next global tile and then one to prefetch the first strip of it from shared)
static_assert ((BlockDpVectorsK >= 2), "BlockDpVectorsK must be >= 2.");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Scratch storage reference
scratch_storage_t *scratch;
/// Which page of scratch tiles we're currently reading from
int page_idx;
/// Pointer to matrix C
accum_t *d_c;
/// Epilogue operation applied to update matrix C
epilogue_op_t epilogue_op;
/// Matrix height in rows of trans_op(A) and C
int dim_m;
/// Matrix width in columns of trans_op(B) and C
int dim_n;
/// Control for inter-block k-splitting
k_split_control k_split;
/// Thread block's base value_t coordinates (m, n) in matrix C
grid_raster_t grid_raster;
/// Thread block's current coordinate (k) within A|B matrices
int block_item_coords_k;
/// Thread block's ending coordinate (k) within A|B matrices (one-past)
int block_end_item_k;
/// Warp's coordinates (x, y) in thread block
int2 block_warp_coords;
/// Thread's coordinates (x, y) in warp
int2 warp_thread_coords;
/// Thread's base item offset within strip of A tile
int thread_strip_offset_a;
/// Thread's base item offset within strip of B tile
int thread_strip_offset_b;
/// Thread's active-k/prefetch-k slices from shared A tile
lds_vector_a_t local_slices_a[2][ThreadLdsVectorsA];
/// Thread's active-k/prefetch-k slices from shared B tile
lds_vector_b_t local_slices_b[2][ThreadLdsVectorsB];
/// A tile loader
block_loader_a_t loader_a;
/// B tile loader
block_loader_b_t loader_b;
/// C tile accumulator
thread_accumulator_t accumulator;
//-------------------------------------------------------------------------
// Coordinate system helpers
//-------------------------------------------------------------------------
/// Compute the warp's coordinates (x, y) in thread block
inline __device__
int2 warp_coords()
{
int warp_id = threadIdx.x / WarpThreads;
return make_int2(
warp_id % BlockWarpsX,
warp_id / BlockWarpsX);
}
/// Compute the thread's lane-coordinates (x, y) in warp
inline __device__
int2 thread_coords()
{
int lane_id = threadIdx.x % WarpThreads;
// Maxwell+ mapping of threads within a 2D warp for maximal LDS bandwidth
return make_int2(
lane_id / WarpThreadsY,
lane_id % WarpThreadsY);
}
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_task(
scratch_storage_t *scratch,
value_t *d_a,
value_t *d_b,
accum_t *d_c,
epilogue_op_t epilogue_op,
int dim_m,
int dim_n,
int dim_k,
k_split_control k_split)
:
scratch(scratch),
page_idx(0),
d_c(d_c),
epilogue_op(epilogue_op),
dim_m(dim_m),
dim_n(dim_n),
k_split(k_split),
block_item_coords_k(k_split.block_begin_item_k()),
block_end_item_k(k_split.block_end_item_k(dim_k)),
block_warp_coords(warp_coords()),
warp_thread_coords(thread_coords()),
thread_strip_offset_a((warp_thread_coords.y * LdsVectorDpVectorsA) + (block_warp_coords.y * WarpItemsY)),
thread_strip_offset_b((warp_thread_coords.x * LdsVectorDpVectorsB) + (block_warp_coords.x * WarpItemsX)),
loader_a(
d_a, // d_matrix
dim_m, // matrix_values_l
(TransformA == matrix_transform_t::NonTranspose) ? dim_m : 1, // matrix_values_stride_k
(TransformA == matrix_transform_t::NonTranspose) ? 1 : dim_k, // matrix_values_stride_l
make_int2( // block_begin_item_coords
grid_raster.block_item_coords.y,
block_item_coords_k),
block_end_item_k), // block_end_item_k
loader_b(
d_b, // d_matrix
dim_n, // matrix_values_l
(TransformB == matrix_transform_t::NonTranspose) ? 1 : dim_n, // matrix_values_stride_k
(TransformB == matrix_transform_t::NonTranspose) ? dim_k : 1, // matrix_values_stride_l
make_int2( // block_begin_item_coords
grid_raster.block_item_coords.x,
block_item_coords_k),
block_end_item_k), // block_end_item_k
accumulator(scratch->accum_scratch)
{}
//-------------------------------------------------------------------------
// Prefetching utility methods
//-------------------------------------------------------------------------
/**
* Request the calling thread's slices of the shared tiles at depth \p tile_offset_k
*/
inline __device__ void request_local_prefetch(
lds_vector_a_t (&slice_a)[ThreadLdsVectorsA], ///< Slice from A
lds_vector_b_t (&slice_b)[ThreadLdsVectorsB], ///< Slice from B
int tile_offset_k)
{
// Load B strip
for (int i = 0; i < ThreadLdsVectorsB; ++i)
{
slice_b[i].load(
&scratch->pages[page_idx].block_b[tile_offset_k][thread_strip_offset_b + (i * WarpThreadsX * LdsVectorDpVectorsB)]);
}
// Load A strip
for (int i = 0; i < ThreadLdsVectorsA; ++i)
{
slice_a[i].load(
&scratch->pages[page_idx].block_a[tile_offset_k][thread_strip_offset_a + (i * WarpThreadsY * LdsVectorDpVectorsA)]);
}
}
//-------------------------------------------------------------------------
// Epilogue
//-------------------------------------------------------------------------
/**
* Performs the GEMM epilogue:
* - Applies the scalar multipliers and addends to the accumulators
* - Writes the result to the output matrix
*/
inline __device__ void epilogue()
{
// Wait for predecessor thread block(s) to produce block-wide tile of
// exclusive partial-sums
k_split.wait();
// Configure the epilogue according to whether the thread block is a secondary
// accumulator in an inter-block k-splitting scheme
if (k_split.is_secondary_accumulator())
epilogue_op.set_secondary_accumulator();
// Whether the addend from C needs loading
bool must_init_addend = epilogue_op.must_init_addend();
#pragma unroll
for (int x = 0; x < ThreadItemsX; ++x)
{
#pragma unroll
for (int y = 0; y < ThreadItemsY; y += LdsVectorDpVectorsA)
{
int thread_strip_b = x / LdsVectorDpVectorsB;
int thread_strip_a = y / LdsVectorDpVectorsA;
int thread_item_coords_tile_x = thread_strip_offset_b + (thread_strip_b * WarpThreadsX * LdsVectorDpVectorsB) + (x % LdsVectorDpVectorsB);
int thread_item_coords_tile_y = thread_strip_offset_a + (thread_strip_a * WarpThreadsY * LdsVectorDpVectorsA) + (y % LdsVectorDpVectorsA);
int c_idx = (grid_raster.block_item_coords.x + thread_item_coords_tile_x) * dim_m +
grid_raster.block_item_coords.y + thread_item_coords_tile_y;
accum_t *my_c = d_c + c_idx;
#pragma unroll
for (int i = 0; i < LdsVectorDpVectorsA; ++i)
{
accum_t c_slice = accum_t(0);
accum_t *c_ptr = my_c + i;
if ((grid_raster.block_item_coords.x + thread_item_coords_tile_x) < dim_n &&
(grid_raster.block_item_coords.y + thread_item_coords_tile_y + i) < dim_m)
{
if (must_init_addend)
{
ldg_cg(c_slice, c_ptr);
}
c_slice = epilogue_op(accumulator.get(x, y + i), c_slice, c_idx + i);
stg_cg(c_ptr, c_slice);
}
}
}
}
// Signal k-split successor thread_block that we have produced our block-wide
// tile of inclusive partial-sums
k_split.signal();
}
//-------------------------------------------------------------------------
// Tile consumption
//-------------------------------------------------------------------------
/**
* Consume a tile of A and B each
*/
template <bool DoGlobalPrefetch>
inline __device__
void consume_tile()
{
// Unroll BlockDpVectorsK iterations of outer-product accumulations
#pragma unroll
for (int tile_offset_k = 0; tile_offset_k < BlockDpVectorsK; tile_offset_k += 1)
{
// Last strip commits global prefetch for next tile
if ((tile_offset_k == BlockDpVectorsK - 1) && DoGlobalPrefetch)
{
// If not using two pages of scratch tiles, protect the above prefetch loads from the committing writes below
if (!UseDoubleScratchTiles)
__syncthreads();
// If using two pages of scratch tiles, switch to next page before writing
if (UseDoubleScratchTiles)
{
page_idx = (page_idx ? 0 : 1);
}
// Commit global prefetch data to scratch page
loader_a.commit(scratch->pages[page_idx].block_a);
loader_b.commit(scratch->pages[page_idx].block_b);
__syncthreads();
}
// Request local prefetch for next strip
request_local_prefetch(
local_slices_a[(tile_offset_k + 1) % 2],
local_slices_b[(tile_offset_k + 1) % 2],
(tile_offset_k + 1) % BlockDpVectorsK);
// Request global prefetch for next tile on first strip
if ((tile_offset_k == 0) && DoGlobalPrefetch)
{
loader_b.request();
loader_b.next();
loader_a.request();
loader_a.next();
}
// Cast strip-mined loads to contiguous array of dp_vector_t
typedef dp_vector_t thread_tile_a_t[ThreadLdsVectorsA * LdsVectorDpVectorsA];
typedef dp_vector_t thread_tile_b_t[ThreadLdsVectorsB * LdsVectorDpVectorsB];
thread_tile_a_t &thread_tile_a = reinterpret_cast<thread_tile_a_t&>(local_slices_a[(tile_offset_k) % 2]);
thread_tile_b_t &thread_tile_b = reinterpret_cast<thread_tile_b_t&>(local_slices_b[(tile_offset_k) % 2]);
// Accumulate this dp-stripe product
accumulator.multiply_accumulate(thread_tile_a, thread_tile_b);
}
}
//-------------------------------------------------------------------------
// GEMM API
//-------------------------------------------------------------------------
/**
* Compute GEMM
*/
inline __device__
void run()
{
// Quit if the thread block is fully out-of-bounds
if (grid_raster.is_block_oob(dim_m, dim_n))
{
asm volatile("exit;");
}
// Request global prefetch of first tile
loader_a.request();
loader_a.next();
loader_b.request();
loader_b.next();
// Commit global prefetch of first tile to shared memory
loader_a.commit(scratch->pages[page_idx].block_a);
loader_b.commit(scratch->pages[page_idx].block_b);
// Advance to next A,B tiles in K-axis
block_item_coords_k += BlockItemsK;
// Synchronize shared tiles and prepared accumulator
__syncthreads();
// Initialize thread's slice of accumulators
accumulator.init();
// Request first iteration of local prefetch strips
request_local_prefetch(
local_slices_a[0],
local_slices_b[0],
0);
//
// Main loop
//
// Consume tiles in A and B along the K-axis (all but last tile)
#pragma unroll 1
while (block_item_coords_k < block_end_item_k)
{
consume_tile<true>();
// Advance to next A,B tiles in K-axis
block_item_coords_k += BlockItemsK;
}
// Consume last tile
consume_tile<false>();
//
// Epilogue
//
epilogue();
}
};
} // namespace gemm
} // namespace cutlass

View File

@ -1,759 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
/**
* \file
* A block-wide task abstraction for computing device-wide GEMM
*/
#pragma once
// Compiler guard conditional to avoid compilation errors on versions of CUDA that
// do not support the WMMA API.
#if defined (WMMA)
#include <stdint.h>
#include "../util/util.h"
#include "grid_raster.h"
#include "block_loader.h"
#include "block_loader_wmma.h"
#include "wmma_accumulator.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* block_task_wmma_policy
******************************************************************************/
/**
* \brief Parameterizable tuning policy for block-wide WMMA GEMM tasks
*
* Once parameterized, \p block_task_policy provides the member constant
* \p BlockThreads indicating the required thread block size
*/
template <
int _BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int _BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
int _BlockItemsK, ///< Extent of block-wide A|B tiles in value_t along the K-axis
int _WarpItemsY, ///< Height in rows of a Warp tile's accumulators
int _WarpItemsX, ///< Width in columns of a Warp tile's accumulators
int _WmmaItemsY, ///< Height in rows of a discrete WMMA block's accumulators
int _WmmaItemsX, ///< Width in columns of a discrete WMMA block's accumulators
int _WmmaItemsK, ///< Depth of each discrete WMMA block
bool _UseDoubleScratchTiles, ///< Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
grid_raster_strategy::kind_t _RasterStrategy> ///< Strategy for enumerating \p block_task within an input matrix
struct block_task_wmma_policy
{
/// Strategy for enumerating \p block_task within an input matrix
static const grid_raster_strategy::kind_t RasterStrategy = _RasterStrategy;
enum
{
/// Height in rows of a block-wide tile in matrix C
BlockItemsY = _BlockItemsY,
/// Width in columns of a block-wide tile in matrix C
BlockItemsX = _BlockItemsX,
/// Extent of block-wide A|B tiles in value_t along the K-axis
BlockItemsK = _BlockItemsK,
/// Width in columns of a Warp tile's accumulators
WarpItemsX = _WarpItemsX,
/// Height in rows of a Warp tile's accumulators
WarpItemsY = _WarpItemsY,
/// Width in columns of a discrete WMMA block's accumulators
WmmaItemsX = _WmmaItemsX,
/// Height in rows of a discrete WMMA block's accumulators
WmmaItemsY = _WmmaItemsY,
/// Depth of each discrete WMMA block
WmmaItemsK = _WmmaItemsK,
/// Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
UseDoubleScratchTiles = _UseDoubleScratchTiles,
//
// Derived quantities
//
/// Machine warp size
WarpThreads = 32,
/// Number of WMMA operations in the height dimension
WmmaBlocksY = divide_assert<WarpItemsY, WmmaItemsY>::value,
/// Number of WMMA operations in the width dimension
WmmaBlocksX = divide_assert<WarpItemsX, WmmaItemsX>::value,
/// Number of warps in each thread block
BlockWarps = divide_assert<BlockItemsY * BlockItemsX, WarpItemsX * WarpItemsY>::value,
/// Number of threads in each thread block (blockDim.x)
BlockThreads = BlockWarps * WarpThreads,
};
};
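// Example (illustrative values only): with 16x16x16 WMMA fragments
// (_WmmaItemsY = _WmmaItemsX = _WmmaItemsK = 16), a 64x32 (N x M) warp tile
// (_WarpItemsX = 64, _WarpItemsY = 32), and a 128x128 block-wide tile, each
// warp issues WmmaBlocksY x WmmaBlocksX = 2 x 4 WMMA operations per K-step,
// and the block comprises BlockWarps = (128 * 128) / (64 * 32) = 8 warps,
// i.e. BlockThreads = 8 * 32 = 256.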
/******************************************************************************
* block_task_wmma
******************************************************************************/
/**
* \brief A block-wide task abstraction for computing device-wide GEMM
*
* Each thread_block is assigned a unique tile of output matrix C to compute by
* consuming the corresponding stripes of the input matrices A and B.
*/
template <
typename block_task_policy_t, ///< Parameterization of block_task_policy
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
int LdgAlignA, ///< Alignment (in bytes) for A operand
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
int LdgAlignB, ///< Alignment (in bytes) for B operand
typename epilogue_op_t, ///< Epilogue operation to update matrix C
int LdgAlignC, ///< Alignment (in bytes) for C operand
bool AllowRaggedTiles ///< Whether the input matrix's dimensions need not be an even-multiple of the block-wide tile dimensions
>
struct block_task_wmma
{
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
enum
{
/// Number of threads in each thread block (blockDim.x)
BlockThreads = block_task_policy_t::BlockThreads,
/// Extent of block-wide C-tile in accum_t (and A-tiles in value_t) along M-axis (height)
BlockItemsY = block_task_policy_t::BlockItemsY,
/// Extent of block-wide C-tile in accum_t (and B-tiles in value_t) along N-axis (width)
BlockItemsX = block_task_policy_t::BlockItemsX,
/// Extent of block-wide A|B tiles in value_t along the K-axis
BlockItemsK = block_task_policy_t::BlockItemsK,
/// Extent of warp C-tile in accum_t (and A-tiles in value_t) along M-axis (height)
WarpItemsY = block_task_policy_t::WarpItemsY,
/// Extent of warp C-tile in accum_t (and B-tiles in value_t) along N-axis (width)
WarpItemsX = block_task_policy_t::WarpItemsX,
/// Extent of a discrete WMMA block's accumulators along the M-axis (height)
WmmaItemsY = block_task_policy_t::WmmaItemsY,
/// Extent of a discrete WMMA block's accumulators along the N-axis (width)
WmmaItemsX = block_task_policy_t::WmmaItemsX,
/// Depth of each discrete WMMA block along the K-axis
WmmaItemsK = block_task_policy_t::WmmaItemsK,
/// Whether to halve synchronization overhead at the expense of doubled shared memory and addressing overhead
UseDoubleScratchTiles = block_task_policy_t::UseDoubleScratchTiles,
/// Number of threads in warp
WarpThreads = block_task_policy_t::WarpThreads,
/// Number of warps participating
BlockWarps = block_task_policy_t::BlockWarps,
/// Extent of block in warps along M-axis
BlockWarpsY = divide_assert<BlockItemsY, WarpItemsY>::value,
/// Extent of block in warps along N-axis
BlockWarpsX = divide_assert<BlockItemsX, WarpItemsX>::value,
/// Number of MMA unrolls
WmmaUnrollCount = divide_assert<BlockItemsK, WmmaItemsK>::value,
/// True if the A matrix layout is column major (K is the strided dimension)
IsLayoutCongruousA = (TransformA == matrix_transform_t::NonTranspose),
/// True if the B matrix layout is row major (K is the strided dimension)
IsLayoutCongruousB = (TransformB == matrix_transform_t::Transpose),
};
/// WMMA may support unique types for A and B, so plan ahead for this
typedef value_t value_a_t;
/// WMMA may support unique types for A and B, so plan ahead for this
typedef value_t value_b_t;
/// WMMA accumulator type
typedef wmma_accumulator<
WarpItemsY,
WarpItemsX,
WmmaItemsY,
WmmaItemsX,
WmmaItemsK,
value_a_t,
value_b_t,
accum_t,
TransformA,
TransformB>
accumulator_t;
/// Thread block rasterization helper type
typedef grid_raster<
BlockItemsY,
BlockItemsX,
TransformA,
TransformB,
block_task_policy_t::RasterStrategy>
grid_raster_t;
/// Tile loader type for matrix A
typedef block_loader_wmma<
IsLayoutCongruousA,
BlockThreads,
(IsLayoutCongruousA ? BlockItemsY : BlockItemsK),
(IsLayoutCongruousA ? BlockItemsK : BlockItemsY),
value_a_t,
LdgAlignA,
AllowRaggedTiles>
block_loader_a_t;
/// Tile loader type for matrix B
typedef block_loader_wmma<
IsLayoutCongruousB,
BlockThreads,
(IsLayoutCongruousB ? BlockItemsX : BlockItemsK),
(IsLayoutCongruousB ? BlockItemsK : BlockItemsX),
value_b_t,
LdgAlignB,
AllowRaggedTiles>
block_loader_b_t;
/// Type alias for matrix A fragment type
typedef typename accumulator_t::fragment_a_t fragment_a_t;
/// Type alias for matrix B fragment type
typedef typename accumulator_t::fragment_b_t fragment_b_t;
enum
{
/// Number of fragments from A matrix
WmmaBlocksY = accumulator_t::WmmaBlocksY,
/// Number of fragments from B matrix
WmmaBlocksX = accumulator_t::WmmaBlocksX,
/// Number of value_t to pad the outer dimension of the shared A-tile
PadItemsA = 16,
/// Number of value_t to pad the outer dimension of the shared B-tile
PadItemsB = 16,
/// Leading dimension of A matrix tile
LdmSmemA = (IsLayoutCongruousA ? BlockItemsY: BlockItemsK) + PadItemsA,
/// Strided dimension of A matrix tile
StridedSmemA = (IsLayoutCongruousA ? BlockItemsK : BlockItemsY),
/// Leading dimension of B matrix tile
LdmSmemB = (IsLayoutCongruousB ? BlockItemsX : BlockItemsK) + PadItemsB,
/// Strided dimension of B matrix tile
StridedSmemB = (IsLayoutCongruousB ? BlockItemsK : BlockItemsX),
};
/// Shared memory layout for a prefetch page
struct page_storage_t
{
/// Tile of A
value_a_t __align__(16) block_a[StridedSmemA][LdmSmemA];
/// Tile of B
value_b_t __align__(16) block_b[StridedSmemB][LdmSmemB];
};
/// Shared memory layout for scratch storage
struct scratch_storage_t
{
union
{
/// Prefetch pages
uninitialized<page_storage_t> pages[UseDoubleScratchTiles ? 2 : 1];
/// Scratch storage for warps
accum_t epilogue[BlockWarps][WmmaItemsX * WmmaItemsY];
};
};
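// Editorial worked example (not part of the original source): assuming the
// half/float "Large" wmma policy (64x64x32 block tile) with both operands in
// their congruous layouts, the scratch union above occupies roughly:
//
//   LdmSmemA = 64 + 16 = 80, StridedSmemA = 32  ->  block_a = 32*80 halves = 5120 B
//   LdmSmemB = 64 + 16 = 80, StridedSmemB = 32  ->  block_b = 32*80 halves = 5120 B
//   one prefetch page                            ~  10 KB of shared memory
//   epilogue scratch = 2 warps * 16*16 floats    =   2 KB (aliased into the union)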
//-------------------------------------------------------------------------
// Assert assumptions
//-------------------------------------------------------------------------
// Ensure we have at least two unrolled innermost loop iterations (one to prefetch
// the next global tile and then one to prefetch the first strip of it from shared)
static_assert ((BlockItemsK >= 2), "BlockItemsK must be >= 2.");
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Scratch storage reference
scratch_storage_t *scratch;
/// Which page of scratch tiles we're currently reading from
int page_idx;
/// Pointer to matrix C
accum_t *d_c;
/// Epilogue operation applied to update matrix C
epilogue_op_t epilogue_op;
/// Matrix height in rows of trans_op(A) and C
int dim_m;
/// Matrix width in columns of trans_op(B) and C
int dim_n;
/// Control for inter-block k-splitting
k_split_control k_split;
/// Thread block's base value_t coordinates (m, n) in matrix C
grid_raster_t grid_raster;
/// Thread block's current coordinate (k) within A|B matrices
int block_item_coords_k;
/// Thread block's ending coordinate (k) within A|B matrices (one-past)
int block_end_item_k;
/// Warp's coordinates (x, y) in thread block
int2 block_warp_item_coords;
/// A tile loader
block_loader_a_t loader_a;
/// B tile loader
block_loader_b_t loader_b;
/// Thread's active-k/prefetch-k slices from shared A tile
fragment_a_t local_slices_a[2][WmmaBlocksY];
/// Thread's active-k/prefetch-k slices from shared B tile
fragment_b_t local_slices_b[2][WmmaBlocksX];
/// Accumulator tile
accumulator_t accumulator;
//-------------------------------------------------------------------------
// Coordinate system helpers
//-------------------------------------------------------------------------
/// Compute the warp's item-coordinates (x, y) in thread block
inline __device__
int2 warp_item_coords()
{
int warp_id = threadIdx.x / WarpThreads;
return make_int2(
(warp_id / BlockWarpsY) * WarpItemsX,
(warp_id % BlockWarpsY) * WarpItemsY);
}
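// Editorial example (not part of the original source): with the "Large" wmma
// policy, BlockWarpsY = 64/32 = 2 and there are BlockWarps = 2 warps per block,
// so warp_id 0 maps to item coordinates (x, y) = (0, 0) and warp_id 1 maps to
// (0, 32); the warps stack along the M-axis of the block tile.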
/// Compute the thread block's base item-coordinates in matrix A
inline __device__
int2 a_block_item_coords()
{
if (TransformA == matrix_transform_t::NonTranspose)
{
return make_int2(grid_raster.block_item_coords.y, block_item_coords_k);
}
else
{
return make_int2(block_item_coords_k, grid_raster.block_item_coords.y);
}
}
/// Compute the thread block's base item-coordinates in matrix B
inline __device__
int2 b_block_item_coords()
{
if (TransformB == matrix_transform_t::Transpose)
{
return make_int2(grid_raster.block_item_coords.x, block_item_coords_k);
}
else
{
return make_int2(block_item_coords_k, grid_raster.block_item_coords.x);
}
}
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
block_task_wmma(
scratch_storage_t *scratch,
value_t *d_a,
value_t *d_b,
accum_t *d_c,
epilogue_op_t epilogue_op,
int dim_m,
int dim_n,
int dim_k,
k_split_control k_split)
:
scratch(scratch),
page_idx(0),
d_c(d_c),
epilogue_op(epilogue_op),
dim_m(dim_m),
dim_n(dim_n),
k_split(k_split),
block_item_coords_k(k_split.block_begin_item_k()),
block_end_item_k(k_split.block_end_item_k(dim_k)),
block_warp_item_coords(warp_item_coords()),
loader_a(
reinterpret_cast<value_a_t const *>(d_a),
(IsLayoutCongruousA ? dim_m : block_end_item_k),
(IsLayoutCongruousA ? 0 : block_item_coords_k),
(IsLayoutCongruousA ? block_end_item_k : dim_m),
(IsLayoutCongruousA ? dim_m : dim_k),
(IsLayoutCongruousA ? block_item_coords_k : 0),
a_block_item_coords()),
loader_b(
reinterpret_cast<value_b_t const *>(d_b),
(IsLayoutCongruousB ? dim_n : block_end_item_k),
(IsLayoutCongruousB ? 0 : block_item_coords_k),
(IsLayoutCongruousB ? block_end_item_k : dim_n),
(IsLayoutCongruousB ? dim_n : dim_k),
(IsLayoutCongruousB ? block_item_coords_k : 0),
b_block_item_coords())
{}
//-------------------------------------------------------------------------
// Prefetching utility methods
//-------------------------------------------------------------------------
/**
* Request the calling thread's slices of the shared tiles at depth \p tile_offset_k
*/
inline __device__ void request_local_prefetch(
fragment_a_t local_slices_a[WmmaBlocksY], ///< Slice from A
fragment_b_t local_slices_b[WmmaBlocksX], ///< Slice from B
int tile_offset_k)
{
value_a_t const *smem_A_base = &scratch->pages[page_idx].alias().block_a[0][0];
value_b_t const *smem_B_base = &scratch->pages[page_idx].alias().block_b[0][0];
int constexpr kstride_a = (IsLayoutCongruousA ? LdmSmemA : 1);
int constexpr lstride_a = (IsLayoutCongruousA ? 1 : LdmSmemA);
int constexpr kstride_b = (IsLayoutCongruousB ? LdmSmemB : 1);
int constexpr lstride_b = (IsLayoutCongruousB ? 1 : LdmSmemB);
// Load B strip
#pragma unroll
for (int i = 0; i < WmmaBlocksX; ++i)
{
value_b_t const *smem_B_ptr =
&smem_B_base[tile_offset_k * kstride_b + (block_warp_item_coords.x + WmmaItemsX * i) * lstride_b];
nvcuda::wmma::load_matrix_sync(local_slices_b[i], smem_B_ptr, LdmSmemB);
}
// Load A strip
#pragma unroll
for (int i = 0; i < WmmaBlocksY; ++i)
{
value_a_t const *smem_A_ptr =
&smem_A_base[tile_offset_k * kstride_a + (block_warp_item_coords.y + WmmaItemsY * i) * lstride_a];
nvcuda::wmma::load_matrix_sync(local_slices_a[i], smem_A_ptr, LdmSmemA);
}
}
//-------------------------------------------------------------------------
// Epilogue
//-------------------------------------------------------------------------
/**
* Performs the GEMM epilogue:
* - Applies the scalar multipliers and addends to the accumulators
* - Write the result to the output matrix
*/
inline __device__ void epilogue()
{
// Wait for predecessor thread block(s) to produce partial-sums
k_split.wait();
// Configure epilogue as to whether the thread block is a secondary
// accumulator in an inter-block k-splitting scheme
if (k_split.is_secondary_accumulator())
epilogue_op.set_secondary_accumulator();
// Whether or not the addend from C needs loading
bool must_init_addend = epilogue_op.must_init_addend();
int warp_base_x = grid_raster.block_item_coords.x + block_warp_item_coords.x;
int warp_base_y = grid_raster.block_item_coords.y + block_warp_item_coords.y;
int constexpr SmemStride = WmmaItemsY;
int warp_id = threadIdx.x / 32;
// Compute shape of one accumulator read/modify/write operation
int constexpr ItemsY = (WmmaItemsY);
int constexpr ItemsX = (32 / ItemsY);
int constexpr IterationsX = WmmaItemsX / ItemsX;
// Compute a rasterization of warp lanes across the WMMA tile.
int lane_id = (threadIdx.x % 32);
int lane_read_x = (lane_id / ItemsY);
int lane_read_y = (lane_id % ItemsY);
accum_t *smem_scratch = scratch->epilogue[warp_id];
accum_t const *smem_read_ptr = smem_scratch + lane_read_y + lane_read_x * SmemStride;
#pragma unroll
for (int xb = 0; xb < WmmaBlocksX; ++xb)
{
#pragma unroll
for (int yb = 0; yb < WmmaBlocksY; ++yb)
{
// Store accumulator tile to SMEM
nvcuda::wmma::store_matrix_sync(
smem_scratch,
accumulator.accumulators[xb][yb],
SmemStride,
matrix_layout<matrix_transform_t::NonTranspose>::kind);
// Synchronize the thread block so the accumulator tile stored to SMEM is visible before it is read back
__syncthreads();
// Compute lane coordinates so that each thread efficiently accesses SMEM.
int c_x = (warp_base_x + (xb) * WmmaItemsX + lane_read_x);
int c_y = (warp_base_y + (yb) * WmmaItemsY + lane_read_y);
// Compute guard predicate by comparing against problem dimensions.
bool pred = c_y < dim_m;
// Compute output pointer from lane coordinates
int c_index = c_x * dim_m + c_y;
accum_t *c_ptr = reinterpret_cast<accum_t *>(d_c) + c_x * dim_m + c_y;
// Iterate over columns of output tile. Load from SMEM, compute epilogue operation,
// and stream output to global memory
#pragma unroll
for (int item_x = 0; item_x < IterationsX; ++item_x)
{
accum_t accum = smem_read_ptr[item_x * ItemsX * SmemStride];
accum_t c_element = 0;
// Filter against problem dimensions as the warp iterates across the columns of
// output.
pred = (pred && ((c_x + item_x * ItemsX) < dim_n));
if (must_init_addend && pred)
{
// NB: inline PTX to utilize strong operations for inter-block synchronization.
// The following is equivalent to:
//
// c_element = c_ptr[0];
asm volatile ("ld.global.cg.f32 %0, [%1];\n" : "=f"(c_element) : "l"(c_ptr));
}
c_element = epilogue_op(accum, c_element, c_index);
if (pred)
{
// NB: inline PTX to utilize strong operations for inter-block synchronization.
// The following is equivalent to:
//
// c_ptr[0] = c_element;
asm volatile ("st.global.cg.f32 [%0], %1;\n" : : "l"(c_ptr), "f"(c_element));
}
// Increment output pointer
c_ptr += dim_m * ItemsX;
c_index += dim_m * ItemsX;
}
__syncthreads();
}
}
// Signal k-split successor thread_block
k_split.signal();
}
//-------------------------------------------------------------------------
// Tile consumption
//-------------------------------------------------------------------------
/**
* Consume a tile of A and B each
*/
template <bool DoGlobalPrefetch>
inline __device__
void consume_tile()
{
// Request global prefetch for next tile on first strip
if (DoGlobalPrefetch)
{
loader_b.request();
loader_b.next();
loader_a.request();
loader_a.next();
}
// Unroll WmmaUnrollCount iterations of outer-product accumulations
#pragma unroll
for (int iteration = 0; iteration < WmmaUnrollCount; ++iteration)
{
int tile_offset_k = iteration * WmmaItemsK;
// Active load-from-shared index
int active_lds_idx = __NV_STD_MIN(WmmaUnrollCount - 1, (iteration) % 2);
// Next load-from-shared index
int next_lds_idx = __NV_STD_MIN(WmmaUnrollCount - 1, (iteration + 1) % 2);
// The last unrolled iteration commits the global fetches
if ((iteration == WmmaUnrollCount - 1) && DoGlobalPrefetch)
{
// If not using two pages of scratch tiles, protect the above prefetch loads from
// the committing writes below
if (!UseDoubleScratchTiles)
{
__syncthreads();
}
else
{
page_idx = (page_idx ? 0 : 1);
}
// Commit global prefetch data to scratch page
loader_a.template commit<LdmSmemA>(&scratch->pages[page_idx].alias().block_a[0][0]);
loader_b.template commit<LdmSmemB>(&scratch->pages[page_idx].alias().block_b[0][0]);
__syncthreads();
}
// Accumulate this dp-stripe product
accumulator.multiply_accumulate(
local_slices_a[active_lds_idx],
local_slices_b[active_lds_idx]);
// Request local prefetch for next strip
request_local_prefetch(
local_slices_a[next_lds_idx],
local_slices_b[next_lds_idx],
(tile_offset_k + WmmaItemsK) % BlockItemsK);
}
}
//-------------------------------------------------------------------------
// GEMM API
//-------------------------------------------------------------------------
/**
* Compute GEMM
*/
inline __device__
void run()
{
// Quit if the thread block is fully out-of-bounds
if (grid_raster.is_block_oob(dim_m, dim_n))
{
asm volatile("exit;");
}
// Request global prefetch of first tile
loader_a.request();
loader_a.next();
loader_b.request();
loader_b.next();
// Commit global prefetch of first tile to shared memory
loader_a.template commit<LdmSmemA>(&scratch->pages[page_idx].alias().block_a[0][0]);
loader_b.template commit<LdmSmemB>(&scratch->pages[page_idx].alias().block_b[0][0]);
// Advance to next A,B tiles in K-axis
block_item_coords_k += BlockItemsK;
// Synchronize shared tiles and prepare the accumulator
__syncthreads();
// Initialize thread's slice of accumulators
accumulator.init();
// Request first iteration of local prefetch strips
request_local_prefetch(
local_slices_a[0],
local_slices_b[0],
0);
//
// Main loop
//
// Consume tiles in A and B along the K-axis (all but last tile)
#pragma unroll 1
while (block_item_coords_k < block_end_item_k)
{
consume_tile<true>();
// Advance to next A,B tiles in K-axis
block_item_coords_k += BlockItemsK;
}
consume_tile<false>();
//
// Epilogue
//
// prevent overwriting SMEM until all warps have finished loading data
__syncthreads();
// store accumulator tile to global memory
epilogue();
}
};
} // namespace gemm
} // namespace cutlass
#endif


@@ -1,534 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* GEMM kernel entrypoint and dispatch stub
*/
#include <stdint.h>
#include "../util/util.h"
#include "block_task.h"
#include "block_task_wmma.h"
#include "grid_raster.h"
#include "dispatch_policies.h"
#include "k_split_control.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* param_pack
******************************************************************************/
/**
* Parameter-pack structure
*
* Kernel launch latency is reduced when kernel arguments are wrapped into
* a single parameter
*/
template <
typename value_t,
typename accum_t,
typename epilogue_op_t>
struct param_pack
{
int m; ///< Height in rows of op(A) and C
int n; ///< Width in columns of op(B) and C
int k; ///< Width in columns of op(A) and height in rows of op(B)
k_split_control k_split; ///< Abstraction for controlling inter-block k-splitting
value_t *d_a; ///< Pointer to matrix A array values
value_t *d_b; ///< Pointer to matrix B array values
accum_t *d_c; ///< Pointer to matrix C array values
epilogue_op_t epilogue_op;
param_pack(
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
k_split_control k_split, ///< Abstraction for controlling inter-block k-splitting
epilogue_op_t op, ///< Epilogue operation to update matrix C
value_t *d_a, ///< Pointer to matrix A array values
value_t *d_b, ///< Pointer to matrix B array values
accum_t *d_c) ///< Pointer to matrix C array values
:
m(m),
n(n),
k(k),
k_split(k_split),
epilogue_op(op),
d_a(d_a),
d_b(d_b),
d_c(d_c)
{}
};
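// Illustrative host-side construction (editorial sketch, not part of the original
// source); the device pointers and the k_split_control instance are assumed to be
// valid and are normally prepared by dispatch() below:
//
//   typedef blas_scaled_epilogue<float, float, float> epilogue_t;
//   epilogue_t op(1.0f, 0.0f);                       // C = 1.0 * AB + 0.0 * C
//   param_pack<float, float, epilogue_t> pack(
//       1024, 1024, 1024,                            // m, n, k
//       k_split,                                     // inter-block k-split control
//       op, d_a, d_b, d_c);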
/******************************************************************************
* Conditionally select the appropriate GEMM threadblock task
******************************************************************************/
/// Conditional selection for block task
template <
math_operation_class_t math_op, ///<
typename block_task_policy_t, ///< Parameterization of block_task_policy
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
int LdgAlignA, ///< Alignment (in bytes) for A operand
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
int LdgAlignB, ///< Alignment (in bytes) for B operand
typename epilogue_op_t, ///< Epilogue operation applied to GEMM
int LdgAlignC, ///< Alignment (in bytes) for C operand
bool AllowRaggedTiles ///< Whether GEMM supports matrix sizes other than multiple of BlockItems{XY}
>
struct gemm_block_task;
/// Scalar math operations
template <
typename block_task_policy_t, ///< Parameterization of block_task_policy
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
int LdgAlignA, ///< Alignment (in bytes) for A operand
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
int LdgAlignB, ///< Alignment (in bytes) for B operand
typename epilogue_op_t, ///< Epilogue operation applied to GEMM
int LdgAlignC, ///< Alignment (in bytes) for C operand
bool AllowRaggedTiles ///< Whether GEMM supports matrix sizes other than multiple of BlockItems{XY}
>
struct gemm_block_task<
math_operation_class_t::scalar,
block_task_policy_t,
value_t,
accum_t,
TransformA,
LdgAlignA,
TransformB,
LdgAlignB,
epilogue_op_t,
LdgAlignC,
AllowRaggedTiles
>
{
// Parameterize task type
typedef block_task<
block_task_policy_t,
value_t,
accum_t,
TransformA,
LdgAlignA,
TransformB,
LdgAlignB,
epilogue_op_t,
LdgAlignC,
AllowRaggedTiles> type;
};
/// Matrix math operations
template <
typename block_task_policy_t, ///< Parameterization of block_task_policy
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
int LdgAlignA, ///< Alignment (in bytes) for A operand
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
int LdgAlignB, ///< Alignment (in bytes) for B operand
typename epilogue_op_t, ///< Epilogue operation applied to GEMM
int LdgAlignC, ///< Alignment (in bytes) for C operand
bool AllowRaggedTiles ///< Whether GEMM supports matrix sizes other than multiple of BlockItems{XY}
>
struct gemm_block_task<
math_operation_class_t::matrix,
block_task_policy_t,
value_t,
accum_t,
TransformA,
LdgAlignA,
TransformB,
LdgAlignB,
epilogue_op_t,
LdgAlignC,
AllowRaggedTiles>
{
#if defined(WMMA) // conditional compilation with WMMA headers
// Parameterize task type
typedef block_task_wmma<
block_task_policy_t,
value_t,
accum_t,
TransformA,
LdgAlignA,
TransformB,
LdgAlignB,
epilogue_op_t,
LdgAlignC,
AllowRaggedTiles> type;
#endif
};
/******************************************************************************
* GEMM kernel entrypoint
******************************************************************************/
/**
* GEMM kernel
*
* NB: It is unclear why NVVM processes "__launch_bounds__" itself instead of simply
* passing it along to PTXAS, but this currently results in less optimal codegen
*/
template <
math_operation_class_t math_op, ///< Indicates which class of math operation to select
typename block_task_policy_t, ///< Parameterization of block_task_policy
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
int LdgAlignA, ///< Alignment of A matrix elements in bytes
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
int LdgAlignB, ///< Alignment of B matrix elements in bytes
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
typename epilogue_op_t, ///< Epilogue operation applied to update matrix C
int LdgAlignC, ///< Alignment of C elements in bytes
bool AllowRaggedTiles> ///< Boolean to indicate whether AllowRaggedTiles handling is enabled
__global__ void kernel(param_pack<value_t, accum_t, epilogue_op_t> pack)
{
// Parameterize task type
typedef typename gemm_block_task<
math_op,
block_task_policy_t,
value_t,
accum_t,
TransformA,
LdgAlignA,
TransformB,
LdgAlignB,
epilogue_op_t,
LdgAlignC,
AllowRaggedTiles>::type block_task_t;
// Declare statically-allocated shared storage
__shared__ typename block_task_t::scratch_storage_t smem;
// Construct and run the task
block_task_t(
&smem,
pack.d_a,
pack.d_b,
pack.d_c,
pack.epilogue_op,
pack.m,
pack.n,
pack.k,
pack.k_split).run();
}
/******************************************************************************
* Launch configuration description returned to the caller
******************************************************************************/
/// Return details about the launch configuration to the caller
struct launch_configuration
{
//
// Data members
//
/// cudaError_t resulting from grid launch
cudaError_t result;
/// Extent of a thread block's partition along the GEMM K-axis
int split_k;
/// Kernel grid extents in thread blocks
dim3 grid;
/// Thread block extents in threads
dim3 block;
//
// Methods
//
/// Constructor
launch_configuration():
result(cudaSuccess),
split_k(0),
grid(0, 0, 0),
block(0, 0, 0) {
}
/// Conversion from cudaError_t
launch_configuration(cudaError_t result):
result(result),
split_k(1),
grid(0, 0, 0),
block(0, 0, 0) {
}
/// Launch configuration for Cutlass kernels
launch_configuration(
cudaError_t result,
int split_k,
dim3 grid,
dim3 block
):
result(result),
split_k(split_k),
grid(grid),
block(block) {
}
};
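// Illustrative usage (editorial sketch, not part of the original source): callers
// typically check the result member before trusting the grid/block extents, e.g.
//
//   launch_configuration config = device_gemm< ... >(m, n, k, op, d_a, d_b, d_c);
//   if (config.result != cudaSuccess)
//   {
//       printf("GEMM launch failed: %s\n", cudaGetErrorString(config.result));
//   }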
/******************************************************************************
* Dispatch stub
******************************************************************************/
/**
* GEMM dispatch stub
*
* This function also serves as the autotuning entrypoint to evaluate different
* tuning parameterizations of kernel.
*/
template <
math_operation_class_t math_op, ///< Indicates which class of math operation to select
typename block_task_policy_t, ///< Parameterization of block_task_policy
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
int LdgAlignA, ///< Alignment of A matrix elements in bytes
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
int LdgAlignB, ///< Alignment of B matrix elements in bytes
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
typename epilogue_op_t, ///< Epilogue operation
int LdgAlignC, ///< Alignment of C matrix elements in bytes
bool AllowRaggedTiles, ///< Boolean to indicate whether AllowRaggedTiles handling is enabled
typename kernel_ptr_t> ///< GEMM kernel function pointer type
launch_configuration dispatch(
kernel_ptr_t kernel_ptr, ///< GEMM kernel function pointer
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
epilogue_op_t epilogue_op, ///< Epilogue operation to update matrix C
value_t *d_a, ///< Device pointer to matrix A array values
value_t *d_b, ///< Device pointer to matrix B array values
accum_t *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = true) ///< Whether or not to synchronize the stream after every kernel launch
/// to check for errors. Also causes launch configurations to be printed
/// to the console if DEBUG is defined. Default is \p true.
{
// Thread block rasterization type
typedef grid_raster<
block_task_policy_t::BlockItemsY,
block_task_policy_t::BlockItemsX,
TransformA,
TransformB,
block_task_policy_t::RasterStrategy>
grid_raster_t;
launch_configuration config;
// Compute block dims
config.block = dim3(block_task_policy_t::BlockThreads);
// Compute shared memory
int dynamic_smem_bytes = 0;
// Compute occupancy
int max_sm_occupancy;
if (CUDA_PERROR_DEBUG(config.result = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_sm_occupancy,
kernel_ptr,
config.block.x * config.block.y,
dynamic_smem_bytes)))
{
return config;
}
// Compute grid extents
config.grid = grid_raster_t::grid_dims(m, n);
// Get SM count
int sm_count;
if (CUDA_PERROR_DEBUG(config.result = get_sm_count(sm_count)))
return config;
// Get k-split flag storage (TODO: make a pool)
int *d_flags;
if (CUDA_PERROR_DEBUG(config.result = cudaGetSymbolAddress((void**) &d_flags, d_flags_split_k)))
return config;
// Construct k-split coordinator
k_split_control k_split(
d_flags,
sm_count,
max_sm_occupancy,
k,
block_task_policy_t::BlockItemsK,
config.block,
config.grid); // in,out
config.split_k = k_split.split_k;
// Log kernel configuration
if (debug_synchronous)
{
// Compute tiling efficiency
float block_tiling_efficiency = float(block_task_policy_t::BlockItemsY * block_task_policy_t::BlockItemsX) /
float(block_task_policy_t::BlockItemsY + block_task_policy_t::BlockItemsX);
float tiling_efficiency = block_tiling_efficiency;
float wave_efficiency = k_split.get_wave_efficiency(
sm_count, max_sm_occupancy, config.block, config.grid);
CUDA_LOG_DEBUG("Final wave_efficiency %.4f, tiling_efficiency %.4f\n",
wave_efficiency, tiling_efficiency);
CUDA_LOG_DEBUG("Invoking kernel<<<(%d, %d, %d), (%d.y,%d.x), %d, %lld>>>(), %d SM occupancy, %d split_k\n",
config.grid.x, config.grid.y, config.grid.z,
config.block.y, config.block.x,
dynamic_smem_bytes,
(long long) stream,
max_sm_occupancy,
k_split.split_k);
}
// Construct parameter-pack
param_pack<value_t, accum_t, epilogue_op_t> pack(
m,
n,
k,
k_split,
epilogue_op,
d_a,
d_b,
d_c);
// Prepare k-split coordinator
if (CUDA_PERROR_DEBUG(config.result = k_split.prepare(stream, debug_synchronous)))
{
return config;
}
// Invoke kernel
kernel_ptr<<< config.grid, config.block, dynamic_smem_bytes, stream >>>(pack);
// Check for failure to launch
if (CUDA_PERROR_DEBUG(config.result = cudaPeekAtLastError()))
return config;
// Sync the stream if specified to flush runtime errors
if (debug_synchronous && (CUDA_PERROR_DEBUG(config.result = cudaStreamSynchronize(stream))))
return config;
return config;
}
/******************************************************************************
* GEMM
******************************************************************************/
/**
* Computes gemm on device matrices
*/
template <
tiling_strategy::kind_t TilingStrategy, ///< Tile-sizing classification
math_operation_class_t math_op, ///< Indicates which class of math operation to select
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
int LdgAlignA, ///< Alignment (in bytes) of A operand
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
int LdgAlignB, ///< Alignment (in bytes) of B operand
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t, ///< Accumulator value type (matrix C and scalars)
typename epilogue_op_t, ///< Epilogue operation to update matrix C
int LdgAlignC> ///< Alignment (in bytes) of C operand
launch_configuration device_gemm(
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
epilogue_op_t epilogue_op, ///< Epilogue operation to update matrix C
value_t *d_a, ///< Device pointer to matrix A array values
value_t *d_b, ///< Device pointer to matrix B array values
accum_t *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to
/// check for errors. Also causes launch configurations to be printed to
/// the console if DEBUG is defined. Default is \p false.
{
// Parameterize a task policy type
// (TODO: use a policy dispatch mechanism based upon SM version)
typedef gemm_policy<value_t, accum_t, TransformA, TransformB, TilingStrategy> block_task_policy_t;
// Ragged-tile check: do the problem dimensions divide evenly by the block tile?
if ((m % block_task_policy_t::BlockItemsY != 0) ||
(n % block_task_policy_t::BlockItemsX != 0) ||
(k % block_task_policy_t::BlockItemsK != 0))
{
// Needs ragged tile-handling
static const bool AllowRaggedTiles = true;
return dispatch<math_op, block_task_policy_t, TransformA, LdgAlignA, TransformB, LdgAlignB, value_t, accum_t, epilogue_op_t, LdgAlignC, AllowRaggedTiles>(
kernel<math_op,block_task_policy_t, TransformA, LdgAlignA, TransformB, LdgAlignB, value_t, accum_t, epilogue_op_t, LdgAlignC, AllowRaggedTiles>,
m,
n,
k,
epilogue_op,
d_a,
d_b,
d_c,
stream,
debug_synchronous);
}
else
{
// Does not need ragged tile-handling
static const bool AllowRaggedTiles = false;
return dispatch<math_op, block_task_policy_t, TransformA, LdgAlignA, TransformB, LdgAlignB, value_t, accum_t, epilogue_op_t, LdgAlignC, AllowRaggedTiles>(
kernel<math_op,block_task_policy_t, TransformA, LdgAlignA, TransformB, LdgAlignB, value_t, accum_t, epilogue_op_t, LdgAlignC, AllowRaggedTiles>,
m,
n,
k,
epilogue_op,
d_a,
d_b,
d_c,
stream,
debug_synchronous);
}
}
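// Illustrative call site (editorial sketch, not part of the original source):
// an SGEMM with column-major A and B and 4-byte (single-element) alignment.
// The pointers d_a/d_b/d_c are assumed to be valid device allocations.
//
//   typedef blas_scaled_epilogue<float, float, float> epilogue_t;
//   launch_configuration config = device_gemm<
//       tiling_strategy::Large,                  // tile-sizing classification
//       math_operation_class_t::scalar,          // scalar (non-WMMA) math operations
//       matrix_transform_t::NonTranspose, 4,     // A layout and LdgAlignA (bytes)
//       matrix_transform_t::NonTranspose, 4,     // B layout and LdgAlignB (bytes)
//       float, float, epilogue_t, 4>(            // value_t, accum_t, epilogue, LdgAlignC
//           m, n, k, epilogue_t(alpha, beta), d_a, d_b, d_c);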
} // namespace gemm
} // namespace cutlass


@@ -1,653 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Architecture-specific GEMM block_task policies
*/
#include <stdint.h>
#include "../util/util.h"
#include "block_task.h"
#include "grid_raster.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* tiling_strategy
******************************************************************************/
/**
* Enumeration of tile-sizing granularities
*/
struct tiling_strategy : printable_t
{
/// \brief Enumerants
enum kind_t
{
Unknown,
Small,
Medium,
Large,
Tall,
Wide,
Huge,
};
/// Enumerant value
kind_t kind;
/// Default constructor
tiling_strategy() : kind(Unknown) {}
/// Copy constructor
tiling_strategy(const kind_t &other_kind) : kind(other_kind) {}
/// Cast to kind_t
operator kind_t() const { return kind; }
/// Returns the instance as a string
__host__ __device__ inline
char const* to_string() const
{
switch (kind)
{
case Small: return "small";
case Medium: return "medium";
case Large: return "large";
case Tall: return "tall";
case Wide: return "wide";
case Huge: return "huge";
case Unknown:
default: return "unknown";
}
}
/// Insert the formatted instance into the output stream
void print(std::ostream& out) const { out << to_string(); }
};
/******************************************************************************
* GEMM
******************************************************************************/
/**
* GEMM task policy specialization for sgemm
*/
template <
typename value_t,
typename accum_t,
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
tiling_strategy::kind_t TilingStrategy> ///< Tile-sizing classification
struct gemm_policy;
/******************************************************************************
* SGEMM
******************************************************************************/
/**
* GEMM task policy specialization for small sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Small> :
block_task_policy<
16, // _BlockItemsY
16, // _BlockItemsX
16, // _BlockItemsK
2, // _ThreadItemsY
2, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
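// Editorial note (not part of the original source): assuming block_task_policy
// assigns each thread a _ThreadItemsY x _ThreadItemsX sub-tile (analogous to the
// warp-based derivation in the wmma policy earlier in this diff), the Small sgemm
// policy implies 16*16 / (2*2) = 64 threads per block cooperating on a 16x16x16 tile.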
/**
* GEMM task policy specialization for medium sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Medium> :
block_task_policy<
32, // _BlockItemsY
32, // _BlockItemsX
8, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for large sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Large> :
block_task_policy<
64, // _BlockItemsY
64, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for tall sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Tall> :
block_task_policy<
128, // _BlockItemsY
32, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for wide sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Wide> :
block_task_policy<
32, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
4, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for huge sgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<float, float, TransformA, TransformB, tiling_strategy::Huge> :
block_task_policy<
128, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/******************************************************************************
* DGEMM
******************************************************************************/
/**
* GEMM task policy specialization for small dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Small> :
block_task_policy<
16, // _BlockItemsY
16, // _BlockItemsX
16, // _BlockItemsK
2, // _ThreadItemsY
2, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for medium dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Medium> :
block_task_policy<
32, // _BlockItemsY
32, // _BlockItemsX
16, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for large dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Large> :
block_task_policy<
64, // _BlockItemsY
64, // _BlockItemsX
8, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for tall dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Tall> :
block_task_policy<
128, // _BlockItemsY
32, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for wide dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Wide> :
block_task_policy<
32, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
4, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for huge dgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<double, double, TransformA, TransformB, tiling_strategy::Huge> :
block_task_policy<
64, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/******************************************************************************
* HGEMM
******************************************************************************/
/**
* GEMM task policy specialization for small hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Small> :
block_task_policy<
32, // _BlockItemsY
32, // _BlockItemsX
8, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for medium hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Medium> :
block_task_policy<
32, // _BlockItemsY
32, // _BlockItemsX
16, // _BlockItemsK
8, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for large hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Large> :
block_task_policy<
64, // _BlockItemsY
64, // _BlockItemsX
8, // _BlockItemsK
16, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for tall hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Tall> :
block_task_policy<
128, // _BlockItemsY
32, // _BlockItemsX
8, // _BlockItemsK
16, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for wide hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Wide> :
block_task_policy<
32, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
8, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for huge hgemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<__half, __half, TransformA, TransformB, tiling_strategy::Huge> :
block_task_policy<
128, // _BlockItemsY
128, // _BlockItemsX
8, // _BlockItemsK
16, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/******************************************************************************
* IGEMM
******************************************************************************/
/**
* GEMM task policy specialization for small igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Small> :
block_task_policy<
16, // _BlockItemsY
32, // _BlockItemsX
32, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for medium igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Medium> :
block_task_policy<
32, // _BlockItemsY
32, // _BlockItemsX
32, // _BlockItemsK
4, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for large igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Large> :
block_task_policy<
64, // _BlockItemsY
64, // _BlockItemsX
32, // _BlockItemsK
8, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for tall igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Tall> :
block_task_policy<
128, // _BlockItemsY
64, // _BlockItemsX
64, // _BlockItemsK
8, // _ThreadItemsY
4, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for wide igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Wide> :
block_task_policy<
64, // _BlockItemsY
128, // _BlockItemsX
64, // _BlockItemsK
4, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/**
* GEMM task policy specialization for huge igemm
*/
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<int8_t, int32_t, TransformA, TransformB, tiling_strategy::Huge> :
block_task_policy<
128, // _BlockItemsY
128, // _BlockItemsX
32, // _BlockItemsK
8, // _ThreadItemsY
8, // _ThreadItemsX
false, // _UseDoubleScratchTiles
grid_raster_strategy::Default> // _RasterStrategy
{};
/******************************************************************************
* WMMA GEMM
******************************************************************************/
// WMMA is a preview feature in CUDA. Conditionally enable wmma_gemm policies.
#if defined(WMMA)
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<half, float, TransformA, TransformB, tiling_strategy::Small> :
gemm::block_task_wmma_policy<
16, // _BlockItemsY
16, // _BlockItemsX
16, // _BlockItemsK
16, // _WarpItemsY
16, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy<half, float, TransformA, TransformB, tiling_strategy::Medium> :
gemm::block_task_wmma_policy<
32, // _BlockItemsY
32, // _BlockItemsX
32, // _BlockItemsK
32, // _WarpItemsY
32, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy< half, float, TransformA, TransformB, tiling_strategy::Large> :
gemm::block_task_wmma_policy<
64, // _BlockItemsY
64, // _BlockItemsX
32, // _BlockItemsK
32, // _WarpItemsY
64, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy< half, float, TransformA, TransformB, tiling_strategy::Tall> :
gemm::block_task_wmma_policy<
128, // _BlockItemsY
64, // _BlockItemsX
64, // _BlockItemsK
32, // _WarpItemsY
64, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy< half, float, TransformA, TransformB, tiling_strategy::Wide> :
gemm::block_task_wmma_policy<
64, // _BlockItemsY
128, // _BlockItemsX
64, // _BlockItemsK
32, // _WarpItemsY
64, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
template <
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB> ///< Transformation op for matrix B
struct gemm_policy< half, float, TransformA, TransformB, tiling_strategy::Huge> :
gemm::block_task_wmma_policy<
128, // _BlockItemsY
128, // _BlockItemsX
64, // _BlockItemsK
32, // _WarpItemsY
64, // _WarpItemsX
16, // _WmmaItemsY
16, // _WmmaItemsX
16, // _WmmaItemsK
false, // _UseDoubleScratchTiles
gemm::grid_raster_strategy::Default> // _RasterStrategy
{};
#endif
} // namespace gemm
} // namespace cutlass


@@ -1,215 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Abstraction for exposing architecture-specific "dot-product-accumulate"
* ISA operations
*/
#include <stdint.h>
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* dp_accummulate
******************************************************************************/
/**
* \brief Abstraction for exposing architecture-specific "dot-product-accumulate"
* ISA operations
*
* Given two K-component vectors a and b having type value_t[K] and an addend c
* of type accum_t, the "dot-product-accumulate" of type accum_t is computed
* as d = a[0]*b[0] + a[1]*b[1] + ... + a[K-1]*b[K-1] + c.
*
* We use the notation "dpK" to connote a K-component dot-product-accumulate.
* For example, "dp1" is a simple multiply-add.
*
* For given pairing of value_t and accum_t types, the corresponding
* dp_accummulate class will:
*
* - Define the member-type dp_vector_t as the appropriate K-component vector
* type needed to leverage architecture-specific "dot-product accumulate"
* ISA operations.
* - Implement the corresponding dot-product operation between two dp_vector_t
* inputs a and b.
*
*/
template <
typename value_t, ///< Component value type
typename accum_t> ///< Accumulator value type
struct dp_accummulate;
/// Default "dp1" dot-product-accumulate traits specialization for value_t->accum_t
template <
typename value_t, ///< Component value type
typename accum_t> ///< Accumulator value type
struct dp_accummulate
{
/// Single-component "dp1" dot-product vector type
typedef value_t dp_vector_t;
/// Compute "dp1" float->float
inline __device__
static void mad(
float &d,
const float &a,
const float &b,
const float &c)
{
asm volatile ( "fma.rn.f32 %0, %1, %2, %3;\n"
: "=f"(d) : "f"(a), "f"(b), "f"(c));
}
/// Compute "dp1" double->double
inline __device__
static void mad(
double &d,
const double &a,
const double &b,
const double &c)
{
asm volatile ("fma.rn.f64 %0, %1, %2, %3;\n"
: "=d"(d) : "d"(a), "d"(b), "d"(c));
}
/// Compute "dp1" int16_t->int32_t
inline __device__
static void mad(
int32_t &d,
const int16_t &a,
const int16_t &b,
const int32_t &c)
{
asm volatile ("mad.wide.s16 %0, %1, %2, %3;\n"
: "=r"(d) : "h"(a), "h"(b), "r"(c));
}
/// Compute "dp1" uint16_t->uint32_t
inline __device__
static void mad(
uint32_t &d,
const uint16_t &a,
const uint16_t &b,
const uint32_t &c)
{
asm volatile ("mad.wide.u16 %0, %1, %2, %3;\n"
: "=r"(d) : "h"(a), "h"(b), "r"(c));
}
/// Compute "dp1" int32_t->int32_t
inline __device__
static void mad(
int32_t &d,
const int32_t &a,
const int32_t &b,
const int32_t &c)
{
asm volatile ("mad.lo.s32 %0, %1, %2, %3;\n"
: "=r"(d) : "r"(a), "r"(b), "r"(c));
}
/// Compute "dp1" uint32_t->uint32_t
inline __device__
static void mad(
uint32_t &d,
const uint32_t &a,
const uint32_t &b,
const uint32_t &c)
{
asm volatile ("mad.lo.u32 %0, %1, %2, %3;\n"
: "=r"(d) : "r"(a), "r"(b), "r"(c));
}
};
#if (CUTLASS_ARCH >= 610) // Specializations only enabled for Pascal SM610+
/// "dp4" dot-product-accumulate traits specialization for int8_t->int32_t
template <>
struct dp_accummulate<
int8_t, ///< Component value type
int32_t> ///< Accumulator value type
{
/// Four-component signed "idp4"
typedef int32_t dp_vector_t;
/// Compute "dp4" int16_t->int32_t
inline __device__
static void mad(
int32_t &d,
const int32_t &a,
const int32_t &b,
const int32_t &c)
{
asm volatile ( "dp4a.s32.s32 %0, %1, %2, %3;\n"
: "=r"(d) : "r"(a), "r"(b), "r"(c));
}
};
/// "dp4" dot-product-accumulate traits specialization for uint8_t->uint32_t
template <>
struct dp_accummulate<
uint8_t, ///< Component value type
uint32_t> ///< Accumulator value type
{
/// Four-component unsigned "idp4"
typedef uint32_t dp_vector_t;
/// Compute "dp4" uint16_t->uint32_t
inline __device__
static void mad(
uint32_t &d,
const uint32_t &a,
const uint32_t &b,
const uint32_t &c)
{
asm volatile ( "dp4a.u32.u32 %0, %1, %2, %3;\n"
: "=r"(d) : "r"(a), "r"(b), "r"(c));
}
};
#endif // Specializations only enabled for Pascal SM610+
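// Illustrative usage (editorial sketch, not part of the original source): with the
// int8_t->int32_t specialization, four signed 8-bit values are packed into each
// 32-bit dp_vector_t lane, and one mad() call issues a single DP4A instruction:
//
//   typedef dp_accummulate<int8_t, int32_t> dp_t;
//   int32_t a = ...;     // four packed int8 values from an A fragment (assumed)
//   int32_t b = ...;     // four packed int8 values from a B fragment (assumed)
//   int32_t acc = 0;
//   dp_t::mad(acc, a, b, acc);   // acc += a0*b0 + a1*b1 + a2*b2 + a3*b3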
} // namespace gemm
} // namespace cutlass


@@ -1,96 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Epilogue operation to compute final output
*/
namespace cutlass {
namespace gemm {
/// Used by GEMM to compute the final result C <= alpha * accumulator + beta * C
template <
typename accum_t,
typename output_t,
typename scalar_t
>
class blas_scaled_epilogue
{
public:
scalar_t alpha;
scalar_t beta;
inline __device__ __host__
blas_scaled_epilogue(
scalar_t alpha,
scalar_t beta)
:
alpha(alpha),
beta(beta)
{}
/// Epilogue operator
inline __device__ __host__
output_t operator()(
accum_t accumulator,
output_t c,
size_t idx) const
{
return output_t(alpha * scalar_t(accumulator) + beta * scalar_t(c));
}
/// Epilogue operator
inline __device__ __host__
output_t operator()(
accum_t accumulator,
size_t idx) const
{
return output_t(alpha * scalar_t(accumulator));
}
/**
* Configure epilogue as to whether the thread block is a secondary
* accumulator in an inter-block k-splitting scheme
*/
inline __device__
void set_secondary_accumulator()
{
beta = scalar_t(1);
}
/// Return whether the beta-scaled addend needs initialization
inline __device__
bool must_init_addend()
{
return (beta != scalar_t(0));
}
};
} // namespace gemm
} // namespace cutlass
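A minimal usage sketch (not part of the original file) of the epilogue functor above, assuming float for all three type parameters:

__device__ float epilogue_example(float accumulator, float c)
{
    // result = 2.0 * accumulator + 0.5 * c
    cutlass::gemm::blas_scaled_epilogue<float, float, float> epilogue(2.0f, 0.5f);
    return epilogue(accumulator, c, /* idx = */ 0);
}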

View File

@ -1,428 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Abstraction for enumerating \p block_task within an input matrix
*/
#include <stdint.h>
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* grid_raster_strategy
******************************************************************************/
/**
* \brief Strategies for enumerating \p block_task within an input matrix
*/
struct grid_raster_strategy
{
/// \brief Enumerants
enum kind_t
{
/**
* Default \p block_task assignment (currently ColumnMajor for N*,
* RowMajor for TT, and TiledCohort for TN)
*/
Default,
/**
* Column-major \p block_task assignment
*/
ColumnMajor,
/**
* Row-major \p block_task assignment
*/
RowMajor,
/**
* Two-level \p block_task assignment (both column-major)
*/
TiledCohort,
};
};
/******************************************************************************
* grid_raster
******************************************************************************/
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
*
* NB: This generic class is not directly constructible. Algorithm-specific
* template specializations will provide the API functionality prescribed here.
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
matrix_transform_t::kind_t TransformB, ///< View transform enumerant for matrix B
grid_raster_strategy::kind_t RasterStrategy> ///< Strategy for enumerating \p block_task within an input matrix
struct grid_raster
{
//-------------------------------------------------------------------------
// Device API
//-------------------------------------------------------------------------
/// Thread block's base item coordinates (x, y) in matrix C
int2 block_item_coords;
/// Constructor
grid_raster();
/// Whether the thread block base coordinates are out-of-bounds for an m*n matrix C
bool is_block_oob(int m, int n);
//-------------------------------------------------------------------------
// Grid launch API
//-------------------------------------------------------------------------
/// Compute the kernel grid extents (in thread blocks) for consuming an m*n matrix C
static dim3 grid_dims(int m, int n);
};
/******************************************************************************
* grid_raster (ColumnMajor specialization)
******************************************************************************/
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (ColumnMajor specialization)
*
* Maps thread blocks in column-major fashion
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
matrix_transform_t::kind_t TransformB> ///< View transform enumerant for matrix B
struct grid_raster<
BlockItemsY,
BlockItemsX,
TransformA,
TransformB,
grid_raster_strategy::ColumnMajor> ///< Strategy for enumerating \p block_task within an input matrix
{
//-------------------------------------------------------------------------
// Device API
//-------------------------------------------------------------------------
/// Thread block's base item coordinates (x, y) in matrix C
int2 block_item_coords;
/// Constructor
inline __device__
grid_raster()
{
// blockDim.x is the fastest changing grid dim on current architectures
block_item_coords = make_int2(
BlockItemsX * blockIdx.y,
BlockItemsY * blockIdx.x);
}
/// Whether the base \p block_item_coords are out-of-bounds for an m*n matrix C
inline __device__
bool is_block_oob(int m, int n)
{
// ColumnMajor never rasterizes fully out-of-bounds thread blocks
return false;
}
//-------------------------------------------------------------------------
// Grid launch API
//-------------------------------------------------------------------------
/// Compute the kernel grid extents (in thread blocks) for consuming an m*n matrix C
inline __host__ __device__
static dim3 grid_dims(int m, int n)
{
// blockDim.x is the fastest changing grid dim on current architectures
return dim3(
(m + BlockItemsY - 1) / BlockItemsY,
(n + BlockItemsX - 1) / BlockItemsX);
}
};
/******************************************************************************
* grid_raster (RowMajor specialization)
******************************************************************************/
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (RowMajor specialization)
*
* Enumerates \p block_task in row-major fashion
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
matrix_transform_t::kind_t TransformB> ///< View transform enumerant for matrix B
struct grid_raster<
BlockItemsY,
BlockItemsX,
TransformA,
TransformB,
grid_raster_strategy::RowMajor> ///< Strategy for enumerating \p block_task within an input matrix
{
//-------------------------------------------------------------------------
// Device API
//-------------------------------------------------------------------------
/// Thread block's base item coordinates (x, y) in matrix C
int2 block_item_coords;
/// Constructor
inline __device__
grid_raster()
{
// blockDim.x is the fastest changing grid dim on current architectures
block_item_coords = make_int2(
BlockItemsX * blockIdx.x,
BlockItemsY * blockIdx.y);
}
/// Whether the base \p block_item_coords are out-of-bounds for an m*n matrix C
inline __device__
bool is_block_oob(int m, int n)
{
// RowMajor never rasterizes fully out-of-bounds thread blocks
return false;
}
//-------------------------------------------------------------------------
// Grid launch API
//-------------------------------------------------------------------------
/// Compute the kernel grid extents (in thread blocks) for consuming an m*n matrix C
inline __host__ __device__
static dim3 grid_dims(int m, int n)
{
// blockDim.x is the fastest changing grid dim on current architectures
return dim3(
(n + BlockItemsX - 1) / BlockItemsX,
(m + BlockItemsY - 1) / BlockItemsY);
}
};
/******************************************************************************
* grid_raster (TiledCohort specialization)
******************************************************************************/
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (TiledCohort specialization)
*
* Enumerates \p block_task in column-major fashion across "cohort" tiles (where
* cohorts are CohortBlocksY high and CohortBlocksX wide), and enumerates cohorts
* across the matrix in column-major fashion.
*
* Grid layout:
* - gridDim.y is the height of the grid in cohorts
* - gridDim.x is the width of the grid in cohorts multiplied by the number of
* thread blocks per cohort
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
matrix_transform_t::kind_t TransformA, ///< View transform enumerant for matrix A
matrix_transform_t::kind_t TransformB> ///< View transform enumerant for matrix B
struct grid_raster<
BlockItemsY,
BlockItemsX,
TransformA,
TransformB,
grid_raster_strategy::TiledCohort> ///< Strategy for enumerating \p block_task within an input matrix
{
enum
{
/// Height in thread blocks of a grid rasterization cohort
CohortBlocksY = 2,
/// Width in thread blocks of a grid rasterization cohort
CohortBlocksX = 2,
/// Number of thread blocks per cohort
BlocksPerCohort = CohortBlocksY * CohortBlocksX,
/// Height in items of a grid rasterization cohort
CohortItemsY = CohortBlocksY * BlockItemsY,
/// Width in items of a grid rasterization cohort
CohortItemsX = CohortBlocksX * BlockItemsX,
};
//-------------------------------------------------------------------------
// Device API
//-------------------------------------------------------------------------
/// Thread block's base item coordinates (x, y) in matrix C
int2 block_item_coords;
/// Constructor
inline __device__
grid_raster()
{
int block_idx_cohort = blockIdx.x % BlocksPerCohort;
int2 cohort_coords_grid = make_int2(
blockIdx.x / BlocksPerCohort,
blockIdx.y);
// Cohort is rastered in column-major order
int2 block_coords_cohort = make_int2(
block_idx_cohort / CohortBlocksY,
block_idx_cohort % CohortBlocksY);
block_item_coords = make_int2(
((cohort_coords_grid.x * CohortBlocksX) + block_coords_cohort.x) * BlockItemsX,
((cohort_coords_grid.y * CohortBlocksY) + block_coords_cohort.y) * BlockItemsY);
}
/// Whether the base \p block_item_coords are out-of-bounds for an m*n matrix C
inline __device__
bool is_block_oob(int m, int n)
{
// Thread blocks within the cohort may be fully out-of-bounds
return (block_item_coords.x >= n) || (block_item_coords.y >= m);
}
//-------------------------------------------------------------------------
// Grid launch API
//-------------------------------------------------------------------------
/// Compute the kernel grid extents (in thread blocks) for consuming an m*n matrix C
inline __host__ __device__
static dim3 grid_dims(int m, int n)
{
// Extents of C matrix in cohorts
int2 grid_cohort_dims = make_int2(
(n + CohortItemsX - 1) / CohortItemsX,
(m + CohortItemsY - 1) / CohortItemsY);
return dim3(
grid_cohort_dims.x * BlocksPerCohort, // gridDim.x is width of grid in cohorts * size of cohort in blocks
grid_cohort_dims.y, // gridDim.y is height of grid in cohorts
1); // gridDim.z is reserved for optional k-splitting
}
};
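// Worked example (illustration only, not part of the original file): with hypothetical
// 64x64 block tiles and blockIdx = (5, 1), the constructor above computes
//   block_idx_cohort    = 5 % 4          = 1
//   cohort_coords_grid  = (5 / 4, 1)     = (1, 1)
//   block_coords_cohort = (1 / 2, 1 % 2) = (0, 1)
//   block_item_coords   = (((1*2) + 0) * 64, ((1*2) + 1) * 64) = (128, 192)
// i.e. the thread block owns the 64x64 tile whose upper-left item is at column 128, row 192.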
/******************************************************************************
* grid_raster (Default specializations)
******************************************************************************/
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (Default N* specialization)
*
* Maps thread blocks in column-major fashion
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX, ///< Width in columns of a block-wide tile in matrix C
matrix_transform_t::kind_t TransformB> ///< View transform enumerant for matrix B
struct grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::NonTranspose, ///< View transform enumerant for matrix A
TransformB,
grid_raster_strategy::Default> ///< Strategy for enumerating \p block_task within an input matrix
:
grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::NonTranspose,
TransformB,
grid_raster_strategy::ColumnMajor>
{};
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (Default TT specialization)
*
* Maps thread blocks in row-major fashion
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX> ///< Width in columns of a block-wide tile in matrix C
struct grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::Transpose, ///< View transform enumerant for matrix A
matrix_transform_t::Transpose, ///< View transform enumerant for matrix B
grid_raster_strategy::Default> ///< Strategy for enumerating \p block_task within an input matrix
:
grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::Transpose,
matrix_transform_t::Transpose,
grid_raster_strategy::RowMajor>
{};
/**
* \brief Abstraction for enumerating \p block_task within an input matrix
* (Default TN specialization)
*
* Maps thread blocks in blocked cohorts
*/
template <
int BlockItemsY, ///< Height in rows of a block-wide tile in matrix C
int BlockItemsX> ///< Width in columns of a block-wide tile in matrix C
struct grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::Transpose, ///< View transform enumerant for matrix A
matrix_transform_t::NonTranspose, ///< View transform enumerant for matrix B
grid_raster_strategy::Default> ///< Strategy for enumerating \p block_task within an input matrix
:
grid_raster<
BlockItemsY,
BlockItemsX,
matrix_transform_t::Transpose,
matrix_transform_t::NonTranspose,
grid_raster_strategy::TiledCohort>
{};
} // namespace gemm
} // namespace cutlass

View File

@ -1,302 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Abstraction for coordinating inter-block k-splitting
*/
#include <stdint.h>
#include "../util/util.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* Storage and initialization
******************************************************************************/
enum
{
NumFlagsSplitK = 4096
};
/**
* Global K-split semaphore flags
*
* TODO: use demand-allocated storage to provide copies for concurrent streams
*/
__device__ int d_flags_split_k[NumFlagsSplitK];
/**
* Preparation kernel for zero-initializing semaphore flags
*/
__global__ void prepare_kernel(int *d_flags_split_k)
{
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
if (tid < NumFlagsSplitK)
d_flags_split_k[tid] = 0;
}
/******************************************************************************
* k_split_control
******************************************************************************/
/**
* \brief Abstraction for coordinating inter-block k-splitting
*/
struct k_split_control
{
/// Extent of a thread block's partition along the GEMM K-axis
int split_k;
/// Whether or not to use a semaphore for inter-block k-splitting.
bool use_semaphore;
/// Pointer to semaphore
int *d_flags;
//-------------------------------------------------------------------------
// Device API
//-------------------------------------------------------------------------
/**
* Return the thread block's starting coordinate (k) within the
* multiplicand matrices
*/
inline __device__
int block_begin_item_k()
{
return blockIdx.z * split_k;
}
/**
* Return the thread block's ending coordinate (k) within the multiplicand
* matrices (one-past)
*/
inline __device__
int block_end_item_k(int dim_k)
{
int next_start_k = block_begin_item_k() + split_k;
return __NV_STD_MIN(next_start_k, dim_k);
}
/**
* Whether the thread block is a secondary accumulator in an inter-block
* k-splitting scheme
*/
inline __device__
bool is_secondary_accumulator()
{
return (blockIdx.z > 0);
}
/**
* Wait for predecessor thread block(s) to produce the exclusive
* partial-sums for this block-wide tile
*/
inline __device__
void wait()
{
// Wait on semaphore
if ((use_semaphore) && (blockIdx.z > 0))
{
if (threadIdx.x == 0)
{
int bid = (blockIdx.y * gridDim.x) + blockIdx.x;
int hash = bid % NumFlagsSplitK;
int found;
int looking = blockIdx.z;
while (true)
{
asm volatile ("ld.global.cg.u32 %0, [%1];\n" : "=r"(found) : "l"(d_flags + hash));
if (found == looking)
break;
/// Fence to keep load from being hoisted from the loop
__syncwarp(0x00000001);
}
}
__syncthreads();
}
}
/**
* Signal the successor thread_block(s) that the inclusive partial-sums
* from this block-wide tile are available
*/
inline __device__
void signal()
{
if (use_semaphore)
{
__syncthreads();
if (threadIdx.x == 0)
{
int bid = (blockIdx.y * gridDim.x) + blockIdx.x;
int hash = bid % NumFlagsSplitK;
int val = blockIdx.z + 1;
asm volatile ("st.global.cg.u32 [%0], %1;\n" : : "l"(d_flags + hash), "r"(val));
}
}
}
//-------------------------------------------------------------------------
// Grid launch API
//-------------------------------------------------------------------------
/**
* Constructor
*/
inline
k_split_control(
int *d_flags,
int sm_count,
int max_sm_occupancy,
int dim_k,
int block_tile_items_k,
dim3 block_dims,
dim3 &grid_dims) ///< [in,out]
:
d_flags(d_flags),
split_k(dim_k)
{
// Compute wave efficiency
float wave_efficiency = get_wave_efficiency(
sm_count,
max_sm_occupancy,
block_dims,
grid_dims);
// Update split-k if wave efficiency is less than some threshold
if (wave_efficiency < 0.9)
{
int num_threadblocks = grid_dims.x * grid_dims.y * grid_dims.z;
// Ideal number of thread blocks in grid
int ideal_threadblocks = lcm(sm_count, num_threadblocks);
// Desired number of partitions to split K-axis into
int num_partitions = ideal_threadblocks / num_threadblocks;
// Compute new k-split share
int new_split_k = (dim_k + num_partitions - 1) / num_partitions;
// Round split_k share to the nearest block_task_policy_t::BlockItemsK
new_split_k = round_nearest(new_split_k, block_tile_items_k);
// Recompute k-splitting factor with new_split_k
num_partitions = (dim_k + new_split_k - 1) / new_split_k;
// Update grid dims and k if we meet the minimum number of iterations worth the overhead of splitting
int min_iterations_k = 8;
if (((new_split_k / block_tile_items_k) > min_iterations_k) && // We're going to go through at least this many k iterations
(sm_count * max_sm_occupancy < NumFlagsSplitK)) // We have enough semaphore flags allocated
{
grid_dims.z = num_partitions;
split_k = new_split_k;
}
}
use_semaphore = (grid_dims.z > 1);
}
/**
* Initializer
*/
cudaError_t prepare(
cudaStream_t stream, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous) ///< Whether or not to synchronize the stream after every kernel launch to check for errors. Also causes launch configurations to be printed to the console if DEBUG is defined. Default is \p false.
{
cudaError error = cudaSuccess;
if (use_semaphore)
{
int block_threads = 128;
int grid_dims = (NumFlagsSplitK + block_threads - 1) / block_threads;
prepare_kernel<<<grid_dims, block_threads, 0, stream>>>(d_flags);
// Check for failure to launch
if (CUDA_PERROR_DEBUG(error = cudaPeekAtLastError()))
return error;
// Sync the stream if specified to flush runtime errors
if (debug_synchronous && (CUDA_PERROR_DEBUG(error = cudaStreamSynchronize(stream))))
return error;
}
return error;
}
/**
* Compute the efficiency of dispatch wave quantization
*/
float get_wave_efficiency(
int sm_count,
int max_sm_occupancy,
dim3 block_dims,
dim3 grid_dims)
{
// Heuristic for how many warps are needed to saturate an SM for a given
// multiply-accumulate genre. (NB: We could make this more rigorous by
// specializing on data types and SM width)
int saturating_warps_per_sm = 16;
int num_threadblocks = grid_dims.x * grid_dims.y * grid_dims.z;
int threads_per_threadblock = block_dims.x * block_dims.y;
int warps_per_threadblock = threads_per_threadblock / 32;
int saturating_threadblocks_per_sm = (saturating_warps_per_sm + warps_per_threadblock - 1) / warps_per_threadblock;
int saturating_residency = sm_count * saturating_threadblocks_per_sm;
int full_waves = num_threadblocks / saturating_residency;
int remainder_threadblocks = num_threadblocks % saturating_residency;
int total_waves = (remainder_threadblocks == 0) ? full_waves : full_waves + 1;
float last_wave_saturating_efficiency = float(remainder_threadblocks) / saturating_residency;
return (float(full_waves) + last_wave_saturating_efficiency) / total_waves;
}
};
} // namespace gemm
} // namespace cutlass
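As a rough illustration of the sizing arithmetic in the constructor above (hypothetical numbers, and assuming round_nearest rounds up to the next multiple; this snippet is not from the original file):

// 20 SMs, a 16-block grid, dim_k = 4096, block-wide K tile of 8 items
int ideal_threadblocks = 80;                          // lcm(20, 16)
int num_partitions     = 80 / 16;                     // 5 partitions of the K axis
int new_split_k        = (4096 + 5 - 1) / 5;          // 820 items per partition
new_split_k            = ((820 + 8 - 1) / 8) * 8;     // 824, rounded to the K tile
num_partitions         = (4096 + 824 - 1) / 824;      // still 5
// With 824 / 8 = 103 > 8 K-iterations per partition, grid_dims.z becomes 5 and split_k = 824.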

View File

@ -1,461 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Thread-level multiply-accumulate abstraction
*/
#include "../util/util.h"
#include "dp_accummulate.h"
namespace cutlass {
namespace gemm {
/******************************************************************************
* thread_accumulator (generic specialization)
******************************************************************************/
/**
* \brief Thread-level multiply-accumulate abstraction (generic specialization)
*
* The thread_accumulator class maintains a MxN tile of accumulators in
* registers to which MxNxK matrix products of two thread tiles A (MxK)
* and B (KxN) can be added, where:
* M = ThreadItemsY
* N = ThreadItemsX
* K = sizeof(dp_vector_t) / sizeof(value_t).
*
* In order to leverage architecture-specific "dot-product accumulate" ISA
* operations, K is dictated by the thread_accumulator class in the form of
* the member-type dp_vector_t, which defines a K-component vector of value_t.
* The multiplicand inputs A and B are provided as arrays of dp_vector_t having
* extents ThreadItemsY and ThreadItemsX, respectively. (In the single
* component "dp1" scenario where dp_vector_t == value_t and thus K == 1, the
* multiplication is simply the outer product of two vectors.)
*
* The accumulators are zero-initialized in a two-phase process (construction +
* initialization) that requires shared storage in the form of the member-type
* scratch_storage_t during construction. (A single scratch_storage_t instance
* can be uniformly referenced across all threads in the block during
* construction *if* the block is synchronized between construction and
* initialization.)
*
* NB: This generic class is not directly constructible. Architecture- and
* algorithm-specific template specializations will provide the API
* functionality prescribed here.
*/
template <
int ThreadItemsY, ///< Height of thread tile in accum_t
int ThreadItemsX, ///< Width of thread tile in accum_t
typename value_t, ///< Multiplicand value type
typename accum_t, ///< Accumulator value type
int ACCUM_BYTES = ///< Size in bytes of accum_t
sizeof(accum_t),
arch_family_t::kind_t ArchFamily = ///< Architectural family enumerant
CUTLASS_ARCH_FAMILY>
struct thread_accumulator
{
protected:
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
/// Specialized dot-product traits type
typedef dp_accummulate<value_t, accum_t> dp_accum_traits_t;
public:
//-------------------------------------------------------------------------
// Member types
//-------------------------------------------------------------------------
/// Dot-product vector type
typedef typename dp_accum_traits_t::dp_vector_t dp_vector_t;
/// Scratch storage layout
struct scratch_storage_t {};
protected:
//-------------------------------------------------------------------------
// Data members
//-------------------------------------------------------------------------
/// Thread's tile of accumulators
accum_t accumulators[ThreadItemsY][ThreadItemsX];
//-------------------------------------------------------------------------
// Utility methods
//-------------------------------------------------------------------------
/**
* Compute a multiply-add at accumulator coordinates (x, y)
*/
inline __device__
void mad_xy(
dp_vector_t (&tile_a)[ThreadItemsY],
dp_vector_t (&tile_b)[ThreadItemsX],
int x,
int y)
{
dp_accum_traits_t::mad(
accumulators[y][x],
tile_a[y],
tile_b[x],
accumulators[y][x]);
}
public:
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
thread_accumulator(
scratch_storage_t &scratch)
{}
//-------------------------------------------------------------------------
// Accumulator API
//-------------------------------------------------------------------------
/**
* \brief Zero-initialize thread accumulators.
*
* If a common reference to a single block-wide shared instance of scratch_storage_t
* is used during construction, the block must be synchronized after construction
* but prior to the invocation of init().
*/
inline __device__
void init()
{
#pragma unroll
for (int y = 0; y < ThreadItemsY; ++y) {
#pragma unroll
for (int x = 0; x < ThreadItemsX; ++x)
{
accumulators[y][x] = accum_t(0);
}
}
}
/**
* Retrieve the accumulator at thread tile coordinates (x, y)
*/
inline __device__
accum_t get(int x, int y)
{
// Accumulators are row-major
return accumulators[y][x];
}
/**
* \brief Compute the product of tile_a and tile_b and add the result to
* the tile of accumulators.
*/
inline __device__
void multiply_accumulate(
dp_vector_t (&tile_a)[ThreadItemsY],
dp_vector_t (&tile_b)[ThreadItemsX])
{
// Simply traverse the accumulator tile in row-major order
#pragma unroll
for (int y = 0; y < ThreadItemsY; ++y)
{
#pragma unroll
for (int x = 0; x < ThreadItemsX; ++x)
{
mad_xy(tile_a, tile_b, x, y);
}
}
}
};
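// Illustrative usage sketch (not part of the original file): the two-phase
// construction + initialization protocol described above, for a hypothetical
// 4x4 thread tile of float accumulators.
//
//   typedef thread_accumulator<4, 4, float, float> my_accumulator_t;
//   __shared__ my_accumulator_t::scratch_storage_t scratch;
//
//   my_accumulator_t acc(scratch);   // phase 1: construction (may reference shared scratch)
//   __syncthreads();                 // required if scratch is shared block-wide
//   acc.init();                      // phase 2: zero the 4x4 accumulator tile
//
//   my_accumulator_t::dp_vector_t tile_a[4], tile_b[4];
//   // ... load this thread's slices of A and B ...
//   acc.multiply_accumulate(tile_a, tile_b);   // accumulators[y][x] += tile_a[y] * tile_b[x]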
/******************************************************************************
* thread_accumulator (__half->__half specialization)
******************************************************************************/
/**
* \brief Thread-level multiply-accumulate abstraction (__half->__half specialization)
*
* NB: Because we use the 2-item SIMD instruction HFMA2:
* - ThreadItemsX must be a multiple of 2
* - ThreadItemsY must be a multiple of 2
*
*/
template <
int ThreadItemsY, ///< Height in rows of thread tile in C
int ThreadItemsX, ///< Width in columns of thread tile in C
arch_family_t::kind_t ArchFamily> ///< Architectural family enumerant
struct thread_accumulator<
ThreadItemsY,
ThreadItemsX,
__half, ///< Multiplicand value type (matrices A and B)
__half, ///< Accumulator value type (matrix C and scalars)
2, ///< Size in bytes of accum_t
ArchFamily>
{
protected:
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
/// Constants
enum
{
/// Height of thread tile in column-major uint32_t SIMD pairs along Y dimension
ThreadTilePairsY = divide_assert<ThreadItemsY, 2>::value,
/// Width of thread tile in columns (SIMD pairing occurs along the Y dimension only)
ThreadTilePairsX = ThreadItemsX,
/// Number of SIMD pairs in thread's slice of block-wide tile multiplicand A
ThreadPairsA = divide_assert<ThreadItemsY, 2>::value,
/// Number of SIMD pairs in thread's slice of block-wide tile multiplicand B
ThreadPairsB = divide_assert<ThreadItemsX, 2>::value,
};
public:
//-------------------------------------------------------------------------
// Member types
//-------------------------------------------------------------------------
/// Dot-product vector type
typedef __half dp_vector_t;
/// Scratch storage layout
struct scratch_storage_t {};
private:
//-------------------------------------------------------------------------
// Members
//-------------------------------------------------------------------------
/// Thread's tile of C accumulator pairs (the uint32_t SIMD pairs are
/// column-major, the 2D tile layout is also column-major)
uint32_t accumulator_pairs[ThreadTilePairsX][ThreadTilePairsY];
//-------------------------------------------------------------------------
// Utility methods
//-------------------------------------------------------------------------
/**
* Compute an HFMA2 MAD
*/
inline __device__ void mad(
uint32_t &d,
const uint32_t &a,
const uint32_t &b,
const uint32_t &c)
{
asm volatile ("fma.rn.f16x2 %0, %1, %2, %3;\n"
: "=r"(d) : "r"(a), "r"(b), "r"(c));
}
/**
* Compute an HFMA2 MAD with replicated b.lo:
* d{hi} = a{hi} * b{lo} + c{hi};
* d{lo} = a{lo} * b{lo} + c{lo};
*/
inline __device__ void mad_replicate_low(
uint32_t &d,
const uint32_t &a,
const uint32_t &b,
const uint32_t &c)
{
// Replicate low halves of b
uint32_t replicate;
asm volatile (
"{"
" .reg .b16 b_low,b_high;\n"
" mov.b32 {b_low,b_high}, %1;\n"
" mov.b32 %0, {b_low,b_low};\n"
"}" : "=r"(replicate) : "r"(b));
mad(d, a, replicate, c);
}
/**
* Compute an HFMA2 MAD with replicated b.hi:
* d{hi} = a{hi} * b{hi} + c{hi};
* d{lo} = a{lo} * b{hi} + c{lo};
*/
inline __device__ void mad_replicate_high(
uint32_t &d,
const uint32_t &a,
const uint32_t &b,
const uint32_t &c)
{
// Replicate high halves of b
uint32_t replicate;
asm volatile (
"{"
" .reg .b16 b_low,b_high;\n"
" mov.b32 {b_low,b_high}, %1;\n"
" mov.b32 %0, {b_high,b_high};\n"
"}" : "=r"(replicate) : "r"(b));
mad(d, a, replicate, c);
}
/**
* Compute a multiply-add at accumulator SIMD-pair coordinates (pair_x, pair_y)
*/
inline __device__
void mad_xy_even(
uint32_t (&pairs_tile_a)[ThreadPairsA],
uint32_t (&pairs_tile_b)[ThreadPairsB],
int pair_x,
int pair_y)
{
// Even column: use low half of the b pair
mad_replicate_low(
accumulator_pairs[pair_x][pair_y],
pairs_tile_a[pair_y],
pairs_tile_b[pair_x / 2],
accumulator_pairs[pair_x][pair_y]);
}
/**
* Compute a multiply-add at accumulator SIMD-pair coordinates (pair_x, pair_y)
*/
inline __device__
void mad_xy_odd(
uint32_t (&pairs_tile_a)[ThreadPairsA],
uint32_t (&pairs_tile_b)[ThreadPairsB],
int pair_x,
int pair_y)
{
// Odd column: use high half of the b pair
mad_replicate_high(
accumulator_pairs[pair_x][pair_y],
pairs_tile_a[pair_y],
pairs_tile_b[pair_x / 2],
accumulator_pairs[pair_x][pair_y]);
}
public:
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor
inline __device__
thread_accumulator(
scratch_storage_t &scratch)
{}
//-------------------------------------------------------------------------
// Accumulator API
//-------------------------------------------------------------------------
/**
* Zero-initialize thread accumulators.
*/
inline __device__
void init()
{
#pragma unroll
for (int y = 0; y < ThreadTilePairsY; ++y)
{
#pragma unroll
for (int x = 0; x < ThreadTilePairsX; ++x)
{
accumulator_pairs[x][y] = 0;
}
}
}
/**
* Retrieve the accumulator at thread tile coordinates (x, y)
*/
inline __device__
__half get(int x, int y)
{
// SIMD pairs are column-major
uint32_t pair = accumulator_pairs[x][y / 2];
return reinterpret_cast<__half (&)[2]>(pair)[y % 2];
}
/**
* \brief Compute the product of pairs_tile_a and pairs_tile_b and add the result to
* the tile of accumulators.
*/
inline __device__
void multiply_accumulate(
dp_vector_t (&tile_a)[ThreadItemsY],
dp_vector_t (&tile_b)[ThreadItemsX])
{
typedef uint32_t pairs_tile_a_t[ThreadPairsA];
typedef uint32_t pairs_tile_b_t[ThreadPairsB];
// Alias slices in pairs
pairs_tile_a_t &pairs_tile_a = reinterpret_cast<pairs_tile_a_t&>(tile_a);
pairs_tile_b_t &pairs_tile_b = reinterpret_cast<pairs_tile_b_t&>(tile_b);
// Simply traverse the accumulator tile in column-major order
#pragma unroll
for (int x = 0; x < ThreadTilePairsX; ++x)
{
#pragma unroll
for (int y = 0; y < ThreadTilePairsY; ++y)
{
// Even columns use the low half of the b pair; odd columns use the high half
if (x % 2 == 0)
mad_xy_even(pairs_tile_a, pairs_tile_b, x, y);
else
mad_xy_odd(pairs_tile_a, pairs_tile_b, x, y);
}
}
}
};
} // namespace gemm
} // namespace cutlass
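For reference, the two replicate-and-multiply helpers above are equivalent (on sm_53+) to the following cuda_fp16 intrinsics; this sketch is for illustration only and is not part of the original file:

#include <cuda_fp16.h>

// d.lo = a.lo * b.lo + c.lo ;  d.hi = a.hi * b.lo + c.hi
__device__ __half2 mad_replicate_low_example(__half2 a, __half2 b, __half2 c)
{
    return __hfma2(a, __low2half2(b), c);
}

// d.lo = a.lo * b.hi + c.lo ;  d.hi = a.hi * b.hi + c.hi
__device__ __half2 mad_replicate_high_example(__half2 a, __half2 b, __half2 c)
{
    return __hfma2(a, __high2half2(b), c);
}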

View File

@ -1,207 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Thread-level multiply-accumulate abstraction
* (Volta 4B accum_t specialization)
*/
#include <mma.h>
#include "../util/util.h"
#include "dp_accummulate.h"
namespace cutlass {
namespace gemm {
/*!
*\brief matrix_layout to perform conversion between Cutlass types and WMMA types
*/
template <matrix_transform_t::kind_t>
struct matrix_layout;
/// Maps matrix_transform_t::NonTranspose to nvcuda::wmma::mem_col_major
template <>
struct matrix_layout<matrix_transform_t::NonTranspose>
{
/// Type tag in nvcuda::wmma namespace
typedef nvcuda::wmma::col_major tag;
/// Column major layout
static const nvcuda::wmma::layout_t kind = nvcuda::wmma::mem_col_major;
/// Cutlass matrix transform kind
static const matrix_transform_t::kind_t cutlass_kind = matrix_transform_t::NonTranspose;
};
/// Maps matrix_transform_t::Transpose to nvcuda::wmma::mem_row_major
template <>
struct matrix_layout<matrix_transform_t::Transpose>
{
/// Type tag in nvcuda::wmma namespace
typedef nvcuda::wmma::row_major tag;
/// Row major layout
static const nvcuda::wmma::layout_t kind = nvcuda::wmma::mem_row_major;
/// Cutlass matrix transform kind
static const matrix_transform_t::kind_t cutlass_kind = matrix_transform_t::Transpose;
};
/*!
* \brief Warp-synchronous matrix multiply-accumulate abstraction
*
* wmma_accumulator maps the CUDA WMMA API onto the GEMM structure
*/
template <
int WarpItemsY, /// Number of rows of the warp's accumulator tile
int WarpItemsX, /// Number of columns of the warp's accumulator tile
int WmmaItemsY, /// Number of rows in a single WMMA operation
int WmmaItemsX, /// Number of columns in a single WMMA operation
int WmmaItemsK, /// Inner dimension of WMMA operation
typename value_a_t, /// Type of A operand
typename value_b_t, /// Type of B operand
typename accum_t, /// Type of source and destination accumulators
matrix_transform_t::kind_t TransformA, /// Layout of A operand
matrix_transform_t::kind_t TransformB /// Layout of B operand
>
struct wmma_accumulator
{
public:
//-------------------------------------------------------------------------
// Constants and types
//-------------------------------------------------------------------------
enum
{
/// Number of WMMA blocks in warp row
WmmaBlocksX = divide_assert<WarpItemsX, WmmaItemsX>::value,
/// Number of WMMA blocks in a warp column
WmmaBlocksY = divide_assert<WarpItemsY, WmmaItemsY>::value,
};
/// Fragment type for matrix operand A
typedef nvcuda::wmma::fragment<
nvcuda::wmma::matrix_a,
WmmaItemsY,
WmmaItemsX,
WmmaItemsK,
value_a_t,
typename matrix_layout<TransformA>::tag>
fragment_a_t;
/// Fragment type for matrix operand B
typedef nvcuda::wmma::fragment<
nvcuda::wmma::matrix_b,
WmmaItemsY,
WmmaItemsX,
WmmaItemsK,
value_b_t,
typename matrix_layout<TransformB>::tag>
fragment_b_t;
/// Fragment type for accumulator
typedef nvcuda::wmma::fragment<
nvcuda::wmma::accumulator,
WmmaItemsY,
WmmaItemsX,
WmmaItemsK,
accum_t>
accumulator_t;
/// Scratch storage layout
struct scratch_storage_t
{
/// Initialization vector
uint4 zero_slab;
};
public:
//-------------------------------------------------------------------------
// Data members
//-------------------------------------------------------------------------
/// Thread's tile of accumulators
accumulator_t accumulators[WmmaBlocksX][WmmaBlocksY];
public:
//-------------------------------------------------------------------------
// Constructor API
//-------------------------------------------------------------------------
/// Constructor initializes accumulators to zero
inline __device__
wmma_accumulator()
{
init();
}
//-------------------------------------------------------------------------
// Accumulator API
//-------------------------------------------------------------------------
/**
* \brief Zero-initialize thread accumulators.
*/
inline __device__
void init()
{
#pragma unroll
for (int x = 0; x < WmmaBlocksX; ++x)
{
#pragma unroll
for (int y = 0; y < WmmaBlocksY; ++y)
{
nvcuda::wmma::fill_fragment(accumulators[x][y], accum_t(0));
}
}
}
/**
* \brief Compute the product of tile_a and tile_b and add the result to
* the tile of accumulators.
*/
inline __device__
void multiply_accumulate(
fragment_a_t (&tile_a)[WmmaBlocksY],
fragment_b_t (&tile_b)[WmmaBlocksX])
{
#pragma unroll
for (int x = 0; x < WmmaBlocksX; ++x)
{
#pragma unroll
for (int y = 0; y < WmmaBlocksY; ++y)
{
nvcuda::wmma::mma_sync(accumulators[x][y], tile_a[y], tile_b[x], accumulators[x][y]);
}
}
}
};
} // namespace gemm
} // namespace cutlass
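Below is a minimal, self-contained sketch of the underlying WMMA API that wmma_accumulator wraps, assuming a single warp computing one 16x16x16 half->float product; it is not part of the original file:

#include <mma.h>

// One warp computes D = A * B with a zero-initialized accumulator fragment
__global__ void wmma_16x16x16_example(const __half *A, const __half *B, float *D)
{
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::col_major> frag_a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> frag_b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_c;

    wmma::fill_fragment(frag_c, 0.0f);               // same zero-init as init() above
    wmma::load_matrix_sync(frag_a, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(frag_b, B, 16);
    wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);  // same core op as multiply_accumulate()
    wmma::store_matrix_sync(D, frag_c, 16, wmma::mem_col_major);
}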

View File

@ -1,112 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Debugging and logging functionality
*/
#include <stdio.h>
namespace cutlass {
/******************************************************************************
* Debug and logging macros
******************************************************************************/
/**
* Formats and prints the given message to stdout
*/
#if !defined(CUDA_LOG)
#if !defined(__CUDA_ARCH__)
#define CUDA_LOG(format, ...) printf(format,__VA_ARGS__)
#else
#define CUDA_LOG(format, ...) printf("[block (%d,%d,%d), thread (%d,%d,%d)]: " format, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, __VA_ARGS__);
#endif
#endif
/**
* Formats and prints the given message to stdout only if DEBUG is defined
*/
#if !defined(CUDA_LOG_DEBUG)
#ifdef DEBUG
#define CUDA_LOG_DEBUG(format, ...) CUDA_LOG(format, __VA_ARGS__)
#else
#define CUDA_LOG_DEBUG(format, ...)
#endif
#endif
/**
* \brief The corresponding error message is printed to \p stderr (or \p stdout in device code) along with the supplied source context.
*
* \return The CUDA error.
*/
__host__ __device__ inline cudaError_t cuda_perror_impl(
cudaError_t error,
const char* filename,
int line)
{
(void)filename;
(void)line;
if (error)
{
#if !defined(__CUDA_ARCH__)
fprintf(stderr, "CUDA error %d [%s, %d]: %s\n", error, filename, line, cudaGetErrorString(error));
fflush(stderr);
#else
printf("CUDA error %d [%s, %d]\n", error, filename, line);
#endif
}
return error;
}
/**
* \brief Perror macro
*/
#ifndef CUDA_PERROR
#define CUDA_PERROR(e) cuda_perror_impl((cudaError_t) (e), __FILE__, __LINE__)
#endif
/**
* \brief Perror macro with exit
*/
#ifndef CUDA_PERROR_EXIT
#define CUDA_PERROR_EXIT(e) if (cuda_perror_impl((cudaError_t) (e), __FILE__, __LINE__)) { exit(1); }
#endif
/**
* \brief Perror macro only if DEBUG is defined
*/
#ifndef CUDA_PERROR_DEBUG
#ifdef DEBUG
#define CUDA_PERROR_DEBUG(e) CUDA_PERROR(e)
#else
#define CUDA_PERROR_DEBUG(e) (e)
#endif
#endif
} // namespace cutlass
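A hypothetical host-side usage of the macros above (assuming the CUDA runtime headers are included); this snippet is not from the original file:

cudaError_t clear_buffer_example(float *d_buffer, size_t bytes)
{
    CUDA_LOG("clearing %zu bytes\n", bytes);               // prints unconditionally
    CUDA_PERROR_EXIT(cudaMemset(d_buffer, 0, bytes));      // print and exit(1) on failure
    return CUDA_PERROR_DEBUG(cudaDeviceSynchronize());     // checked only when DEBUG is defined
}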

View File

@ -1,216 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Utilities for device introspection
*/
#include "debug.h"
#include "nv_std.h"
#include "printable.h"
namespace cutlass {
/******************************************************************************
* math_operation_class_t
*
* Enumeration to select the appropriate math operation
*
* The assumption is multiple math operations may be used to compute GEMM
* for a given selection of operand and accumulator types.
*
******************************************************************************/
/// Math operation
enum class math_operation_class_t
{
scalar, // scalar (and vector) multiply-accumulate operations
matrix // Volta tensor operations
};
/******************************************************************************
* arch_family_t
******************************************************************************/
/**
* \brief Enumeration of NVIDIA GPU architectural families
*/
struct arch_family_t
{
/// \brief Enumerants
enum kind_t
{
Unsupported = 0,
Kepler = 3,
Maxwell = 5,
Volta = 7,
};
/// Enumerant value
kind_t kind;
/// Default constructor
arch_family_t() : kind(Unsupported) {}
/// Copy constructor
arch_family_t(const kind_t &other_kind) : kind(other_kind) {}
/// Cast to kind_t
operator kind_t() const { return kind; }
/// Returns the instance as a string
__host__ __device__ inline
char const* to_string() const
{
switch (kind)
{
case Kepler: return "Kepler";
case Maxwell: return "Maxwell";
case Volta: return "Volta";
case Unsupported:
default: return "Unsupported";
}
}
/// Insert the formatted instance into the output stream
void print(std::ostream& out) const { out << to_string(); }
};
/**
* Macro for architecture targeted by the current compiler pass
*/
#if defined(__CUDA_ARCH__)
#define CUTLASS_ARCH __CUDA_ARCH__
#else
#define CUTLASS_ARCH 0
#endif
/**
* Macro for architecture family targeted by the current compiler pass
*/
#define CUTLASS_ARCH_FAMILY \
( \
(CUTLASS_ARCH < 300) ? \
arch_family_t::Unsupported : \
(CUTLASS_ARCH < 500) ? \
arch_family_t::Kepler : \
(CUTLASS_ARCH < 700) ? \
arch_family_t::Maxwell : \
arch_family_t::Volta \
)
/******************************************************************************
* Device introspection
******************************************************************************/
/**
* Empty kernel for querying PTX manifest metadata (e.g., version) for the current device
*/
template <typename T>
__global__ void empty_kernel(void) { }
/**
* \brief Retrieves the PTX version that will be used on the current device (major * 100 + minor * 10)
*/
cudaError_t ptx_version(int &version)
{
struct Dummy
{
/// Type definition of the empty_kernel kernel entry point
typedef void (*EmptyKernelPtr)();
/// Force empty_kernel<void> to be generated if this class is used
EmptyKernelPtr Empty()
{
return empty_kernel<void>;
}
};
cudaError_t error = cudaSuccess;
do
{
cudaFuncAttributes empty_kernel_attrs;
if (CUDA_PERROR_DEBUG(error = cudaFuncGetAttributes(&empty_kernel_attrs, empty_kernel<void>))) break;
version = empty_kernel_attrs.ptxVersion * 10;
}
while (0);
return error;
}
/**
* \brief Retrieves the SM version (major * 100 + minor * 10) for the current device
*/
cudaError_t get_sm_version(int &sm_version)
{
cudaError_t error = cudaSuccess;
// Get device ordinal
int device_ordinal;
if (CUDA_PERROR_DEBUG(error = cudaGetDevice(&device_ordinal)))
return error;
// Fill in SM version
int major, minor;
if (CUDA_PERROR_DEBUG(error = cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device_ordinal)))
return error;
if (CUDA_PERROR_DEBUG(error = cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device_ordinal)))
return error;
sm_version = major * 100 + minor * 10;
return error;
}
/**
* \brief Retrieves the SM count for the current device
*/
cudaError_t get_sm_count(int &sm_count)
{
cudaError_t error = cudaSuccess;
// Get device ordinal
int device_ordinal;
if (CUDA_PERROR_DEBUG(error = cudaGetDevice(&device_ordinal)))
return error;
// Get SM count
if (CUDA_PERROR_DEBUG(error = cudaDeviceGetAttribute (&sm_count, cudaDevAttrMultiProcessorCount, device_ordinal)))
return error;
return error;
}
} // namespace cutlass
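A hypothetical host-side caller of the introspection helpers above (not part of the original file):

cudaError_t print_device_info_example()
{
    cudaError_t error;
    int ptx = 0, sm = 0, sms = 0;
    if ((error = cutlass::ptx_version(ptx)))    return error;   // e.g. 610 for PTX ISA 6.1
    if ((error = cutlass::get_sm_version(sm)))  return error;   // e.g. 700 for sm_70
    if ((error = cutlass::get_sm_count(sms)))   return error;
    printf("PTX %d, SM %d, %d SMs\n", ptx, sm, sms);
    return cudaSuccess;
}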

View File

@ -1,484 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief I/O device intrinsics
*/
#include <stdint.h>
#include <cuda_fp16.h>
#include "nv_std.h"
#include "math.h"
namespace cutlass {
/******************************************************************************
* io_vector
******************************************************************************/
/**
* Base aligned storage for IO vector
*/
template <typename value_t, int VectorItems, int AlignBytes> struct io_vector_base;
template <typename value_t, int VectorItems> struct __align__(1) io_vector_base<value_t, VectorItems, 1> { value_t buff[VectorItems]; };
template <typename value_t, int VectorItems> struct __align__(2) io_vector_base<value_t, VectorItems, 2> { value_t buff[VectorItems]; };
template <typename value_t, int VectorItems> struct __align__(4) io_vector_base<value_t, VectorItems, 4> { value_t buff[VectorItems]; };
template <typename value_t, int VectorItems> struct __align__(8) io_vector_base<value_t, VectorItems, 8> { value_t buff[VectorItems]; };
template <typename value_t, int VectorItems> struct __align__(16) io_vector_base<value_t, VectorItems, 16> { value_t buff[VectorItems]; };
/**
* \brief Aligned vector type for coarsening data movement instructions
*
* Exposes the member constant \p VectorItems, the actual number of component
* values comprising the io_vector
*/
template <
typename value_t, ///< Component value type
int MaxVectorItems, ///< Maximum allowable component values
int MaxAlignBytes ///< Maximum allowable alignment
= __NV_STD_MIN(16, MaxVectorItems * sizeof(value_t)),
int AlignBytes ///< Actual alignment
= __NV_STD_MIN(sizeof(value_t) * MaxVectorItems, MaxAlignBytes),
int VectorItems ///< Actual number of component values
= divide_assert<AlignBytes, sizeof(value_t)>::value,
bool MustAlias ///< Whether we need to alias during loads/stores
= (VectorItems > 4)>
struct io_vector;
/**
* IO vector (specialization for VectorItems <= 4)
*/
template <
typename value_t,
int MaxVectorItems,
int MaxAlignBytes,
int _AlignBytes,
int _VectorItems>
struct io_vector <
value_t,
MaxVectorItems,
MaxAlignBytes,
_AlignBytes,
_VectorItems,
false>
:
io_vector_base<value_t, _VectorItems, _AlignBytes>
{
enum
{
VectorItems = _VectorItems,
AlignBytes = _AlignBytes
};
static_assert(is_pow2<AlignBytes>::value, "I/O vector alignment must be a power-of-two.");
static_assert((AlignBytes <= 16), "I/O vector alignment must be <= 16B.");
inline __device__
void load(const io_vector *ptr)
{
*this = *ptr;
}
inline __device__
void load(const value_t *ptr)
{
*this = *reinterpret_cast<const io_vector*>(ptr);
}
inline __device__
void store(io_vector *ptr) const
{
*ptr = *this;
}
inline __device__
void store(value_t *ptr) const
{
*reinterpret_cast<io_vector*>(ptr) = *this;
}
};
/**
* IO vector (specialization for VectorItems > 4)
*
* NB: Workaround for NVCC not generating 128-bit loads/stores for aligned
* structures having component types < 32b
*/
template <
typename value_t,
int MaxVectorItems,
int MaxAlignBytes,
int _AlignBytes,
int _VectorItems>
struct io_vector <
value_t,
MaxVectorItems,
MaxAlignBytes,
_AlignBytes,
_VectorItems,
true>
:
io_vector_base<value_t, _VectorItems, _AlignBytes>
{
enum
{
VectorItems = _VectorItems,
AlignBytes = _AlignBytes
};
static_assert(is_pow2<AlignBytes>::value, "I/O vector alignment must be a power-of-two.");
static_assert((AlignBytes <= 16), "I/O vector alignment must be <= 16B.");
typedef typename nv_std::conditional<(AlignBytes == 8),
uint2, // Use 8B load
uint4> // Use 16B load
::type align_t;
inline __device__
void load(const io_vector *ptr)
{
*reinterpret_cast<align_t*>(this) = *reinterpret_cast<const align_t*>(ptr);
}
inline __device__
void load(const value_t *ptr)
{
*reinterpret_cast<align_t*>(this) = *reinterpret_cast<const align_t*>(ptr);
}
inline __device__
void store(io_vector *ptr) const
{
*reinterpret_cast<align_t*>(ptr) = *reinterpret_cast<const align_t*>(this);
}
inline __device__
void store(value_t *ptr) const
{
*reinterpret_cast<align_t*>(ptr) = *reinterpret_cast<const align_t*>(this);
}
};
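// Illustrative sketch (not part of the original file): using io_vector to coarsen a
// copy of four floats into a single 16B load and a single 16B store, assuming both
// pointers are 16B-aligned. Here AlignBytes resolves to 16 and VectorItems to 4.
//
//   __device__ void copy4_example(float *dst, const float *src)
//   {
//       io_vector<float, 4> v;
//       v.load(src);    // one 16B vector load
//       v.store(dst);   // one 16B vector store
//   }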
/******************************************************************************
* Macro expansions for vector loads
******************************************************************************/
/**
* Define vector-4 LD specialization for the given load modifier
*/
#define CUTLASS_LD_V4(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
value_t (&dest)[4], \
ptr_t ptr) \
{ \
asm volatile ("ld."#load_modifier".v4."#ptx_type" {%0, %1, %2, %3}, [%4];\n" \
: \
"="#val_constraint(dest[0]), \
"="#val_constraint(dest[1]), \
"="#val_constraint(dest[2]), \
"="#val_constraint(dest[3]) \
: \
#ptr_constraint(ptr)); \
}
/**
* Define vector-2 LD specialization for the given load modifier
*/
#define CUTLASS_LD_V2(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
value_t (&dest)[2], \
ptr_t ptr) \
{ \
asm volatile ("ld."#load_modifier".v2."#ptx_type" {%0, %1}, [%2];\n" \
: \
"="#val_constraint(dest[0]), \
"="#val_constraint(dest[1]) \
: \
#ptr_constraint(ptr)); \
}
/**
* Define vector-1 LD specialization for the given load modifier
*/
#define CUTLASS_LD_V1(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
value_t (&dest)[1], \
ptr_t ptr) \
{ \
asm volatile ("ld."#load_modifier"."#ptx_type" %0, [%1];\n" \
: \
"="#val_constraint(dest[0]) \
: \
#ptr_constraint(ptr)); \
}
/**
* Define powers-of-two vector LD specializations
*/
#define CUTLASS_LD_ALL(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_LD_V4(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_LD_V2(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_LD_V1(f_name, value_t, load_modifier, ptx_type, val_constraint, ptr_constraint)
/******************************************************************************
* Macro expansions for vector stores
******************************************************************************/
/**
* Define vector-4 ST specialization for the given store modifier
*/
#define CUTLASS_ST_V4(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
ptr_t ptr, \
const value_t (&src)[4]) \
{ \
asm volatile ("st."#store_modifier".v4."#ptx_type" [%0], {%1, %2, %3, %4};\n" \
: : \
#ptr_constraint(ptr), \
#val_constraint(src[0]), \
#val_constraint(src[1]), \
#val_constraint(src[2]), \
#val_constraint(src[3])); \
}
/**
* Define vector-2 ST specialization for the given store modifier
*/
#define CUTLASS_ST_V2(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
ptr_t ptr, \
const value_t (&src)[2]) \
{ \
asm volatile ("st."#store_modifier".v2."#ptx_type" [%0], {%1, %2};\n" \
: : \
#ptr_constraint(ptr), \
#val_constraint(src[0]), \
#val_constraint(src[1])); \
}
/**
* Define vector-1 ST specialization for the given store modifier
*/
#define CUTLASS_ST_V1(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
template <typename ptr_t> \
inline __device__ \
void f_name( \
ptr_t ptr, \
const value_t (&src)[1]) \
{ \
asm volatile ("st."#store_modifier"."#ptx_type" [%0], %1;\n" \
: : \
#ptr_constraint(ptr), \
#val_constraint(src[0])); \
}
/**
* Define powers-of-two vector ST specializations
*/
#define CUTLASS_ST_ALL(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_ST_V4(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_ST_V2(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint) \
CUTLASS_ST_V1(f_name, value_t, store_modifier, ptx_type, val_constraint, ptr_constraint)
/******************************************************************************
* Macro expansions for vector IO
******************************************************************************/
/**
* Define global and shared LD specializations
*/
#define CUTLASS_IO(value_t, ptx_type, val_constraint) \
CUTLASS_LD_ALL(ldg_cg_internal, value_t, global.cg, ptx_type, val_constraint, l) \
CUTLASS_ST_ALL(stg_cg_internal, value_t, global.cg, ptx_type, val_constraint, l)
// Define IO for useful types
CUTLASS_IO(double, f64, d)
CUTLASS_IO(float, f32, f)
CUTLASS_IO(int64_t, b64, l)
CUTLASS_IO(int32_t, b32, r)
CUTLASS_IO(int16_t, b16, h)
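// Note (illustrative, not part of the original expansion trace): for example,
// CUTLASS_IO(float, f32, f) generates ldg_cg_internal() and stg_cg_internal()
// overloads taking float[4], float[2], and float[1] array references, each
// emitting an ld.global.cg.* / st.global.cg.* instruction of matching vector
// width with "f" register constraints and an "l" pointer constraint.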
// Macro cleanup
#undef CUTLASS_IO
#undef CUTLASS_LD_ALL
#undef CUTLASS_LD_V4
#undef CUTLASS_LD_V2
#undef CUTLASS_LD_V1
#undef CUTLASS_ST_ALL
#undef CUTLASS_ST_V4
#undef CUTLASS_ST_V2
#undef CUTLASS_ST_V1
/******************************************************************************
* I/O cast types
******************************************************************************/
/// Provides the type to which to reinterpret_cast a given vector
template <
typename value_t,
int IoVecDim,
int ValueBytes = sizeof(value_t)>
struct io_cast
{
typedef value_t type[IoVecDim];
};
/// Provides the type to which to reinterpret_cast a vector of 1B types
template <
typename value_t,
int IoVecDim>
struct io_cast<value_t, IoVecDim, 1>
{
typedef typename nv_std::conditional<
(IoVecDim < 2),
int8_t[1], // Use 8b load
typename nv_std::conditional<
(IoVecDim < 4),
int16_t[1], // Use 16b load
int32_t[IoVecDim / 4]>::type>::type // Use up to 128b load
type;
};
/// Provides the type to which to reinterpret_cast a vector of 2B types
template <
typename value_t,
int IoVecDim>
struct io_cast<value_t, IoVecDim, 2>
{
typedef typename nv_std::conditional<
(IoVecDim < 2),
int16_t[1], // Use 16b load
int32_t[IoVecDim / 2]>::type // Use up to 128b load
type;
};
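// Worked examples (illustrative only), following the specializations above:
//   io_cast<double,  2>::type -> double[2]   (primary template: no repacking)
//   io_cast<int8_t,  8>::type -> int32_t[2]  (1B elements repacked into 32b words)
//   io_cast<int16_t, 4>::type -> int32_t[2]  (2B elements repacked into 32b words)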
/******************************************************************************
* ldg_cg intrinsics
******************************************************************************/
/// Load from global (cache-global modifier)
template <typename value_t, typename ptr_t>
inline __device__
void ldg_cg(
value_t &dest,
ptr_t d_in)
{
// Cast dest to a different array type if necessary
ldg_cg_internal(
reinterpret_cast<typename io_cast<value_t, 1>::type &>(dest),
d_in);
}
/// Load from global (cache-global modifier)
template <typename value_t, int IoVecDim, typename ptr_t>
inline __device__
void ldg_cg(
value_t (&dest)[IoVecDim],
ptr_t d_in)
{
static_assert(is_pow2<IoVecDim>::value, "I/O vectors must be a power-of-two.");
// Cast dest to a different array type if necessary
ldg_cg_internal(
reinterpret_cast<typename io_cast<value_t, IoVecDim>::type &>(dest),
d_in);
}
/******************************************************************************
* stg_cg intrinsics
******************************************************************************/
/// Store to global (cache-global modifier)
template <typename ptr_t, typename value_t>
inline __device__
void stg_cg(
ptr_t dest,
const value_t &src)
{
// Cast src to a different array type if necessary
stg_cg_internal(
dest,
reinterpret_cast<const typename io_cast<value_t, 1>::type &>(src));
}
/// Store to global (cache-global modifier)
template <typename ptr_t, int IoVecDim, typename value_t>
inline __device__
void stg_cg(
ptr_t dest,
const value_t (&src)[IoVecDim])
{
static_assert(is_pow2<IoVecDim>::value, "I/O vectors must be a power-of-two.");
// Cast src to a different array type if necessary
stg_cg_internal(
dest,
reinterpret_cast<const typename io_cast<value_t, IoVecDim>::type &>(src));
}
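/// Usage sketch (illustrative; not part of the original header): copy a
/// four-wide float fragment between global-memory locations through the
/// cache-global intrinsics above. Array lengths must be powers of two.
inline __device__
void copy_vec4_cg_example(float *dst, const float *src)
{
float frag[4];
ldg_cg(frag, src); // emits ld.global.cg.v4.f32
stg_cg(dst, frag); // emits st.global.cg.v4.f32
}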
} // namespace cutlass

View File

@ -1,189 +0,0 @@
/*
* Copyright 1993-2017 NVIDIA Corporation. All rights reserved.
*
* NOTICE TO LICENSEE:
*
* This source code and/or documentation ("Licensed Deliverables") are
* subject to NVIDIA intellectual property rights under U.S. and
* international Copyright laws.
*
* These Licensed Deliverables contained herein is PROPRIETARY and
* CONFIDENTIAL to NVIDIA and is being provided under the terms and
* conditions of a form of NVIDIA software license agreement by and
* between NVIDIA and Licensee ("License Agreement") or electronically
* accepted by Licensee. Notwithstanding any terms or conditions to
* the contrary in the License Agreement, reproduction or disclosure
* of the Licensed Deliverables to any third party without the express
* written consent of NVIDIA is prohibited.
*
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
* SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE. IT IS
* PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
* NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
* DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
* NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
* SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
* DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
* WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
* ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
* OF THESE LICENSED DELIVERABLES.
*
* U.S. Government End Users. These Licensed Deliverables are a
* "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
* 1995), consisting of "commercial computer software" and "commercial
* computer software documentation" as such terms are used in 48
* C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
* only as a commercial end item. Consistent with 48 C.F.R.12.212 and
* 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
* U.S. Government End Users acquire the Licensed Deliverables with
* only those rights set forth herein.
*
* Any use of the Licensed Deliverables in individual and commercial
* software must include, in the user documentation and internal
* comments to the code, the above Disclaimer and U.S. Government End
* Users Notice.
*/
#pragma once
/**
* \file
* \brief Math utilities
*/
#include "nv_std.h"
namespace cutlass {
/******************************************************************************
* Static math utilities
******************************************************************************/
/**
* Statically determine if N is a power-of-two
*/
template <int N>
struct is_pow2 : nv_std::integral_constant<bool, (N & (N - 1)) == 0>
{};
/**
* Statically determine log2(N), rounded down
*/
template <int N, int CurrentVal = N, int Count = 0>
struct log2_down
{
/// Static logarithm value
enum { value = log2_down<N, (CurrentVal >> 1), Count + 1>::value };
};
// Base case
template <int N, int Count>
struct log2_down<N, 1, Count>
{
enum { value = Count };
};
/**
* Statically determine log2(N), rounded up
*/
template <int N, int CurrentVal = N, int Count = 0>
struct log2_up
{
/// Static logarithm value
enum { value = log2_up<N, (CurrentVal >> 1), Count + 1>::value };
};
// Base case
template <int N, int Count>
struct log2_up<N, 1, Count>
{
enum { value = ((1 << Count) < N) ? Count + 1 : Count };
};
/**
* Statically estimate sqrt(N) to the nearest power-of-two
*/
template <int N>
struct sqrt_est
{
enum { value = 1 << (log2_up<N>::value / 2) };
};
/**
* For performing a constant-division with a compile-time assertion that the
* Divisor evenly-divides the Dividend.
*/
template <int Dividend, int Divisor>
struct divide_assert
{
enum { value = Dividend / Divisor};
static_assert((Dividend % Divisor == 0), "Not an even multiple");
};
/******************************************************************************
* Rounding
******************************************************************************/
/**
* Round dividend up to the nearest multiple of divisor
*/
template <typename dividend_t, typename divisor_t>
inline __host__ __device__
dividend_t round_nearest(dividend_t dividend, divisor_t divisor)
{
return ((dividend + divisor - 1) / divisor) * divisor;
}
/**
* Greatest common divisor
*/
template <typename value_t>
inline __host__ __device__
value_t gcd(value_t a, value_t b)
{
for (;;)
{
if (a == 0) return b;
b %= a;
if (b == 0) return a;
a %= b;
}
}
/**
* Least common multiple
*/
template <typename value_t>
inline __host__ __device__
value_t lcm(value_t a, value_t b)
{
value_t temp = gcd(a, b);
return temp ? (a / temp * b) : 0;
}
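// Illustrative evaluations (not part of the original header) of the utilities above:
//   is_pow2<8>::value           == true      log2_down<10>::value == 3
//   log2_up<10>::value          == 4         sqrt_est<20>::value  == 4
//   divide_assert<12, 4>::value == 3         round_nearest(10, 4) == 12
//   gcd(12, 18) == 6                         lcm(4, 6) == 12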
} // namespace cutlass

View File

@ -1,94 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Enumeration of dense matrix view transformations
*/
#include "printable.h"
namespace cutlass {
/******************************************************************************
* matrix_transform_t
******************************************************************************/
/**
* \brief Enumeration of dense matrix view transformations
*
* These enumerators (and corresponding tag types) describe which view
* transformation needs to be applied prior to operation upon a given dense
* matrix. Its values correspond to Fortran characters 'n' (non-transpose),
* 't' (transpose) and 'c' (conjugate transpose) that are often
* used as parameters to legacy BLAS implementations
*/
struct matrix_transform_t : printable_t
{
/// \brief Enumerants (same as CUBLAS)
enum kind_t
{
/// Invalid view
Invalid = -1,
/// Non-transpose view
NonTranspose = 0,
/// Transpose view
Transpose = 1,
/// Conjugate transpose view
ConjugateTranspose = 2,
};
/// Enumerant value
kind_t kind;
/// Default constructor
matrix_transform_t() : kind(Invalid) {}
/// Copy constructor
matrix_transform_t(const kind_t &other_kind) : kind(other_kind) {}
/// Cast to kind_t
operator kind_t() const { return kind; }
/// Returns the instance as a string
__host__ __device__ inline
char const* to_string() const
{
switch (kind)
{
case NonTranspose: return "NonTranspose";
case Transpose: return "Transpose";
case ConjugateTranspose: return "ConjugateTranspose";
default: return "Invalid";
}
}
/// Insert the formatted instance into the output stream
void print(std::ostream& out) const { out << to_string(); }
};
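// Usage sketch (illustrative only): the enumerants mirror the BLAS 'n'/'t'/'c'
// arguments and share numeric values with cublasOperation_t.
//
//   matrix_transform_t transform = matrix_transform_t::Transpose;
//   if (transform == matrix_transform_t::NonTranspose) { /* op(A) = A */ }
//   std::cout << transform << std::endl;   // prints "Transpose" via printable_t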
} // namespace cutlass

View File

@ -1,727 +0,0 @@
/*
* Copyright 1993-2017 NVIDIA Corporation. All rights reserved.
*
* NOTICE TO LICENSEE:
*
* This source code and/or documentation ("Licensed Deliverables") are
* subject to NVIDIA intellectual property rights under U.S. and
* international Copyright laws.
*
* These Licensed Deliverables contained herein is PROPRIETARY and
* CONFIDENTIAL to NVIDIA and is being provided under the terms and
* conditions of a form of NVIDIA software license agreement by and
* between NVIDIA and Licensee ("License Agreement") or electronically
* accepted by Licensee. Notwithstanding any terms or conditions to
* the contrary in the License Agreement, reproduction or disclosure
* of the Licensed Deliverables to any third party without the express
* written consent of NVIDIA is prohibited.
*
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
* SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE. IT IS
* PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
* NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
* DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
* NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
* SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
* DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
* WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
* ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
* OF THESE LICENSED DELIVERABLES.
*
* U.S. Government End Users. These Licensed Deliverables are a
* "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
* 1995), consisting of "commercial computer software" and "commercial
* computer software documentation" as such terms are used in 48
* C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
* only as a commercial end item. Consistent with 48 C.F.R.12.212 and
* 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
* U.S. Government End Users acquire the Licensed Deliverables with
* only those rights set forth herein.
*
* Any use of the Licensed Deliverables in individual and commercial
* software must include, in the user documentation and internal
* comments to the code, the above Disclaimer and U.S. Government End
* Users Notice.
*/
#pragma once
/**
* \file
* \brief C++ features that may be otherwise unimplemented for CUDA device functions.
*
* This file has four components:
*
* (1) Macros:
* - Empty macro defines for C++ keywords not supported by the current
* version of C++. These simply allow compilation to proceed (but do
* not provide the added semantics).
* - \p noexcept
* - \p constexpr
* - \p nullptr
* - \p static_assert
*
* - Macro functions that we need in constant expressions because the
* C++ equivalents require constexpr compiler support. These are
* prefixed with \p __NV_STD_*
* - \p __NV_STD_MAX
* - \p __NV_STD_MIN
*
* (2) Re-implementations of STL functions and types:
* - C++ features that need the \p __device__ annotation. These are
* placed into the \p nv_std namespace.
* - \p plus
* - \p less
* - \p greater
* - \p min
* - \p max
* - \p methods on std::pair (==, !=, <, <=, >, >=, and make_pair())
*
* (3) Stop-gap implementations of unsupported STL functions and types:
* - STL functions and types defined by C++ 11/14/17/etc. that are not
* provided by the current version of C++. These are placed into the
* \p nv_std namespace
* - \p integral_constant
* - \p nullptr_t
* - \p true_type
* - \p false_type
* - \p bool_constant
* - \p enable_if
* - \p conditional
* - \p is_same
* - \p is_base_of
* - \p remove_const
* - \p remove_volatile
* - \p remove_cv
* - \p is_volatile
* - \p is_pointer
* - \p is_void
* - \p is_integral
* - \p is_floating_point
* - \p is_arithmetic
* - \p is_fundamental
* - \p is_trivially_copyable
* - \p alignment_of
* - \p aligned_storage
*
* (4) Functions and types that are STL-like (but aren't in the STL):
* - \p TODO: min and max functors?
*
* The idea is that, as we drop support for older compilers, we can simply #define
* the \p __NV_STD_XYZ macros and \p nv_std namespace to alias their C++
* counterparts (or trivially find-and-replace their occurrences in code text).
*/
//-----------------------------------------------------------------------------
// Include STL files that nv_std provides functionality for
//-----------------------------------------------------------------------------
#include <cstddef> // nullptr_t
#include <algorithm> // Minimum/maximum operations
#include <functional> // Arithmetic operations
#include <utility> // For methods on std::pair
#if (!defined(_MSC_VER) && (__cplusplus >= 201103L)) || (defined(_MSC_VER) && (_MSC_VER >= 1500))
#include <type_traits> // For integral constants, conditional metaprogramming, and type traits
#endif
/******************************************************************************
* Macros
******************************************************************************/
//-----------------------------------------------------------------------------
// Keywords
//-----------------------------------------------------------------------------
/// noexcept, constexpr
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1900))
#ifndef noexcept
#define noexcept
#endif
#ifndef constexpr
#define constexpr
#endif
#endif
/// nullptr
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1310 ))
#ifndef nullptr
#define nullptr 0
#endif
#endif
/// static_assert
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1600 ))
#ifndef static_assert
#define __nv_std_cat_(a, b) a ## b
#define __nv_std_cat(a, b) __nv_std_cat_(a, b)
#define static_assert(__e, __m) typedef int __nv_std_cat(AsSeRt, __LINE__)[(__e) ? 1 : -1]
#endif
#endif
//-----------------------------------------------------------------------------
// Functions
//-----------------------------------------------------------------------------
/// Select maximum(a, b)
#ifndef __NV_STD_MAX
#define __NV_STD_MAX(a, b) (((b) > (a)) ? (b) : (a))
#endif
/// Select minimum(a, b)
#ifndef __NV_STD_MIN
#define __NV_STD_MIN(a, b) (((b) < (a)) ? (b) : (a))
#endif
/******************************************************************************
* Re-implementations
******************************************************************************/
namespace nv_std {
//-----------------------------------------------------------------------------
// Arithmetic operations, comparisons <functional>
//-----------------------------------------------------------------------------
/// nv_std::plus
template <typename T>
struct plus
{
inline __host__ __device__
constexpr T operator()(const T &lhs, const T &rhs) const
{
return lhs + rhs;
}
};
/// std::less
template <typename T>
struct less
{
inline __host__ __device__
constexpr bool operator()(const T &lhs, const T &rhs) const
{
return lhs < rhs;
}
};
/// std::greater
template <typename T>
struct greater
{
inline __host__ __device__
constexpr bool operator()(const T &lhs, const T &rhs) const
{
return lhs > rhs;
}
};
//-----------------------------------------------------------------------------
// Minimum/maximum operations <algorithm>
//-----------------------------------------------------------------------------
/// std::min
template <typename T>
inline __host__ __device__
constexpr const T& min(
const T& a,
const T& b)
{
return (b < a) ? b : a;
}
/// std::max
template <typename T>
inline __host__ __device__
constexpr const T& max(
const T& a,
const T& b)
{
return (a < b) ? b : a;
}
//-----------------------------------------------------------------------------
// Methods on std::pair
//-----------------------------------------------------------------------------
using std::pair;
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator==( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return (lhs.first == rhs.first) && (lhs.second == rhs.second);
}
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator!=( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return (lhs.first != rhs.first) || (lhs.second != rhs.second);
}
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator<( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return (lhs.first < rhs.first) ?
true :
(rhs.first < lhs.first) ?
false :
(lhs.second < rhs.second);
}
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator<=( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return !(rhs < lhs);
}
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator>( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return (rhs < lhs);
}
template< class T1, class T2 >
inline __host__ __device__
constexpr bool operator>=( const pair<T1,T2>& lhs, const pair<T1,T2>& rhs )
{
return !(lhs < rhs);
}
template< class T1, class T2 >
inline __host__ __device__
std::pair<T1,T2> make_pair( T1 t, T2 u )
{
std::pair<T1,T2> retval;
retval.first = t;
retval.second = u;
return retval;
}
} // namespace nv_std
/******************************************************************************
* Implementations of C++ 11/14/17/... STL features
******************************************************************************/
namespace nv_std {
//-----------------------------------------------------------------------------
// Integral constant helper types <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::integral_constant
template <typename value_t, value_t V>
struct integral_constant;
/// std::integral_constant
template <typename value_t, value_t V>
struct integral_constant
{
static const value_t value = V;
typedef value_t value_type;
typedef integral_constant<value_t, V> type;
inline __host__ __device__ operator value_type() const
{
return value;
}
inline __host__ __device__ const value_type operator()() const
{
return value;
}
};
#else
using std::integral_constant;
using std::pair;
#endif
/// The type used as a compile-time boolean with true value.
typedef integral_constant<bool, true> true_type;
/// The type used as a compile-time boolean with false value.
typedef integral_constant<bool, false> false_type;
#if (!defined(_MSC_VER) && (__cplusplus < 201703L)) || (defined(_MSC_VER) && (_MSC_VER < 1900))
/// std::bool_constant
template <bool V>
struct bool_constant : nv_std::integral_constant<bool, V>
{};
#else
using std::bool_constant;
#endif
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1700))
/// std::nullptr_t
struct nullptr_t {};
#else
using std::nullptr_t;
#endif
//-----------------------------------------------------------------------------
// Conditional metaprogramming <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1600))
/// std::enable_if (true specialization)
template<bool C, typename T = void>
struct enable_if {
typedef T type;
};
/// std::enable_if (false specialization)
template<typename T>
struct enable_if<false, T> { };
/// std::conditional (true specialization)
template<bool B, class T, class F>
struct conditional { typedef T type; };
/// std::conditional (false specialization)
template<class T, class F>
struct conditional<false, T, F> { typedef F type; };
#else
using std::enable_if;
using std::conditional;
#endif
//-----------------------------------------------------------------------------
// Const/volatility specifiers <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::remove_const (non-const specialization)
template <typename T> struct remove_const { typedef T type; };
/// std::remove_const (const specialization)
template <typename T> struct remove_const<const T> { typedef T type; };
/// std::remove_volatile (non-volatile specialization)
template <typename T> struct remove_volatile { typedef T type; };
/// std::remove_volatile (volatile specialization)
template <typename T> struct remove_volatile<volatile T> { typedef T type; };
/// std::remove_cv
template <typename T>
struct remove_cv {
typedef typename remove_volatile<typename remove_const<T>::type>::type type;
};
#else
using std::remove_const;
using std::remove_volatile;
using std::remove_cv;
#endif
//-----------------------------------------------------------------------------
// Type relationships <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::is_same (false specialization)
template <typename A, typename B>
struct is_same : false_type
{};
/// std::is_same (true specialization)
template <typename A>
struct is_same<A, A> : true_type
{};
/// Helper for std::is_base_of
template<typename BaseT, typename DerivedT>
struct is_base_of_helper
{
typedef char (&yes)[1];
typedef char (&no)[2];
template<typename B, typename D>
struct dummy
{
operator B*() const;
operator D*();
};
template<typename T>
static yes check(DerivedT*, T);
static no check(BaseT*, int);
static const bool value = sizeof(check(dummy<BaseT, DerivedT>(), int())) == sizeof(yes);
};
/// std::is_base_of
template <typename BaseT, typename DerivedT>
struct is_base_of : integral_constant<
bool,
(is_base_of_helper<typename remove_cv<BaseT>::type, typename remove_cv<DerivedT>::type>::value) ||
(is_same<typename remove_cv<BaseT>::type, typename remove_cv<DerivedT>::type>::value)>
{};
#else
using std::is_same;
using std::is_base_of;
#endif
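// Illustrative evaluations (not part of the original header) of the traits above:
//   conditional<true, int, float>::type       -> int
//   is_same<int, int>::value                  -> true
//   is_same<int, unsigned int>::value         -> false
//   is_base_of<false_type, false_type>::value -> true  (a type counts as its own base here)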
//-----------------------------------------------------------------------------
// Type properties <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::is_volatile
template <typename T> struct is_volatile : false_type {};
template <typename T> struct is_volatile<volatile T> : true_type {};
/// Helper for std::is_pointer (false specialization)
template <typename T> struct is_pointer_helper : false_type {};
/// Helper for std::is_pointer (true specialization)
template <typename T> struct is_pointer_helper<T*> : true_type {};
/// std::is_pointer
template <typename T> struct is_pointer : is_pointer_helper<typename remove_cv<T>::type> {};
/// std::is_void
template <typename T>
struct is_void : is_same<void, typename remove_cv<T>::type>
{};
/// std::is_integral
template <typename T> struct is_integral : false_type {};
template <> struct is_integral<char> : true_type {};
template <> struct is_integral<signed char> : true_type {};
template <> struct is_integral<unsigned char> : true_type {};
template <> struct is_integral<short> : true_type {};
template <> struct is_integral<unsigned short> : true_type {};
template <> struct is_integral<int> : true_type {};
template <> struct is_integral<unsigned int> : true_type {};
template <> struct is_integral<long> : true_type {};
template <> struct is_integral<unsigned long> : true_type {};
template <> struct is_integral<long long> : true_type {};
template <> struct is_integral<unsigned long long> : true_type {};
template <typename T> struct is_integral<volatile T> : is_integral<T> {};
template <typename T> struct is_integral<const T> : is_integral<T> {};
template <typename T> struct is_integral<const volatile T> : is_integral<T> {};
/// std::is_floating_point
template <typename T>
struct is_floating_point : integral_constant<
bool,
(is_same<float, typename remove_cv<T>::type>::value ||
is_same<double, typename remove_cv<T>::type>::value)>
{};
/// std::is_arithmetic
template <typename T>
struct is_arithmetic :
integral_constant<bool, (is_integral<T>::value || is_floating_point<T>::value)>
{};
/// std::is_fundamental
template <typename T>
struct is_fundamental : integral_constant<
bool, (is_arithmetic<T>::value ||
is_void<T>::value ||
is_same<nullptr_t, typename remove_cv<T>::type>::value)>
{};
#else
using std::is_volatile;
using std::is_pointer;
using std::is_void;
using std::is_integral;
using std::is_floating_point;
using std::is_arithmetic;
using std::is_fundamental;
#endif
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || \
(defined(_MSC_VER) && (_MSC_VER < 1800)) || \
(defined(__GNUG__) && (__GNUC__ < 5))
/**
* std::is_trivially_copyable
*
* This implementation only evaluates true if T is fundamental or pointer
*
* Without help from partial template specializations provided by the user for
* a specific class or struct, this trait will never report that the specified
* class or struct is trivially-copyable; this is always safe,
* if possibly sub-optimal.
*/
template <typename T>
struct is_trivially_copyable :
integral_constant<bool, (is_fundamental<T>::value || is_pointer<T>::value)>
{};
#else
using std::is_trivially_copyable;
#endif
//-----------------------------------------------------------------------------
// Alignment and layout utilities
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::alignment_of
template <typename value_t>
struct alignment_of
{
struct pad
{
value_t val;
char byte;
};
enum
{
value = sizeof(pad) - sizeof(value_t)
};
};
#else
template <typename value_t>
struct alignment_of : std::alignment_of<value_t> {};
#endif
/* 16B specializations where 32-bit Win32 host compiler disagrees with device compiler */
template <> struct alignment_of<int4> { enum { value = 16 }; };
template <> struct alignment_of<uint4> { enum { value = 16 }; };
template <> struct alignment_of<float4> { enum { value = 16 }; };
template <> struct alignment_of<long4> { enum { value = 16 }; };
template <> struct alignment_of<ulong4> { enum { value = 16 }; };
template <> struct alignment_of<longlong2> { enum { value = 16 }; };
template <> struct alignment_of<ulonglong2> { enum { value = 16 }; };
template <> struct alignment_of<double2> { enum { value = 16 }; };
template <> struct alignment_of<longlong4> { enum { value = 16 }; };
template <> struct alignment_of<ulonglong4> { enum { value = 16 }; };
template <> struct alignment_of<double4> { enum { value = 16 }; };
// Specializations for volatile/const qualified types
template <typename value_t> struct alignment_of<volatile value_t> : alignment_of<value_t> {};
template <typename value_t> struct alignment_of<const value_t> : alignment_of<value_t> {};
template <typename value_t> struct alignment_of<const volatile value_t> : alignment_of<value_t> {};
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1800))
template<size_t Align> struct aligned_chunk;
template<> struct __align__(1) aligned_chunk<1> { uint8_t buff; };
template<> struct __align__(2) aligned_chunk<2> { uint16_t buff; };
template<> struct __align__(4) aligned_chunk<4> { uint32_t buff; };
template<> struct __align__(8) aligned_chunk<8> { uint32_t buff[2]; };
template<> struct __align__(16) aligned_chunk<16> { uint32_t buff[4]; };
template<> struct __align__(32) aligned_chunk<32> { uint32_t buff[8]; };
template<> struct __align__(64) aligned_chunk<64> { uint32_t buff[16]; };
template<> struct __align__(128) aligned_chunk<128> { uint32_t buff[32]; };
template<> struct __align__(256) aligned_chunk<256> { uint32_t buff[64]; };
template<> struct __align__(512) aligned_chunk<512> { uint32_t buff[128]; };
template<> struct __align__(1024) aligned_chunk<1024> { uint32_t buff[256]; };
template<> struct __align__(2048) aligned_chunk<2048> { uint32_t buff[512]; };
template<> struct __align__(4096) aligned_chunk<4096> { uint32_t buff[1024]; };
/// std::aligned_storage
template <size_t Len, size_t Align>
struct aligned_storage
{
typedef aligned_chunk<Align> type[Len / sizeof(aligned_chunk<Align>)];
};
#else
using std::aligned_storage;
#endif
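// Illustrative use of the alignment utilities above (not part of the original header):
// reserve 16B-aligned backing storage for a float4 and note its alignment.
//
//   typedef aligned_storage<sizeof(float4), alignment_of<float4>::value>::type storage_t;
//   // sizeof(storage_t) == 16 and alignment_of<float4>::value == 16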
} // namespace nv_std

View File

@ -1,64 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Pure virtual base class for printable types
*/
#include <iostream>
namespace cutlass {
/******************************************************************************
* printable_t
******************************************************************************/
/**
* Pure virtual base class for printable types
*/
struct printable_t
{
/// Returns the instance as a string
__host__ __device__ inline
virtual char const* to_string() const = 0;
/// Insert the formatted instance into the output stream
virtual void print(std::ostream& out) const = 0;
/// Destructor
virtual ~printable_t() {}
};
/// Insert the formatted \p printable into the output stream
inline std::ostream& operator<<(
std::ostream& out,
printable_t const& printable)
{
printable.print(out);
return out;
}
} // namespace cutlass

View File

@ -1,74 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Umbrella header file for utilities
*/
#include "debug.h"
#include "device_introspection.h"
#include "io_intrinsics.h"
#include "math.h"
#include "nv_std.h"
#include "printable.h"
#include "matrix_transform.h"
namespace cutlass {
/******************************************************************************
* int_constant
******************************************************************************/
/**
* Shorthand for nv_std::integral_constant of int32_t type
*/
template <int V>
struct int_constant : nv_std::integral_constant<int32_t, V>
{};
/******************************************************************************
* Uninitialized
******************************************************************************/
/**
* \brief A storage-backing wrapper that allows types with non-trivial constructors to be aliased in unions
*/
template <typename T>
struct __align__(16) uninitialized
{
/// Backing storage
uint8_t storage[sizeof(T)];
/// Alias
__host__ __device__ __forceinline__ T& alias()
{
return reinterpret_cast<T&>(*this);
}
};
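// Usage sketch (illustrative only; tile_t is a hypothetical type with a
// non-trivial constructor): place it in __shared__ memory without running
// its constructor, then operate on it through alias().
//
//   __shared__ uninitialized<tile_t> smem_tile;
//   tile_t &tile = smem_tile.alias();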
} // namespace cutlass

View File

@ -1,7 +0,0 @@
/bin/
/gemm-GPU.csv
/gemm-REF.csv
/a.csv
/b.csv
/gp100_schmoo/
/ignore/

View File

@ -1,172 +0,0 @@
#/******************************************************************************
# * Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
# *
# * Redistribution and use in source and binary forms, with or without
# * modification, are not permitted.
# *
# * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# *
#******************************************************************************/
#-------------------------------------------------------------------------------
#
# Makefile usage
#
# make <target> sm=<XX[,YY,ZZ,..]> [transpose=<nn*|nt|tn|tt>] [verbose=<0*|1>] [keep=<0*|1>]
#
# * : default
#
#-------------------------------------------------------------------------------
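# Example invocation (illustrative; the sm value depends on the target GPU):
#   make sgemm sm=60 transpose=nt verbose=1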
TEST_DIR := $(dir $(lastword $(MAKEFILE_LIST)))
include ../common.mk
#-------------------------------------------------------------------------------
# Commandline Options
#-------------------------------------------------------------------------------
ifdef transpose
TRANSPOSE := $(transpose)
else
TRANSPOSE := nn
endif
# If defined, GEMMs only compiled with specified alignment restrictions on A and B
# matrices. Otherwise, kernels are compiled for all feasible alignment options, and
# the appropriate kernel is selected.
ifdef alignment
DEFINES += -DGEMM_ALIGNMENT=$(alignment)
endif
# If defined as false, ragged handling can be disabled.
ifdef ragged
DEFINES += -DGEMM_RAGGED=$(ragged)
endif
#-------------------------------------------------------------------------------
# Include and Library paths
#-------------------------------------------------------------------------------
INC += -I$(TEST_DIR)
INC += -I$(BASE_DIR)
LIBS += -lcublas
#-------------------------------------------------------------------------------
# Preprocessor definitions
#-------------------------------------------------------------------------------
ifeq (nt, $(TRANSPOSE))
DEFINES += -DTRANSPOSE_B
else ifeq (tn, $(TRANSPOSE))
DEFINES += -DTRANSPOSE_A
else ifeq (tt, $(TRANSPOSE))
DEFINES += -DTRANSPOSE_A
DEFINES += -DTRANSPOSE_B
endif
NVCCFLAGS += -std=c++11
#-------------------------------------------------------------------------------
# Dependency Lists
#-------------------------------------------------------------------------------
DEPS := $(call rwildcard, $(BASE_DIR),*.h) \
$(call rwildcard, $(BASE_DIR)cgl,*.h) \
$(BASE_DIR)common.mk \
$(TEST_DIR)Makefile
ALL := sgemm \
dgemm \
hgemm \
igemm
#-------------------------------------------------------------------------------
# make default
#-------------------------------------------------------------------------------
default:
#-------------------------------------------------------------------------------
# make clean
#-------------------------------------------------------------------------------
clean :
rm -f bin/*
rm -f *.i* *.cubin *.cu.c *.cudafe* *.fatbin.c *.ptx *.hash *.cu.cpp *.o *.obj* *dlink.* *.res *.fatbin *.module_id
#-------------------------------------------------------------------------------
# make all
#-------------------------------------------------------------------------------
all : $(ALL)
#-------------------------------------------------------------------------------
# make sgemm
#-------------------------------------------------------------------------------
sgemm: bin/sgemm_$(TRANSPOSE)_$(BIN_SUFFIX)
bin/sgemm_$(TRANSPOSE)_$(BIN_SUFFIX) : gemm.cu $(DEPS)
mkdir -p bin
$(NVCC) -DTEST_SGEMM $(DEFINES) $(SM_TARGETS) -o $@ gemm.cu $(NVCCFLAGS) $(CPU_ARCH) $(INC) $(LIBINC) $(LIBS)
#-------------------------------------------------------------------------------
# make dgemm
#-------------------------------------------------------------------------------
dgemm: bin/dgemm_$(TRANSPOSE)_$(BIN_SUFFIX)
bin/dgemm_$(TRANSPOSE)_$(BIN_SUFFIX) : gemm.cu $(DEPS)
mkdir -p bin
$(NVCC) -DTEST_DGEMM $(DEFINES) $(SM_TARGETS) -o $@ gemm.cu $(NVCCFLAGS) $(CPU_ARCH) $(INC) $(LIBINC) $(LIBS)
#-------------------------------------------------------------------------------
# make hgemm
#-------------------------------------------------------------------------------
hgemm: bin/hgemm_$(TRANSPOSE)_$(BIN_SUFFIX)
bin/hgemm_$(TRANSPOSE)_$(BIN_SUFFIX) : gemm.cu $(DEPS)
mkdir -p bin
$(NVCC) -DTEST_HGEMM $(DEFINES) $(SM_TARGETS) -o $@ gemm.cu $(NVCCFLAGS) $(CPU_ARCH) $(INC) $(LIBINC) $(LIBS)
#-------------------------------------------------------------------------------
# make igemm
#-------------------------------------------------------------------------------
igemm: bin/igemm_$(TRANSPOSE)_$(BIN_SUFFIX)
bin/igemm_$(TRANSPOSE)_$(BIN_SUFFIX) : gemm.cu $(DEPS)
mkdir -p bin
$(NVCC) -DTEST_IGEMM $(DEFINES) $(SM_TARGETS) -o $@ gemm.cu $(NVCCFLAGS) $(CPU_ARCH) $(INC) $(LIBINC) $(LIBS)
#-------------------------------------------------------------------------------
# make wgemm
#-------------------------------------------------------------------------------
wgemm: bin/wgemm_$(TRANSPOSE)_$(BIN_SUFFIX)
bin/wgemm_$(TRANSPOSE)_$(BIN_SUFFIX) : gemm.cu $(DEPS)
mkdir -p bin
$(NVCC) -DTEST_WGEMM -DWMMA $(DEFINES) $(SM_TARGETS) -o $@ gemm.cu $(NVCCFLAGS) $(CPU_ARCH) $(INC) $(LIBINC) $(LIBS)

View File

@ -1,292 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* C++ interface for dispatching CUBLAS GEMM calls
*/
#include <cublas_v2.h>
namespace cutlass {
/******************************************************************************
* cuBLAS dispatch entrypoints
******************************************************************************/
/**
* Dispatch cuBLAS igemm
*/
cublasStatus_t cublas_gemm_dispatch(
cublasHandle_t cublas_handle, ///< CUBLAS handle
cublasOperation_t transform_a, ///< Transform op(A) that is non- or (conj.) transpose.
cublasOperation_t transform_b, ///< Transform op(B) that is non- or (conj.) transpose.
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
int32_t alpha, ///< Scalar used for multiplicands
int8_t *d_a, ///< Device pointer to matrix A array values
int8_t *d_b, ///< Device pointer to matrix B array values
int32_t beta, ///< Scalar used for addend
int32_t *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
return cublasGemmEx(
cublas_handle,
transform_a,
transform_b,
m,
n,
k,
(void*) &alpha,
(void*) d_a,
CUDA_R_8I,
(transform_a == CUBLAS_OP_N) ? m : k,
(void*) d_b,
CUDA_R_8I,
(transform_b == CUBLAS_OP_N) ? k : n,
(void*) &beta,
(void*) d_c,
CUDA_R_32I,
m,
CUDA_R_32I,
CUBLAS_GEMM_DFALT);
}
/**
* Dispatch cuBLAS hgemm
*/
cublasStatus_t cublas_gemm_dispatch(
cublasHandle_t cublas_handle, ///< CUBLAS handle
cublasOperation_t transform_a, ///< Transform op(A) that is non- or (conj.) transpose.
cublasOperation_t transform_b, ///< Transform op(B) that is non- or (conj.) transpose.
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
__half alpha, ///< Scalar used for multiplicands
__half *d_a, ///< Device pointer to matrix A array values
__half *d_b, ///< Device pointer to matrix B array values
__half beta, ///< Scalar used for addend
__half *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
return cublasHgemm(
cublas_handle, transform_a, transform_b,
m, n, k,
&alpha,
d_a,
(transform_a == CUBLAS_OP_N) ? m : k,
d_b,
(transform_b == CUBLAS_OP_N) ? k : n,
&beta,
d_c,
m);
}
/**
* Dispatch cuBLAS sgemm
*/
cublasStatus_t cublas_gemm_dispatch(
cublasHandle_t cublas_handle, ///< CUBLAS handle
cublasOperation_t transform_a, ///< Transform op(A) that is non- or (conj.) transpose.
cublasOperation_t transform_b, ///< Transform op(B) that is non- or (conj.) transpose.
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
float alpha, ///< Scalar used for multiplicands
float *d_a, ///< Device pointer to matrix A array values
float *d_b, ///< Device pointer to matrix B array values
float beta, ///< Scalar used for addend
float *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
return cublasSgemm(
cublas_handle, transform_a, transform_b,
m, n, k,
&alpha,
d_a,
(transform_a == CUBLAS_OP_N) ? m : k,
d_b,
(transform_b == CUBLAS_OP_N) ? k : n,
&beta,
d_c,
m);
}
/**
* Dispatch cuBLAS dgemm
*/
cublasStatus_t cublas_gemm_dispatch(
cublasHandle_t cublas_handle, ///< CUBLAS handle
cublasOperation_t transform_a, ///< Transform op(A) that is non- or (conj.) transpose.
cublasOperation_t transform_b, ///< Transform op(B) that is non- or (conj.) transpose.
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
double alpha, ///< Scalar used for multiplicands
double *d_a, ///< Device pointer to matrix A array values
double *d_b, ///< Device pointer to matrix B array values
double beta, ///< Scalar used for addend
double *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
return cublasDgemm(
cublas_handle, transform_a, transform_b,
m, n, k,
&alpha,
d_a, (transform_a == CUBLAS_OP_N) ? m : k,
d_b, (transform_b == CUBLAS_OP_N) ? k : n,
&beta,
d_c, m);
}
/**
* Dispatch cuBLAS Tensor Cores GEMM
*/
cublasStatus_t cublas_gemm_dispatch(
cublasHandle_t cublas_handle, ///< CUBLAS handle
cublasOperation_t transform_a, ///< Transform op(A) that is non- or (conj.) transpose.
cublasOperation_t transform_b, ///< Transform op(B) that is non- or (conj.) transpose.
int m, ///< Height in rows of op(A) and C
int n, ///< Width in columns of op(B) and C
int k, ///< Width in columns of op(A) and height in rows of op(B)
float alpha, ///< Scalar used for multiplicands
half *d_a, ///< Device pointer to matrix A array values
half *d_b, ///< Device pointer to matrix B array values
float beta, ///< Scalar used for addend
float *d_c, ///< Device pointer to matrix C array values
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
return cublasGemmEx(
cublas_handle,
transform_a,
transform_b,
m,
n,
k,
(void*) &alpha,
(void*) d_a,
CUDA_R_16F,
(transform_a == CUBLAS_OP_N) ? m : k,
(void*) d_b,
CUDA_R_16F,
(transform_b == CUBLAS_OP_N) ? k : n,
(void*) &beta,
(void*) d_c,
CUDA_R_32F,
m,
CUDA_R_32F,
CUBLAS_GEMM_DFALT_TENSOR_OP);
}
/**
* Uses cuBLAS to compute gemm on device matrices (unspecialized)
*/
template <
gemm::tiling_strategy::kind_t _TilingStrategy, ///< Tile-sizing classification category
math_operation_class_t _math_op,
matrix_transform_t::kind_t _TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t _TransformB, ///< Transformation op for matrix B
typename _value, ///< Multiplicand value type (matrices A and B)
typename _accum ///< Accumulator value type (matrix C and scalars)
>
struct cublas_gemm
{
//
// Type alias definitions
//
static const gemm::tiling_strategy::kind_t TilingStrategy = _TilingStrategy;
static const math_operation_class_t math_op = _math_op;
static const matrix_transform_t::kind_t TransformA = _TransformA;
static const matrix_transform_t::kind_t TransformB = _TransformB;
using value_t = _value;
using accum_t = _accum;
/// Launches a GEMM
gemm::launch_configuration operator()(
cublasHandle_t cublas_handle, ///< CUBLAS handle
int m,
int n,
int k,
value_t *A, ///< A matrix
value_t *B, ///< B matrix
accum_t *C, ///< C matrix
accum_t alpha, ///< Scalar used for multiplicands
accum_t beta, ///< Scalar used for addend
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within. Default is stream<sub>0</sub>.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream after every kernel launch to check for errors.
{
cublasStatus_t cublas_error = cublas_gemm_dispatch(
cublas_handle,
(cublasOperation_t) TransformA,
(cublasOperation_t) TransformB,
m,
n,
k,
alpha,
A,
B,
beta,
C,
stream,
debug_synchronous);
cudaError_t error;
if (cublas_error != CUBLAS_STATUS_SUCCESS)
{
if (cublas_error == CUBLAS_STATUS_NOT_SUPPORTED) {
return gemm::launch_configuration(cudaErrorInvalidValue);
}
error = cudaGetLastError();
if (error == cudaSuccess) {
return gemm::launch_configuration(cudaErrorUnknown);
}
return gemm::launch_configuration(error);
}
// Check for failure to launch
if (CUDA_PERROR_DEBUG(error = cudaPeekAtLastError()))
return gemm::launch_configuration(error);
// Sync the stream if specified to flush runtime errors
if (debug_synchronous && (CUDA_PERROR_DEBUG(error = cudaStreamSynchronize(stream))))
return gemm::launch_configuration(error);
return gemm::launch_configuration(error);
}
};
} // namespace cutlass

View File

@ -1,253 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file Dispatch routines for CUTLASS GEMM kernels
*/
// CUDA includes
#include <cublas_v2.h>
// Cutlass GEMM API
#include <cutlass/util/util.h>
#include <cutlass/gemm/dispatch.h>
#include <cutlass/gemm/epilogue_function.h>
// Test utilities
#include "util/type_conversion.h"
namespace cutlass {
/******************************************************************************
* Cutlass dispatch entrypoints
******************************************************************************/
//
// Compile-time overrides for alignment and ragged handling.
//
// If zero, all feasible alignment options are supported.
#ifndef GEMM_ALIGNMENT
#define GEMM_ALIGNMENT 0
#endif
// If true, kernels are compiled with ragged handling enabled.
#ifndef GEMM_RAGGED
#define GEMM_RAGGED true
#endif
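// For example (illustrative), building with -DGEMM_ALIGNMENT=8 -DGEMM_RAGGED=false
// restricts kernel instantiation to 8-byte-aligned operands and disables ragged-edge
// handling; the Makefile's alignment= and ragged= options map onto these defines.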
//
// Dispatch logic given problem size specialization, math operation class, layout
// and type of operands, and epilogue operation.
//
/**
* Cutlass GEMM dispatch
*/
template <
gemm::tiling_strategy::kind_t _TilingStrategy, ///< Tile-sizing classification category
math_operation_class_t _math_op, ///< Indicates the class of math operation to use
matrix_transform_t::kind_t _TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t _TransformB, ///< Transformation op for matrix B
typename _value, ///< Multiplicand value type (matrices A and B)
typename _accum, ///< Accumulator value type (matrix C and scalars)
typename _epilogue_op_t ///< Epilogue operation to update matrix C
= gemm::blas_scaled_epilogue<_accum, _accum, _accum>
>
struct cutlass_gemm_dispatch
{
//
// Type alias definitions
//
static const gemm::tiling_strategy::kind_t TilingStrategy = _TilingStrategy;
static const math_operation_class_t math_op = _math_op;
static const matrix_transform_t::kind_t TransformA = _TransformA;
static const matrix_transform_t::kind_t TransformB = _TransformB;
using value_t = _value;
using accum_t = _accum;
using epilogue_op_t = _epilogue_op_t;
//
// Methods
//
/// Returns leading dimension for A matrix operand
int leading_dim_a(int m, int k) const
{
return (TransformA == matrix_transform_t::NonTranspose ? m : k);
}
/// Returns leading dimension for B matrix operand
int leading_dim_b(int k, int n) const
{
return (TransformB == matrix_transform_t::NonTranspose ? k : n);
}
/// Launches a GEMM
template <int operand_alignment, int accumulator_alignment>
gemm::launch_configuration launch(
int m,
int n,
int k,
epilogue_op_t epilogue_op,
value_t *A,
value_t *B,
accum_t *C,
cudaStream_t stream = 0,
bool debug_synchronous = false)
{
return gemm::device_gemm<
TilingStrategy,
math_op,
TransformA,
operand_alignment,
TransformB,
operand_alignment,
value_t,
accum_t,
epilogue_op_t,
accumulator_alignment>
(
m,
n,
k,
epilogue_op,
A,
B,
C,
stream,
debug_synchronous);
}
/// Dispatches a CUTLASS GEMM
gemm::launch_configuration operator()(
cublasHandle_t handle, ///< CUBLAS handle
int m, ///< Rows of GEMM problem
int n, ///< Columns of GEMM problem
int k, ///< Inner dimension of GEMM problem
value_t *A, ///< A matrix
value_t *B, ///< B matrix
accum_t *C, ///< C matrix
accum_t alpha, ///< Scalar used for multiplicands
accum_t beta, ///< Scalar used for addend
cudaStream_t stream = 0, ///< CUDA stream to launch kernels within.
bool debug_synchronous = false) ///< Whether or not to synchronize the stream
/// after every kernel launch to check for errors.
{
// Forces kernel selection to choose specific alignment (in bytes)
int const force_operand_alignment = GEMM_ALIGNMENT;
// Problem size must be multiple of the smallest vector load size
typedef value_t operand_load_t;
int const accumulator_alignment = sizeof(accum_t);
int const lda = leading_dim_a(m, k);
int const ldb = leading_dim_b(k, n);
epilogue_op_t epilogue(alpha, beta);
// TODO: opportunity for metaprogramming loop
// Prefer the largest granularity of vector load that is compatible with
// problem size and data alignment.
if ((!force_operand_alignment || force_operand_alignment == 16) &&
!((sizeof(operand_load_t) * lda) % 16) &&
!((sizeof(operand_load_t) * ldb) % 16))
{
#if !(GEMM_ALIGNMENT) || (GEMM_ALIGNMENT == 16)
return launch<__NV_STD_MAX(16, sizeof(value_t)), accumulator_alignment>(
m,
n,
k,
epilogue,
A,
B,
C,
stream,
debug_synchronous);
#endif
}
else if ((!force_operand_alignment || force_operand_alignment == 8) &&
!((sizeof(operand_load_t) * lda) % 8) &&
!((sizeof(operand_load_t) * ldb) % 8))
{
#if !(GEMM_ALIGNMENT) || (GEMM_ALIGNMENT == 8)
return launch<__NV_STD_MAX(8, sizeof(value_t)), accumulator_alignment>(
m,
n,
k,
epilogue,
A,
B,
C,
stream,
debug_synchronous);
#endif
}
else if ((!force_operand_alignment || force_operand_alignment == 4) &&
!((sizeof(operand_load_t) * lda) % 4) &&
!((sizeof(operand_load_t) * ldb) % 4))
{
#if !(GEMM_ALIGNMENT) || (GEMM_ALIGNMENT == 4)
return launch<__NV_STD_MAX(4, sizeof(value_t)), accumulator_alignment>(
m,
n,
k,
epilogue,
A,
B,
C,
stream,
debug_synchronous);
#endif
}
else if ((!force_operand_alignment || force_operand_alignment == 2) &&
!((sizeof(operand_load_t) * lda) % 2) &&
!((sizeof(operand_load_t) * ldb) % 2))
{
// 2-byte alignment only supported for half-precision multiplicands (HGEMM / WGEMM)
#if defined(TEST_HGEMM) || defined(TEST_WGEMM)
#if !(GEMM_ALIGNMENT) || (GEMM_ALIGNMENT == 2)
return launch<__NV_STD_MAX(2, sizeof(value_t)), accumulator_alignment>(
m,
n,
k,
epilogue,
A,
B,
C,
stream,
debug_synchronous);
#endif
#endif
}
return gemm::launch_configuration(cudaErrorInvalidValue);
}
};
} // namespace cutlass
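
A minimal sketch of how this dispatch functor can be invoked from host code, assuming the surrounding test driver's headers (the wrapper function name is illustrative):

// Hypothetical usage sketch: dispatch a single-precision GEMM (NN layout) with the
// Large tiling strategy. Assumes d_A, d_B, d_C are device pointers of the proper
// extents and that the cuBLAS handle has already been created.
#include <cublas_v2.h>
#include "cutlass_dispatch.h"

cutlass::gemm::launch_configuration run_sgemm_nn(
    cublasHandle_t handle, int m, int n, int k,
    float *d_A, float *d_B, float *d_C, float alpha, float beta)
{
    using namespace cutlass;
    cutlass_gemm_dispatch<
        gemm::tiling_strategy::Large,        // tile-sizing classification
        math_operation_class_t::scalar,      // scalar (non-TensorCore) math
        matrix_transform_t::NonTranspose,    // op(A)
        matrix_transform_t::NonTranspose,    // op(B)
        float,                               // value_t
        float>                               // accum_t
        dispatch;
    // operator() picks the widest feasible operand alignment and launches the kernel
    return dispatch(handle, m, n, k, d_A, d_B, d_C, alpha, beta);
}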


@ -1,564 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
/**
* \file gemm.cu
* GEMM test driver
*
*/
#include <iostream>
#include <typeinfo>
#include <random>
#include <stdint.h>
// CUBLAS GEMM API
#include <cublas_v2.h>
// Set Cutlass debug macro to enable console printing of library errors
#define DEBUG
#if defined(WMMA)
// Conditionally include WMMA headers (CUDA 9 Preview Feature)
#include <mma.h>
#endif
// Cutlass GEMM API
#include <cutlass/util/util.h>
#include <cutlass/gemm/dispatch.h>
#include <cutlass/gemm/epilogue_function.h>
// Test utilities
#include "util/command_line.h"
#include "util/half.h"
#include "util/matrix.h"
#include "util/timer.h"
#include "util/type_conversion.h"
// Dispatch routines to CUBLAS and CUTLASS
#include "cublas_dispatch.h"
#include "cutlass_dispatch.h"
/******************************************************************************
* Globals, constants and typedefs
******************************************************************************/
using namespace cutlass;
/// CUBLAS handle
cublasHandle_t g_cublas_handle;
/// The device-id of the current device
int g_device_id = -1;
/// The number of timing iterations to invoke
int g_timing_iterations = -1;
/// The number of randomly-sized problems to schmoo
int g_schmoo = 0;
/******************************************************************************
* Number generation
******************************************************************************/
/**
* Simple low-integer generator
*/
struct simple_gen
{
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution;
/// Constructor
simple_gen(int max) : distribution(max * -1, max)
{}
/// Functor
int operator()()
{
return distribution(generator);
}
};
/******************************************************************************
* Test execution
******************************************************************************/
/**
* Compute C = (alpha * A * B) + (beta * C)
*/
template <
typename test_func_t, ///< Test function type
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t> ///< Accumulator value type (matrix C and scalars)
bool test(
int m, ///< Height of C in rows
int n, ///< Width of C in columns
int k, ///< Width (height) of A (B)
accum_t alpha, ///< Multiplicand scalar
accum_t beta) ///< Addend scalar
{
cudaStream_t stream = 0;
//
// Initialize matrices
//
matrix<value_t> A(
(TransformA == matrix_transform_t::NonTranspose) ? m : k,
(TransformA == matrix_transform_t::NonTranspose) ? k : m);
matrix<value_t> B(
(TransformB == matrix_transform_t::NonTranspose) ? k : n,
(TransformB == matrix_transform_t::NonTranspose) ? n : k);
matrix<accum_t> C(m, n);
// Initialize matrices with small values precisely representable as integers
simple_gen a_gen(3);
simple_gen b_gen(5);
A.fill_random(a_gen);
B.fill_random(b_gen);
C.fill_ramp(0,0);
// // Alternatively, initialize with procedural values to simplify debugging incorrect results
// A.fill_ramp(1,2);
// B.fill_ramp(1,1);
// Sync to device
A.sync_device();
B.sync_device();
C.sync_device();
CUDA_PERROR(cudaPeekAtLastError());
CUDA_PERROR(cudaDeviceSynchronize());
//
// Run test once with debug-synchronous enabled and check result
//
if (!g_schmoo) printf("\n");
test_func_t test_func;
C.fill_ramp(0, 0);
C.sync_device();
cudaError_t error = test_func(
g_cublas_handle,
m,
n,
k,
A.d_data(),
B.d_data(),
C.d_data(),
alpha,
beta,
stream,
!g_schmoo).result;
bool not_applicable = (error == cudaErrorInvalidValue);
bool is_failed = false;
if (not_applicable)
{
printf(", NA");
}
else
{
CUDA_PERROR(error);
// Compute reference check if it won't take too long on the CPU
if ((!g_schmoo) && (m * n <= 1024 * 1024))
{
matrix<accum_t> ref_C(m, n);
ref_C.fill_ramp(0, 0);
ref_C.gemm(TransformA, TransformB, alpha, A, B, beta);
C.sync_host();
is_failed = (C != ref_C);
if (!g_schmoo)
{
if (is_failed)
{
printf("FAIL, ");
std::ofstream file_a("a.csv");
A.write_matrix(file_a);
std::ofstream file_b("b.csv");
B.write_matrix(file_b);
std::ofstream file_d("gemm-REF.csv");
ref_C.write_matrix(file_d);
std::ofstream file_c("gemm-GPU.csv");
C.write_matrix(file_c);
}
else
{
printf("PASS, ");
}
}
}
fflush(stdout);
//
// Warmup and timing iterations
//
if (g_timing_iterations > 0)
{
// Warmup for 1/100 of the timing iterations (minimum of 2)
for (int i = 0; i < __NV_STD_MAX(2, (g_timing_iterations + 99) / 100); ++i)
{
CUDA_PERROR(test_func(
g_cublas_handle,
m,
n,
k,
A.d_data(),
B.d_data(),
C.d_data(),
alpha,
beta,
stream,
false).result);
}
}
// Conduct timing iterations
double elapsed_ms = 0;
gpu_timer timer;
timer.start();
for (int i = 0; i < g_timing_iterations; i++)
{
CUDA_PERROR(test_func(
g_cublas_handle,
m,
n,
k,
A.d_data(),
B.d_data(),
C.d_data(),
alpha,
beta,
stream,
false).result);
}
timer.stop();
elapsed_ms += timer.elapsed_millis();
double avg_ms = elapsed_ms / g_timing_iterations;
// Display performance
if (g_timing_iterations > 0)
{
int64_t num_flops = (2 * int64_t(m) * int64_t(n) * int64_t(k)) + (2 * int64_t(m) * int64_t(n));
double gflops_per_sec = double(num_flops) / avg_ms / 1.0e6;
if (g_schmoo)
{
if (is_failed)
printf("F");
printf(", %.3f", gflops_per_sec);
// Sleep for a few milliseconds to cool
sleep_millis(10);
}
else
{
printf("Avg runtime: %.3f ms, total flops: %lld, GFLOP/s: %.2f\n",
avg_ms,
num_flops,
gflops_per_sec);
}
fflush(stdout);
}
}
return is_failed;
}
/**
* Compute C = (alpha * A * B) + (beta * C)
*/
template <
math_operation_class_t math_op,
matrix_transform_t::kind_t TransformA, ///< Transformation op for matrix A
matrix_transform_t::kind_t TransformB, ///< Transformation op for matrix B
typename value_t, ///< Multiplicand value type (matrices A and B)
typename accum_t> ///< Accumulator value type (matrix C and scalars)
bool test(
int m, ///< Height of C in rows
int n, ///< Width of C in columns
int k, ///< Width (height) of A (B)
accum_t alpha, ///< Multiplicand scalar
accum_t beta) ///< Addend scalar
{
uint64_t flop_base = 1ull << 41;
int max_timing_iterations = 10000;
int min_timing_iterations = 10;
bool test_error = false;
// Scale the number of timing iterations with respect to problem size (if not specified on commandline)
if ((g_timing_iterations < 0) || g_schmoo)
{
uint64_t num_flops = (2 * uint64_t(m) * uint64_t(n) * uint64_t(k)) + (2 * uint64_t(m) * uint64_t(n));
g_timing_iterations = (int) ((flop_base / sizeof(value_t)) / num_flops);
g_timing_iterations = (int) __NV_STD_MIN(max_timing_iterations, g_timing_iterations);
g_timing_iterations = (int) __NV_STD_MAX(min_timing_iterations, g_timing_iterations);
}
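// Worked example (illustrative): for SGEMM with m = n = k = 4096,
// num_flops = 2*4096^3 + 2*4096^2 ~= 1.37e11, while
// flop_base / sizeof(float) = 2^41 / 4 ~= 5.5e11, so the quotient is roughly 4
// iterations; the clamps above then raise this to min_timing_iterations (10).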
if (g_schmoo)
{
printf("%d, %d, %d, %c%c, %d, %d",
m, n, k,
(TransformA == matrix_transform_t::NonTranspose) ? 'n' : 't',
(TransformB == matrix_transform_t::NonTranspose) ? 'n' : 't',
m * n,
g_timing_iterations);
}
else
{
printf("\n------------------------------------------------------------\n");
printf("%dx%dx%d, GEMM_%c%c, %d C elements, %d timing iterations\n",
m, n, k,
(TransformA == matrix_transform_t::NonTranspose) ? 'n' : 't',
(TransformB == matrix_transform_t::NonTranspose) ? 'n' : 't',
m * n,
g_timing_iterations);
}
fflush(stdout);
// CUBLAS
test_error |= test<
cublas_gemm<gemm::tiling_strategy::Unknown, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
// CUTLASS
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Small, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Medium, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Large, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Tall, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Wide, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
test_error |= test<
cutlass_gemm_dispatch<gemm::tiling_strategy::Huge, math_op, TransformA, TransformB, value_t, accum_t>,
TransformA,
TransformB,
value_t,
accum_t>(m, n, k, accum_t(alpha), accum_t(beta));
return test_error;
}
/******************************************************************************
* Main
******************************************************************************/
/**
* Main
*/
int main(int argc, const char **argv)
{
//
// Problem type (compiler-supplied so we don't compile everything)
//
// Define value_t and accum_t (multiplicand and accumulator types, respectively)
#if defined(TEST_SGEMM)
typedef float value_t;
typedef float accum_t;
const math_operation_class_t math_op = math_operation_class_t::scalar;
#elif defined(TEST_DGEMM)
typedef double value_t;
typedef double accum_t;
const math_operation_class_t math_op = math_operation_class_t::scalar;
#elif defined(TEST_HGEMM)
typedef __half value_t;
typedef __half accum_t;
const math_operation_class_t math_op = math_operation_class_t::scalar;
#elif defined(TEST_IGEMM)
typedef int8_t value_t;
typedef int32_t accum_t;
const math_operation_class_t math_op = math_operation_class_t::scalar;
#elif defined(TEST_WGEMM)
typedef half value_t;
typedef float accum_t;
const math_operation_class_t math_op = math_operation_class_t::matrix;
#else
#error Unknown GEMM type requested.
#endif
// Define transpose constants
#ifdef TRANSPOSE_A
static const matrix_transform_t::kind_t TransformA = matrix_transform_t::Transpose;
#else
static const matrix_transform_t::kind_t TransformA = matrix_transform_t::NonTranspose;
#endif
#ifdef TRANSPOSE_B
static const matrix_transform_t::kind_t TransformB = matrix_transform_t::Transpose;
#else
static const matrix_transform_t::kind_t TransformB = matrix_transform_t::NonTranspose;
#endif
//
// Commandline parsing
//
// Initialize command line
command_line args(argc, argv);
int m_factor = args.device_prop.multiProcessorCount * 128;
int m = round_nearest(4096, m_factor);
int k = 4096;
int n = 4096;
float alpha = 1.0;
float beta = 0.0;
g_device_id = args.device_id;
args.get_cmd_line_argument("m", m);
args.get_cmd_line_argument("n", n);
args.get_cmd_line_argument("k", k);
args.get_cmd_line_argument("i", g_timing_iterations);
args.get_cmd_line_argument("alpha", alpha);
args.get_cmd_line_argument("beta", beta);
args.get_cmd_line_argument("schmoo", g_schmoo);
// Print usage
if (args.check_cmd_line_flag("help"))
{
printf("%s "
"[--help] "
"[--i=<timing iterations>] "
"[--device=<device-id>] "
"[--alpha=<alpha> --beta=<beta>] "
"[--schmoo=<samples> || --m=<height> --n=<width> --k=<depth>]"
"\n", argv[0]);
exit(0);
}
// Initialize cuBLAS
if (cublasCreate(&g_cublas_handle) != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "cublasCreate() failed\n");
exit(1);
}
bool test_error = false;
if (g_schmoo)
{
// Run a schmoo of problem sizes
printf("M, N, K, transpose, total_flops, timing_iterations, sol_flop/s, cublas_sol, cutlass_small_sol, cutlass_med_sol, cutlass_large_sol, cutlass_tall_sol, cutlass_wide_sol, cutlass_huge_sol\n");
// Generate problem dimensions log-uniformly distributed over [32, 16384)
std::mt19937 gen(0);
std::uniform_real_distribution<float> dis(5, 14);
for (int i = 0; i < g_schmoo; ++i)
{
int m = int(pow(float(2), dis(gen)));
int n = int(pow(float(2), dis(gen)));
int k = int(pow(float(2), dis(gen)));
// Round m and n to nearest multiple of 32 if < 128, otherwise to the nearest 128
m = (m < 128) ?
round_nearest(m, 32) :
round_nearest(m, 128);
n = (n < 128) ?
round_nearest(n, 32) :
round_nearest(n, 128);
// Round k to the nearest 32 for 8-bit value types, otherwise to the nearest 16
k = (sizeof(value_t) == 1) ?
round_nearest(k, 32) :
round_nearest(k, 16);
test_error |= test<math_op, TransformA, TransformB, value_t, accum_t>(
m, n, k,
from_float<accum_t>(alpha),
from_float<accum_t>(beta));
printf("\n"); fflush(stdout);
}
}
else
{
// Test a single GEMM problem size
test_error |= test<math_op, TransformA, TransformB, value_t, accum_t>(
m,
n,
k,
from_float<accum_t>(alpha),
from_float<accum_t>(beta));
}
// Cleanup
cublasDestroy(g_cublas_handle);
return test_error;
}


@ -1,312 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Utility for parsing command line arguments
*/
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
#include <limits>
#include <cuda_runtime.h>
#include <cutlass/util/debug.h>
namespace cutlass {
/******************************************************************************
* command_line
******************************************************************************/
/**
* Utility for parsing command line arguments
*/
struct command_line
{
std::vector<std::string> keys;
std::vector<std::string> values;
std::vector<std::string> args;
int device_id;
cudaDeviceProp device_prop;
float device_giga_bandwidth;
size_t device_free_physmem;
size_t device_total_physmem;
/**
* Constructor
*/
command_line(int argc, const char **argv, int device_id = -1) :
keys(10),
values(10),
device_id(device_id)
{
using namespace std;
for (int i = 1; i < argc; i++)
{
string arg = argv[i];
if ((arg[0] != '-') || (arg[1] != '-'))
{
args.push_back(arg);
continue;
}
string::size_type pos;
string key, val;
if ((pos = arg.find('=')) == string::npos) {
key = string(arg, 2, arg.length() - 2);
val = "";
} else {
key = string(arg, 2, pos - 2);
val = string(arg, pos + 1, arg.length() - 1);
}
keys.push_back(key);
values.push_back(val);
}
// Initialize device
CUDA_PERROR_EXIT(device_init());
}
/**
* Checks whether a flag "--<flag>" is present in the commandline
*/
bool check_cmd_line_flag(const char* arg_name)
{
using namespace std;
for (int i = 0; i < int(keys.size()); ++i)
{
if (keys[i] == string(arg_name))
return true;
}
return false;
}
/**
* Returns number of naked (non-flag and non-key-value) commandline parameters
*/
template <typename value_t>
int num_naked_args()
{
return args.size();
}
/**
* Returns the commandline parameter for a given index (not including flags)
*/
template <typename value_t>
void get_cmd_line_argument(int index, value_t &val)
{
using namespace std;
if (index < args.size()) {
istringstream str_stream(args[index]);
str_stream >> val;
}
}
/**
* Returns the value specified for a given commandline parameter --<flag>=<value>
*/
template <typename value_t>
void get_cmd_line_argument(const char *arg_name, value_t &val)
{
using namespace std;
for (int i = 0; i < int(keys.size()); ++i)
{
if (keys[i] == string(arg_name))
{
istringstream str_stream(values[i]);
str_stream >> val;
}
}
}
/**
* Returns the values specified for a given commandline parameter --<flag>=<value>,<value>*
*/
template <typename value_t>
void get_cmd_line_arguments(
const char *arg_name,
std::vector<value_t> &vals,
char sep = ',')
{
using namespace std;
if (check_cmd_line_flag(arg_name))
{
// Clear any default values
vals.clear();
// Recover from multi-value string
for (int i = 0; i < keys.size(); ++i)
{
if (keys[i] == string(arg_name))
{
string val_string(values[i]);
istringstream str_stream(val_string);
string::size_type old_pos = 0;
string::size_type new_pos = 0;
// Iterate <sep>-delimited values
value_t val;
while ((new_pos = val_string.find(sep, old_pos)) != string::npos)
{
if (new_pos != old_pos)
{
str_stream.width(new_pos - old_pos);
str_stream >> val;
vals.push_back(val);
}
// skip over delimiter
str_stream.ignore(1);
old_pos = new_pos + 1;
}
// Read last value
str_stream >> val;
vals.push_back(val);
}
}
}
}
/**
* The number of pairs parsed
*/
int parsed_argc()
{
return (int) keys.size();
}
/**
* Initialize device
*/
cudaError_t device_init()
{
cudaError_t error = cudaSuccess;
do
{
int deviceCount;
if (CUDA_PERROR(error = cudaGetDeviceCount(&deviceCount))) break;
if (deviceCount == 0) {
fprintf(stderr, "No devices supporting CUDA.\n");
exit(1);
}
if (device_id < 0)
{
get_cmd_line_argument("device", device_id);
}
if ((device_id > deviceCount - 1) || (device_id < 0))
{
device_id = 0;
}
if (CUDA_PERROR(error = cudaSetDevice(device_id))) break;
if (CUDA_PERROR(error = cudaMemGetInfo(&device_free_physmem, &device_total_physmem))) break;
if (CUDA_PERROR(error = cudaGetDeviceProperties(&device_prop, device_id))) break;
if (device_prop.major < 1) {
fprintf(stderr, "Device does not support CUDA.\n");
exit(1);
}
device_giga_bandwidth = float(device_prop.memoryBusWidth) * device_prop.memoryClockRate * 2 / 8 / 1000 / 1000;
} while (0);
return error;
}
//-------------------------------------------------------------------------
// Utility functions
//-------------------------------------------------------------------------
/// Tokenizes a comma-delimited list of string pairs delimited by ':'
static void tokenize(
std::vector<std::pair<std::string, std::string> > &tokens,
std::string const &str,
char delim = ',',
char sep = ':')
{
// Home-built to avoid Boost dependency
size_t s_idx = 0;
size_t d_idx = std::string::npos;
while (s_idx < str.size())
{
d_idx = str.find_first_of(delim, s_idx);
size_t end_idx = (d_idx != std::string::npos ? d_idx : str.size());
size_t sep_idx = str.find_first_of(sep, s_idx);
size_t offset = 1;
if (sep_idx == std::string::npos || sep_idx >= end_idx)
{
sep_idx = end_idx;
offset = 0;
}
std::pair<std::string, std::string> item(
str.substr(s_idx, sep_idx - s_idx),
str.substr(sep_idx + offset, end_idx - sep_idx - offset));
tokens.push_back(item);
s_idx = end_idx + 1;
}
}
/// Tokenizes a comma-delimited list of string pairs delimited by ':'
static void tokenize(
std::vector<std::string > &tokens,
std::string const &str,
char delim = ',',
char sep = ':')
{
std::vector<std::pair<std::string, std::string> > token_pairs;
tokenize(token_pairs, str, delim, sep);
for (auto const &tok : token_pairs)
{
tokens.push_back(tok.first);
}
}
};
} // namespace cutlass
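
A minimal usage sketch of this parser in a standalone program (hypothetical; the flag names are illustrative):

// Hypothetical example: parse "--m=1024 --n=512 --verbose" style arguments.
#include <cstdio>
#include "util/command_line.h"

int main(int argc, const char **argv)
{
    cutlass::command_line args(argc, argv);   // constructor also selects and initializes a CUDA device
    int m = 128, n = 128;                     // defaults used when the flags are absent
    args.get_cmd_line_argument("m", m);
    args.get_cmd_line_argument("n", n);
    bool verbose = args.check_cmd_line_flag("verbose");
    printf("m=%d n=%d verbose=%d\n", m, n, int(verbose));
    return 0;
}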


@ -1,224 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Utilities for interacting with the opaque CUDA __half type
*/
#include <stdint.h>
#include <cuda_fp16.h>
#include <iosfwd>
namespace cutlass {
/******************************************************************************
* half_t
******************************************************************************/
/**
* Host-based fp16 data type compatible and convertible with __half
*/
struct half_t
{
uint16_t __x;
/// Constructor from __half
half_t(const __half &other)
{
__x = reinterpret_cast<const uint16_t&>(other);
}
/// Constructor from integer
half_t(int a)
{
*this = half_t(float(a));
}
/// Constructor from float
half_t(float a)
{
uint32_t ia = *reinterpret_cast<uint32_t*>(&a);
uint16_t ir;
ir = (ia >> 16) & 0x8000;
if ((ia & 0x7f800000) == 0x7f800000)
{
if ((ia & 0x7fffffff) == 0x7f800000)
{
ir |= 0x7c00; /* infinity */
}
else
{
ir = 0x7fff; /* canonical NaN */
}
}
else if ((ia & 0x7f800000) >= 0x33000000)
{
int32_t shift = (int32_t) ((ia >> 23) & 0xff) - 127;
if (shift > 15)
{
ir |= 0x7c00; /* infinity */
}
else
{
ia = (ia & 0x007fffff) | 0x00800000; /* extract mantissa */
if (shift < -14)
{ /* denormal */
ir |= ia >> (-1 - shift);
ia = ia << (32 - (-1 - shift));
}
else
{ /* normal */
ir |= ia >> (24 - 11);
ia = ia << (32 - (24 - 11));
ir = ir + ((14 + shift) << 10);
}
/* IEEE-754 round to nearest even */
if ((ia > 0x80000000) || ((ia == 0x80000000) && (ir & 1)))
{
ir++;
}
}
}
this->__x = ir;
}
/// Cast to __half
operator __half() const
{
return reinterpret_cast<const __half&>(__x);
}
/// Cast to float
operator float() const
{
int sign = ((this->__x >> 15) & 1);
int exp = ((this->__x >> 10) & 0x1f);
int mantissa = (this->__x & 0x3ff);
uint32_t f = 0;
if (exp > 0 && exp < 31)
{
// normal
exp += 112;
f = (sign << 31) | (exp << 23) | (mantissa << 13);
}
else if (exp == 0)
{
if (mantissa)
{
// subnormal
exp += 113;
while ((mantissa & (1 << 10)) == 0)
{
mantissa <<= 1;
exp--;
}
mantissa &= 0x3ff;
f = (sign << 31) | (exp << 23) | (mantissa << 13);
}
else
{
// zero
f = 0;
}
}
else if (exp == 31)
{
if (mantissa)
{
f = 0x7fffffff; // not a number
}
else
{
f = (0xff << 23) | (sign << 31); // inf
}
}
return *reinterpret_cast<float const *>(&f);
}
/// Get raw storage
uint16_t raw()
{
return this->__x;
}
/// Equality comparison
bool operator ==(const half_t &other)
{
return (this->__x == other.__x);
}
/// Add-assign
half_t& operator +=(const half_t &rhs)
{
*this = half_t(float(*this) + float(rhs));
return *this;
}
/// Subtract-assign
half_t& operator -=(const half_t &rhs)
{
*this = half_t(float(*this) - float(rhs));
return *this;
}
/// Multiply
half_t operator*(const half_t &other)
{
return half_t(float(*this) * float(other));
}
/// Add
half_t operator+(const half_t &other)
{
return half_t(float(*this) + float(other));
}
};
/******************************************************************************
* I/O stream overloads
******************************************************************************/
/// Insert formatted \p half_t into the output stream
std::ostream& operator<<(std::ostream &out, const half_t &x)
{
out << (float)x;
return out;
}
/// Insert formatted \p __half into the output stream
std::ostream& operator<<(std::ostream &out, const __half &x)
{
return out << half_t(x);
}
} // namespace cutlass
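
A small sketch of the host-side conversions this type provides (assuming the header is available as util/half.h, as in the test driver above):

// Hypothetical example: round-trip a float through half_t and stream both forms.
#include <iostream>
#include "util/half.h"

int main()
{
    cutlass::half_t h(3.14159f);   // float -> fp16 with round-to-nearest-even
    float back = float(h);         // fp16 -> float
    __half device_repr = h;        // implicit cast to the opaque CUDA __half type
    std::cout << "half_t: " << h << ", float: " << back
              << ", __half: " << device_repr << std::endl;
    return 0;
}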


@ -1,495 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* Matrix data structure providing basic CPU-based algorithms and
* operations that can be cloned and synchronized in GPU device memory
*/
#include <vector>
#include <fstream>
#include <cutlass/util/debug.h>
#include "../cutlass/util/matrix_transform.h"
#include "half.h"
namespace cutlass {
/**
* \brief Matrix data structure providing basic CPU-based algorithms and
* operations that can be synchronized with a GPU-based replica
*/
template <typename value_t>
struct matrix
{
// Host value type (must be convertible to/from value_t)
typedef typename nv_std::conditional<
(nv_std::is_same<value_t, __half>::value), // If (value_t == __half) ...
half_t, // ... use half_t internally for host storage, else...
value_t>::type // ... use value_t directly
host_value_t;
//-----------------------------------------------------------------------------
// Data members
//-----------------------------------------------------------------------------
private:
/// M dimension (height in rows)
int _m;
/// N dimension (width in columns)
int _n;
/// Data array on host
std::vector<host_value_t> _h_data;
/// Clone of data array on GPU device
value_t *_d_data;
/// GPU Device identifier that clone synchronizes with
int _device_id;
public:
//-----------------------------------------------------------------------------
// Lifetime and synchronization
//-----------------------------------------------------------------------------
/**
* Constructor: zero-initializes the matrix.
*/
matrix(
int m, ///< Height of the matrix in rows
int n) ///< Width of the matrix in columns
:
_m(m),
_n(n),
_d_data(NULL),
_device_id(0)
{
_h_data.resize(_m * _n, 0);
CUDA_PERROR_EXIT(cudaMalloc((void ** )&_d_data, sizeof(value_t) * _m * _n));
CUDA_PERROR_EXIT(cudaGetDevice(&_device_id));
}
/// Destructor
~matrix()
{
if (_d_data)
{
CUDA_PERROR_EXIT(cudaFree(_d_data));
}
}
/**
* Synchronize the GPU-based replica with the current host-based matrix data
*/
void sync_device()
{
size_t bytes = _m * _n * sizeof(value_t);
CUDA_PERROR_EXIT(cudaMemcpy(_d_data, &_h_data[0], bytes, cudaMemcpyHostToDevice));
}
/**
* Synchronize the host-based replica with the current GPU-based matrix data
*/
void sync_host()
{
size_t bytes = _m * _n * sizeof(value_t);
CUDA_PERROR_EXIT(cudaMemcpy(&_h_data[0], _d_data, bytes, cudaMemcpyDeviceToHost));
}
//-----------------------------------------------------------------------------
// Inspectors
//-----------------------------------------------------------------------------
/**
* Return the height of the matrix, subject to the optional \p transpose_op
*/
int height(matrix_transform_t transpose_op = matrix_transform_t::NonTranspose) const
{
switch (transpose_op)
{
case matrix_transform_t::NonTranspose : return _m;
case matrix_transform_t::Transpose : return _n;
default: return -1;
}
}
/**
* Return the width of the matrix, subject to the optional \p transpose_op
*/
int width(matrix_transform_t transpose_op = matrix_transform_t::NonTranspose) const
{
switch (transpose_op)
{
case matrix_transform_t::NonTranspose : return _n;
case matrix_transform_t::Transpose : return _m;
default: return -1;
}
}
/**
* Return item at (x, y) coordinate of matrix, subject to the optional \p transpose_op
*/
host_value_t get(
int x,
int y,
matrix_transform_t transpose_op = matrix_transform_t::NonTranspose) const
{
switch (transpose_op)
{
case matrix_transform_t::NonTranspose : return _h_data[y + (x * _m)];
case matrix_transform_t::Transpose : return _h_data[x + (y * _m)];
default: return 0;
}
}
/**
* Return the distance in memory (in elements) between entries of two
* consecutive columns that share the same row index (the leading dimension), subject to the optional \p transpose_op
*/
int leading_dim(matrix_transform_t transpose_op = matrix_transform_t::NonTranspose) const
{
switch (transpose_op)
{
case matrix_transform_t::NonTranspose : return _m;
case matrix_transform_t::Transpose : return _n;
default: return 0;
}
}
/**
* Get host data pointer
*/
value_t* h_data()
{
return _h_data.data();
}
/**
* Get host data pointer
*/
value_t const* h_data() const
{
return _h_data.data();
}
/**
* Get device data pointer
*/
value_t const* d_data() const
{
return _d_data;
}
/**
* Get device data pointer
*/
value_t * d_data()
{
return _d_data;
}
//-----------------------------------------------------------------------------
// Initialization
//-----------------------------------------------------------------------------
/**
* Initialize matrix values with a 2D "ramp" defined as
* <tt>values(x, y) = (y * rs) + (x * cs)</tt>
*/
void fill_ramp(
host_value_t rs,
host_value_t cs)
{
for (int x = 0; x < _n; x++)
{
for (int y = 0; y < _m; y++)
{
_h_data[y + (x * _m)] = host_value_t((y * rs) + (x * cs));
}
}
}
/**
* Initialize matrix values such that all the elements of the principal diagonal
* are ones and all other elements are zeros
*/
void fill_identity()
{
for (int j = 0; j < _n; j++)
{
for (int i = 0; i < _m; i++)
{
_h_data[i + j * _m] = host_value_t(i == j ? 1 : 0);
}
}
}
/**
* Initialize matrix values using the random number \p generator. The
* \p generator reference is assumed to be a nullary functor that returns
* values convertible to the matrix \p value_t.
*/
template <typename T>
void fill_random(T & generator)
{
for (int j = 0; j < _n; j++)
{
for (int i = 0; i < _m; i++)
{
_h_data[i + j * _m] = (value_t) generator();
}
}
}
/**
* Element-wise matrix addition
*/
matrix & operator+=(matrix const &mat)
{
for (int j = 0; j < _n; j++)
{
for (int i = 0; i < _m; i++)
{
_h_data[i + j * _m] += mat._h_data[i + j * _m];
}
}
return *this;
}
/**
* Element-wise matrix subtraction
*/
matrix & operator-=(matrix const &mat)
{
for (int j = 0; j < _n; j++)
{
for (int i = 0; i < _m; i++)
{
_h_data[i + j * _m] -= mat._h_data[i + j * _m];
}
}
return *this;
}
//-----------------------------------------------------------------------------
// Output
//-----------------------------------------------------------------------------
/**
* Prints matrix in CSV to output stream
*/
template <typename _hv_t>
std::ostream & write_matrix(std::ostream &out, _hv_t)
{
for (int i = 0; i < _m; i++)
{
for (int j = 0; j < _n; j++)
{
out << (j ? "," : "") << _h_data[i + j * _m];
}
out << "\n";
}
return out;
}
/**
* Prints matrix in CSV to output stream
*/
std::ostream & write_matrix(std::ostream &out, int8_t)
{
for (int i = 0; i < _m; i++)
{
for (int j = 0; j < _n; j++)
{
out << (j ? "," : "") << int32_t(_h_data[i + j * _m]);
}
out << "\n";
}
return out;
}
/**
* Prints matrix in CSV to output stream
*/
std::ostream & write_matrix(std::ostream &out)
{
return write_matrix(out, _h_data[0]);
}
//-----------------------------------------------------------------------------
// Floating point "almost-equal" utilities
//-----------------------------------------------------------------------------
static bool almost_equal_ulps(half_t a, half_t b, int max_ulps)
{
if (a == b)
return true;
int32_t int_diff = abs(a.raw() - b.raw());
if (int_diff <= max_ulps)
return true;
return false;
}
static bool almost_equal_ulps(float a, float b, int max_ulps)
{
if (a == b)
return true;
int32_t int_diff = abs(*(int32_t*)&a - *(int32_t*)&b);
if (int_diff <= max_ulps)
return true;
return false;
}
static bool almost_equal_ulps(double a, double b, int max_ulps)
{
if (a == b)
return true;
int64_t int_diff = abs(*(int64_t*)&a - *(int64_t*)&b);
if (int_diff <= max_ulps)
return true;
return false;
}
static bool almost_equal_ulps(int32_t a, int32_t b, int max_ulps)
{
return (a == b);
}
//-----------------------------------------------------------------------------
// matrix operations
//-----------------------------------------------------------------------------
/**
* Returns matrix equality
*/
bool operator==(const matrix<value_t> &mat) const
{
int max_ulps = 30;
if (_m != mat._m || _n != mat._n)
{
fprintf(stderr, "Error: dimension mismatch during matrix comparison.\n"); exit(1);
}
for (int j = 0; j < _n; j++)
{
for (int i = 0; i < _m; i++)
{
if (!almost_equal_ulps(_h_data[i + j * _m], mat._h_data[i + j * _m], max_ulps))
{
return false;
}
}
}
return true;
}
/**
* Returns matrix inequality
*/
bool operator!=(const matrix<value_t> &mat) const
{
return !(*this == mat);
}
/**
* Computes this = (alpha * op(A) * op(B)) + (beta * this), specialized for gemm_nn
*/
template <typename multiplicand_t>
void gemm(
matrix_transform_t transform_a,
matrix_transform_t transform_b,
host_value_t alpha,
const matrix<multiplicand_t> &A,
const matrix<multiplicand_t> &B,
host_value_t beta)
{
// Sanity check dimensions
if ((_m != A.height(transform_a)) ||
(_n != B.width(transform_b)) ||
(A.width(transform_a) != B.height(transform_b)))
{
fprintf(stderr, "Error: dimension mismatch during gemm.\n");
exit(1);
}
int M = A.height(transform_a);
int K = A.width(transform_a);
int N = B.width(transform_b);
// Even the host-side implementation utilizes a blocking structure to improve
// verification performance
int DimBlockM = (M % 16 == 0) ? 16 : 1;
int DimBlockN = (N % 16 == 0) ? 16 : 1;
for (int i = 0; i < M; i += DimBlockM)
{
for (int j = 0; j < N; j += DimBlockN)
{
for (int block_y = 0; block_y < DimBlockM; block_y++)
{
for (int block_x = 0; block_x < DimBlockN; block_x++)
{
int y = i + block_y;
int x = j + block_x;
host_value_t accum(0);
for (int k = 0; k < K; k++)
{
accum += host_value_t(A.get(k, y, transform_a)) * host_value_t(B.get(x, k, transform_b));
}
_h_data[y + x * M] = (alpha * accum) + (beta * _h_data[y + x * M]);
}
}
}
}
}
};
} // namespace cutlass
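
A brief sketch of typical use of this container for a host-side reference check (hypothetical; mirrors the pattern in the test driver above):

// Hypothetical example: fill two operands, compute the blocked host reference
// GEMM, then push the result to the GPU replica.
#include <random>
#include "util/matrix.h"

void reference_gemm_example()
{
    using namespace cutlass;
    int m = 64, n = 32, k = 16;
    matrix<float> A(m, k), B(k, n), C(m, n);
    std::default_random_engine eng;
    std::uniform_int_distribution<int> dist(-3, 3);
    auto gen = [&]() { return dist(eng); };   // nullary functor, as fill_random expects
    A.fill_random(gen);
    B.fill_random(gen);
    C.fill_ramp(0, 0);                        // zero-initialize C
    // C = 1.0 * A * B + 0.0 * C on the host
    C.gemm(matrix_transform_t::NonTranspose, matrix_transform_t::NonTranspose,
           1.0f, A, B, 0.0f);
    C.sync_device();                          // synchronize the device-side copy
}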


@ -1,99 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* GPU kernel timer
*/
#include <cuda_runtime.h>
#include <cutlass/util/debug.h>
namespace cutlass {
/******************************************************************************
* gpu_timer
******************************************************************************/
/**
* GPU event-based timer
*/
struct gpu_timer
{
cudaEvent_t _start;
cudaEvent_t _stop;
gpu_timer()
{
CUDA_PERROR_EXIT(cudaEventCreate(&_start));
CUDA_PERROR_EXIT(cudaEventCreate(&_stop));
}
~gpu_timer()
{
CUDA_PERROR_EXIT(cudaEventDestroy(_start));
CUDA_PERROR_EXIT(cudaEventDestroy(_stop));
}
void start()
{
CUDA_PERROR_EXIT(cudaEventRecord(_start, 0));
}
void stop()
{
CUDA_PERROR_EXIT(cudaEventRecord(_stop, 0));
}
float elapsed_millis()
{
float elapsed = 0.0;
CUDA_PERROR_EXIT(cudaEventSynchronize(_stop));
CUDA_PERROR_EXIT(cudaEventElapsedTime(&elapsed, _start, _stop));
return elapsed;
}
};
/******************************************************************************
* sleep_millis
******************************************************************************/
#ifdef _WIN32
#include <windows.h>
void sleep_millis(unsigned milliseconds)
{
Sleep(milliseconds);
}
#else
#include <unistd.h>
void sleep_millis(unsigned milliseconds)
{
usleep(milliseconds * 1000); // takes microseconds
}
#endif
} // namespace cutlass
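
A short sketch of timing a device operation with this helper (hypothetical):

// Hypothetical example: time a device-to-device copy with gpu_timer.
#include <cstdio>
#include <cuda_runtime.h>
#include "util/timer.h"

void time_copy(const float *d_src, float *d_dst, size_t count)
{
    cutlass::gpu_timer timer;
    timer.start();
    cudaMemcpy(d_dst, d_src, count * sizeof(float), cudaMemcpyDeviceToDevice);
    timer.stop();   // elapsed_millis() below synchronizes on the stop event
    printf("copy took %.3f ms\n", timer.elapsed_millis());
}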


@ -1,155 +0,0 @@
/******************************************************************************
* Copyright (c) 2011-2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are not permitted.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
******************************************************************************/
#pragma once
/**
* \file
* \brief Utilities for converting between types and assessing traits
*/
#include "half.h"
namespace cutlass {
/******************************************************************************
* Float conversion utilities
******************************************************************************/
/// Convert float to value type
template <typename value_t>
value_t from_float(float val)
{
return value_t(val);
}
/// Convert float to value type (__half specialization)
template <>
__half from_float<__half>(float val)
{
return half_t(val);
}
/******************************************************************************
* Type conversion utilities
******************************************************************************/
/// Member \p type is defined as the signed integer type having the same size as \p T
template <typename T>
struct integer_alias;
template <>
struct integer_alias<int8_t> {
using type = int8_t;
};
template <>
struct integer_alias<half_t> {
using type = int16_t;
};
template <>
struct integer_alias<__half> {
using type = int16_t;
};
template <>
struct integer_alias<float> {
using type = int32_t;
};
template <>
struct integer_alias<int> {
using type = int32_t;
};
template <>
struct integer_alias<double> {
using type = int64_t;
};
/******************************************************************************
* Type-info utilities
******************************************************************************/
/// Returns a string to prefix 'gemm' to construct CUBLAS-like kernel names
template <math_operation_class_t math_op, typename value_t, typename accum_t> char const *to_prefix_string();
template <> char const *to_prefix_string<math_operation_class_t::scalar, half_t, half_t>() {
return "H";
}
template <> char const *to_prefix_string<math_operation_class_t::scalar, __half, __half>() {
return "H";
}
template <> char const *to_prefix_string<math_operation_class_t::scalar, float, float>() {
return "S";
}
template <> char const *to_prefix_string<math_operation_class_t::matrix, __half, __half>() {
return "WmmaH";
}
template <> char const *to_prefix_string<math_operation_class_t::matrix, __half, float>() {
return "WmmaS";
}
template <> char const *to_prefix_string<math_operation_class_t::scalar, double, double>() {
return "D";
}
template <> char const *to_prefix_string<math_operation_class_t::scalar, int8_t, int32_t>() {
return "I";
}
/******************************************************************************
* Maps value_t to the minimum vector size used to load operand
******************************************************************************/
template <typename T>
struct operand_load_type;
template <>
struct operand_load_type<int8_t> { using type = int32_t; };
template <typename T>
struct operand_load_type { using type = T; };
/******************************************************************************
* Minimum alignment requirement, if any, determined from value_t.
******************************************************************************/
template <typename value_t>
struct gemm_alignment_requirement;
template <>
struct gemm_alignment_requirement<uint8_t> { static const int value = 4; };
template <typename value_t>
struct gemm_alignment_requirement { static const int value = 0; };
} // namespace cutlass
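
A minimal sketch of how these traits are consumed (hypothetical):

// Hypothetical example: convert a scalar to the multiplicand type and build a
// CUBLAS-like kernel-name prefix for logging.
#include <cstdio>
#include "util/half.h"
#include "util/type_conversion.h"

void print_kernel_name_example()
{
    using namespace cutlass;
    __half alpha = from_float<__half>(1.0f);   // uses the __half specialization via half_t
    (void) alpha;
    const char *prefix =
        to_prefix_string<math_operation_class_t::scalar, float, float>();
    printf("%sgemm\n", prefix);                // prints "Sgemm"
}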


@ -0,0 +1,145 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: aligned_buffer.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">aligned_buffer.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for aligned_buffer.h:</div>
<div class="dyncontent">
<div class="center"><img src="aligned__buffer_8h__incl.png" border="0" usemap="#aligned__buffer_8h" alt=""/></div>
<map name="aligned__buffer_8h" id="aligned__buffer_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="aligned__buffer_8h__dep__incl.png" border="0" usemap="#aligned__buffer_8hdep" alt=""/></div>
<map name="aligned__buffer_8hdep" id="aligned__buffer_8hdep">
</map>
</div>
</div>
<p><a href="aligned__buffer_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1AlignedBuffer.html">cutlass::AlignedBuffer&lt; T, N, Align &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment. <a href="structcutlass_1_1AlignedBuffer.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
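
To complement the generated reference above, a brief device-side sketch of the shared-memory staging pattern the class supports (a hypothetical kernel; see the page above for the exact template parameters):

// Hypothetical CUDA example: stage 1024 half-precision elements in shared memory
// with AlignedBuffer's guaranteed alignment (16 bytes by default), then copy them out.
#include "cutlass/aligned_buffer.h"
#include "cutlass/half.h"

__global__ void staging_kernel(cutlass::half_t const *in, cutlass::half_t *out)
{
    __shared__ cutlass::AlignedBuffer<cutlass::half_t, 1024> smem;
    cutlass::half_t *tile = smem.data();
    int idx = threadIdx.x;
    if (idx < 1024) {
        tile[idx] = in[idx];    // global -> shared
    }
    __syncthreads();
    if (idx < 1024) {
        out[idx] = tile[idx];   // shared -> global
    }
}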


@ -0,0 +1 @@
6cbc6b81ede44b5f08afd4f4519d56d1


@ -0,0 +1 @@
b26c62930ff7668b89f2ee6624e0be3a

File diff suppressed because one or more lines are too long

867
docs/annotated.html Normal file

@ -0,0 +1,867 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Class List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li class="current"><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
</div><!-- top -->
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Class List</div> </div>
</div><!--header-->
<div class="contents">
<div class="textblock">Here are the classes, structs, unions and interfaces with brief descriptions:</div><div class="directory">
<div class="levels">[detail level <span onclick="javascript:toggleLevel(1);">1</span><span onclick="javascript:toggleLevel(2);">2</span><span onclick="javascript:toggleLevel(3);">3</span><span onclick="javascript:toggleLevel(4);">4</span><span onclick="javascript:toggleLevel(5);">5</span><span onclick="javascript:toggleLevel(6);">6</span>]</div><table class="directory">
<tr id="row_0_" class="even"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_0_" class="arrow" onclick="toggleFolder('0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass.html" target="_self">cutlass</a></td><td class="desc"></td></tr>
<tr id="row_0_0_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_0_" class="arrow" onclick="toggleFolder('0_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1arch.html" target="_self">arch</a></td><td class="desc"></td></tr>
<tr id="row_0_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation </td></tr>
<tr id="row_0_0_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 2 &gt;, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 4 &gt;, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 2, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_0116_00_014_01_4_00_0132_00_01half_0bcc4d05f9811035f08cc1b7f0154a4d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 16, 4 &gt;, 32, half_t, LayoutA, half_t, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation specialized for the entire warp </td></tr>
<tr id="row_0_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__02a3f19a78995f97d793a668e0e4d4f0.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__96363097c47b056f0ca1911afd7f8b7a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation - F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_01128_01_4_00_0132_00_01uint15918972b95027764b3a849b03075ed2b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 128 &gt;, 32, uint1b_t, layout::RowMajor, uint1b_t, layout::ColumnMajor, int, layout::RowMajor, OpXorPopc &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__927179f46017ea5f58f859f1196c4829.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
<tr id="row_0_0_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__8ebae0cbdf333fddfe5c24d35ebe8e02.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
<tr id="row_0_0_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__5299c9c90c8f2f521be0c8cec1c3eb08.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__f083347e265b1e9eea5572d86ddb6bf9.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_a62aa63a212985df306fb27e8a50aeae.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
<tr id="row_0_0_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_ab741d81fdc991345cb9e43c29fca573.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
<tr id="row_0_0_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_5221708cec5828d35db1d1c47cb4964e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_bef0c048bc0f8ba2d875cb7ab26d363b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_6e513ccbc44ae7909a60d93b9b5435b3.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
<tr id="row_0_0_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_0ee08a4520882d24ba9026879265e892.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
<tr id="row_0_0_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_4746fc55e614df0016c518d3fda2677e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
<tr id="row_0_0_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_546e9ec6de6a5970b326da6f6280f1d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
<tr id="row_0_0_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b03e3b50dbcb30d0d1ac062f3a9d5abef.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
<tr id="row_0_0_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b6d968039dde5c9f062ab15f90a8049fe.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
<tr id="row_0_0_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4bc4b6ba004e25c44bfd9266c61f937dfb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
<tr id="row_0_0_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b451d5cf5d7e8cbbe476afe3dab5c09b2.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
<tr id="row_0_0_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_b0242d7a01097510effbc4718040d3e5.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_c7f88bfd32a544fba8111d2dcadeab11.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_44a3b2a8df88a2b067f1284515cb5371.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_4b7308177b308a272c1889fbe9670275.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_5a9888862cebd333ecaf11f7262f77d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_31defda8ea2b7d855642ffd77da1a411.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_839a7c8bb938d1661f4611e68f85d8cb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_73d9802d6b944a5299bc255887db6bbc.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmma.html" target="_self">PtxWmma</a></td><td class="desc">WMMA Matrix multiply-add operation </td></tr>
<tr id="row_0_0_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadA.html" target="_self">PtxWmmaLoadA</a></td><td class="desc">WMMA PTX string load for A, B, and C matrices </td></tr>
<tr id="row_0_0_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadB.html" target="_self">PtxWmmaLoadB</a></td><td class="desc"></td></tr>
<tr id="row_0_0_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadC.html" target="_self">PtxWmmaLoadC</a></td><td class="desc"></td></tr>
<tr id="row_0_0_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaStoreD.html" target="_self">PtxWmmaStoreD</a></td><td class="desc">WMMA store for matrix D </td></tr>
<tr id="row_0_0_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm50.html" target="_self">Sm50</a></td><td class="desc"></td></tr>
<tr id="row_0_0_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm60.html" target="_self">Sm60</a></td><td class="desc"></td></tr>
<tr id="row_0_0_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm61.html" target="_self">Sm61</a></td><td class="desc"></td></tr>
<tr id="row_0_0_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm70.html" target="_self">Sm70</a></td><td class="desc"></td></tr>
<tr id="row_0_0_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm72.html" target="_self">Sm72</a></td><td class="desc"></td></tr>
<tr id="row_0_0_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm75.html" target="_self">Sm75</a></td><td class="desc"></td></tr>
<tr id="row_0_0_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1half__t_00_01LayoutA___00_01cutlass_1_84e30c8cc93eeb7ca02f651bd16d4c38.html" target="_self">Wmma&lt; Shape_, cutlass::half_t, LayoutA_, cutlass::half_t, LayoutB_, ElementC_, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1int4b__t_00_01LayoutA___00_01cutlass_16fd808a90b3cf9d7cfc99f30888ca3fe.html" target="_self">Wmma&lt; Shape_, cutlass::int4b_t, LayoutA_, cutlass::int4b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1uint1b__t_00_01LayoutA___00_01cutlass_c80a7ea4d219cd9b13b560b493338028.html" target="_self">Wmma&lt; Shape_, cutlass::uint1b_t, LayoutA_, cutlass::uint1b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpXorPopc &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01int8__t_00_01LayoutA___00_01int8__t_00_01LayoutB_505c57bb6818a941dc16f00cf35a9ec0.html" target="_self">Wmma&lt; Shape_, int8_t, LayoutA_, int8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_61_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01uint8__t_00_01LayoutA___00_01uint8__t_00_01Layout219a464a1248ebfc37aa29bcb10cb1b0.html" target="_self">Wmma&lt; Shape_, uint8_t, LayoutA_, uint8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_1_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_1_" class="arrow" onclick="toggleFolder('0_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1device__memory.html" target="_self">device_memory</a></td><td class="desc"></td></tr>
<tr id="row_0_1_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_1_0_" class="arrow" onclick="toggleFolder('0_1_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation.html" target="_self">allocation</a></td><td class="desc">Device allocation abstraction that tracks size and capacity </td></tr>
<tr id="row_0_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation_1_1deleter.html" target="_self">deleter</a></td><td class="desc">Delete functor for CUDA device memory </td></tr>
<tr id="row_0_2_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_2_" class="arrow" onclick="toggleFolder('0_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue.html" target="_self">epilogue</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_0_" class="arrow" onclick="toggleFolder('0_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_0_" class="arrow" onclick="toggleFolder('0_2_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1Convert.html" target="_self">Convert</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1Convert_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_1_" class="arrow" onclick="toggleFolder('0_2_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombination.html" target="_self">LinearCombination</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_1_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombination_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_2_" class="arrow" onclick="toggleFolder('0_2_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp.html" target="_self">LinearCombinationClamp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_3_" class="arrow" onclick="toggleFolder('0_2_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu.html" target="_self">LinearCombinationRelu</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_4_" class="arrow" onclick="toggleFolder('0_2_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_01int_00_01float_00_01Round_01_4.html" target="_self">LinearCombinationRelu&lt; ElementOutput_, Count, int, float, Round &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_00274a94522c46cd041d0b10d484e2ef3.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_5_" class="arrow" onclick="toggleFolder('0_2_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus.html" target="_self">ReductionOpPlus</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_1_" class="arrow" onclick="toggleFolder('0_2_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_" class="arrow" onclick="toggleFolder('0_2_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" target="_self">RowArrangement</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> determines how one or more warps cover a region of consecutive rows </td></tr>
<tr id="row_0_2_1_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini91159e6f7e123d881e3ec45101fa4f81.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, false &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 1D tiled arrangement </td></tr>
<tr id="row_0_2_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_2_" class="arrow" onclick="toggleFolder('0_2_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini6d8790249bf12cac580da73bb37eb791.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, true &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 2D tiled arrangement </td></tr>
<tr id="row_0_2_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemainief28e98b3f284469f271d28aba73de2e.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueComplexTensorOp.html" target="_self">DefaultEpilogueComplexTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueSimt.html" target="_self">DefaultEpilogueSimt</a></td><td class="desc">Defines sensible defaults for epilogues for SimtOps </td></tr>
<tr id="row_0_2_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueTensorOp.html" target="_self">DefaultEpilogueTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueVoltaTensorOp.html" target="_self">DefaultEpilogueVoltaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueWmmaTensorOp.html" target="_self">DefaultEpilogueWmmaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for WMMA TensorOps </td></tr>
<tr id="row_0_2_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedEpilogueTensorOp.html" target="_self">DefaultInterleavedEpilogueTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_7_" class="arrow" onclick="toggleFolder('0_2_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp.html" target="_self">DefaultInterleavedThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_8_" class="arrow" onclick="toggleFolder('0_2_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt.html" target="_self">DefaultThreadMapSimt</a></td><td class="desc">Defines the optimal thread map for SIMT accumulator layouts </td></tr>
<tr id="row_0_2_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_9_" class="arrow" onclick="toggleFolder('0_2_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp.html" target="_self">DefaultThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp.html" target="_self">DefaultThreadMapVoltaTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_11_" class="arrow" onclick="toggleFolder('0_2_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__95db04b7b72e34283958bd7fbf851d16.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, float &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__52116c60c62f0fd520071558e42b814f.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_12_" class="arrow" onclick="toggleFolder('0_2_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__d58c94abc36b7c5c109b55202c6992e7.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, half_t &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__4433cc988100e98097a748d2670fb0fc.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_13_" class="arrow" onclick="toggleFolder('0_2_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp.html" target="_self">DefaultThreadMapWmmaTensorOp</a></td><td class="desc">Defines the optimal thread map for Wmma TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_14_" class="arrow" onclick="toggleFolder('0_2_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp.html" target="_self">DirectEpilogueTensorOp</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator </td></tr>
<tr id="row_0_2_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure for host-constructible state </td></tr>
<tr id="row_0_2_1_14_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" target="_self">Epilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
<tr id="row_0_2_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_16_" class="arrow" onclick="toggleFolder('0_2_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase.html" target="_self">EpilogueBase</a></td><td class="desc">Base class for epilogues defining warp-level </td></tr>
<tr id="row_0_2_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_17_" class="arrow" onclick="toggleFolder('0_2_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue.html" target="_self">InterleavedEpilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
<tr id="row_0_2_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_18_" class="arrow" onclick="toggleFolder('0_2_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap.html" target="_self">InterleavedOutputTileThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_19_" class="arrow" onclick="toggleFolder('0_2_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator.html" target="_self">InterleavedPredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
<tr id="row_0_2_1_19_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_20_" class="arrow" onclick="toggleFolder('0_2_1_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap.html" target="_self">OutputTileOptimalThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1CompactedThreadMap.html" target="_self">CompactedThreadMap</a></td><td class="desc">Compacted thread map in which the 4D region is contiguous </td></tr>
<tr id="row_0_2_1_20_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileShape.html" target="_self">OutputTileShape</a></td><td class="desc">Tuple defining point in output tile </td></tr>
<tr id="row_0_2_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileThreadMap.html" target="_self">OutputTileThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_23_" class="arrow" onclick="toggleFolder('0_2_1_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
<tr id="row_0_2_1_23_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1SharedLoadIterator.html" target="_self">SharedLoadIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_2_" class="arrow" onclick="toggleFolder('0_2_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp.html" target="_self">FragmentIteratorComplexTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp_3_01WarpShape___00_01Operato8cf03c624cf3210c71b7cbd580b080f8.html" target="_self">FragmentIteratorComplexTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt.html" target="_self">FragmentIteratorSimt</a></td><td class="desc">Fragment iterator for SIMT accumulator arrangements </td></tr>
<tr id="row_0_2_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt_3_01WarpShape___00_01Operator___00_01la3f2abc523201c1b0228df99119ab88e1.html" target="_self">FragmentIteratorSimt&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp.html" target="_self">FragmentIteratorTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_e459aab140a2ce78336e584f95886726.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Dedicated to interleaved layout </td></tr>
<tr id="row_0_2_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_5e78dabe303f20d76b00c600aab61eda.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp.html" target="_self">FragmentIteratorVoltaTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gdb805a2dc5571ac3b66e0fe6ffdcede2.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1G16e08718cffa0989cce3fe8dbc4b075b.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp.html" target="_self">FragmentIteratorWmmaTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShfdb1f120c6797383663f9fd11d0fc599.html" target="_self">FragmentIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy.html" target="_self">SimtPolicy</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy_3_01WarpShape___00_01Operator___00_01layout_1_1Rcef1c60e23e997017ae176c92931151d.html" target="_self">SimtPolicy&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy.html" target="_self">TensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
<tr id="row_0_2_2_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout69549d10c3610d943987eb90e827bc05.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Partial specialization for column-major-interleaved </td></tr>
<tr id="row_0_2_2_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout_1_1RowMajor_01_4.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt.html" target="_self">TileIteratorSimt</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt_3_01WarpShape___00_01Operator___00_01Elemenf2bd262ed3e202b25d5802d83965bf3b.html" target="_self">TileIteratorSimt&lt; WarpShape_, Operator_, Element_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp.html" target="_self">TileIteratorTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_20_" class="arrow" onclick="toggleFolder('0_2_2_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___003cbb32beb84b4984cb7853662096d289.html" target="_self">TileIteratorTensorOp&lt; WarpShape_, OperatorShape_, Element_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___05f11e023c9e6ee5f7a888fa4c5bbf6d1.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp.html" target="_self">TileIteratorVoltaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_22_" class="arrow" onclick="toggleFolder('0_2_2_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmS2fe0c60b727c738c622c18fc3dd76644.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemm770cbca45441d295d5d7433e8222a700.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_23_" class="arrow" onclick="toggleFolder('0_2_2_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmSa0ceeeddc22575876eb977da7f5416a8.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemmffcab2297c8de8d0013602a39c525b78.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp.html" target="_self">TileIteratorWmmaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShape_fd6a91cd8bbd07ecd1344326b830e3a4.html" target="_self">TileIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorFragment_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy.html" target="_self">VoltaTensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
<tr id="row_0_2_2_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_136ce744d4c1c6e8707f5a9785196194.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_1d48185f49e4d066f8e9327bf0856b7f.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_3_" class="arrow" onclick="toggleFolder('0_2_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1EpilogueWorkspace.html" target="_self">EpilogueWorkspace</a></td><td class="desc"></td></tr>
<tr id="row_0_2_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_2_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_3_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_3_" class="arrow" onclick="toggleFolder('0_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm.html" target="_self">gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_0_" class="arrow" onclick="toggleFolder('0_3_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration.html" target="_self">DefaultGemmConfiguration</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag286687c5e6abe22d241f789fe344a465.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag3026e48abb8c905d1cc6d13d669700e4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc567cad318a31d04b70ea615d6321decd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm70, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcde61af9be1337dac1fdb210e7e7a6e01.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc485a4f0b5a7d2d4ab2c1a24da6328048.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8e2604a56dff3a7595da9ee0604ae55e.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc4fada4957d463c80a2831e47f28157c4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8ab5fd2693c6a6ec43e447acb07f784c.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcffcf31256aed23d4d8d0eab627bc0cad.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb2e258b7bd321c633dd65d3ebcf6414a.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb27bf218007928652d5b803193eab473.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcfea0f3503156e8e3fba6456f0cedafdd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassWmmaTensorOp_00_0884059ecad03bea3e86c4cf722226097.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassWmmaTensorOp, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_14_" class="arrow" onclick="toggleFolder('0_3_0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_15_" class="arrow" onclick="toggleFolder('0_3_0_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html" target="_self">Gemm&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero &gt;</a></td><td class="desc">Parital specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layou1b211cc9c97c022d8fe10f2dd32c8709.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_16_" class="arrow" onclick="toggleFolder('0_3_0_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_17_" class="arrow" onclick="toggleFolder('0_3_0_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html" target="_self">GemmBatched&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ &gt;</a></td><td class="desc">Parital specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_213d78696663f4231cd52c6a277c60e5.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_18_" class="arrow" onclick="toggleFolder('0_3_0_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex.html" target="_self">GemmComplex</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_19_" class="arrow" onclick="toggleFolder('0_3_0_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html" target="_self">GemmComplex&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial &gt;</a></td><td class="desc">Parital specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_a3923967cafb5cb9774c320dc24baa77.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_20_" class="arrow" onclick="toggleFolder('0_3_0_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_21_" class="arrow" onclick="toggleFolder('0_3_0_21_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01ElementBbe7c1f7154ad5b5bf9d4d28301e2b457.html" target="_self">GemmSplitKParallel&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ConvertScaledOp_, ReductionOp_, ThreadblockSwizzle_, Stages, kAlignmentA, kAlignmentB, Operator_ &gt;</a></td><td class="desc">Partial specialization for column-major output </td></tr>
<tr id="row_0_3_0_21_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01Elementafcb1aeaf2035a7ac769d7acc233423b.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_1_" class="arrow" onclick="toggleFolder('0_3_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_0_" class="arrow" onclick="toggleFolder('0_3_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1detail_1_1GemvBatchedStridedEpilogueScaling.html" target="_self">GemvBatchedStridedEpilogueScaling</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm.html" target="_self">DefaultGemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01layout_1_1ColumnMajorInterleave661fe54d13cc2c9153dcdf31e4beaa30.html" target="_self">DefaultGemm&lt; ElementA, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, kAlignmentA, ElementB, layout::RowMajorInterleaved&lt; InterleavedK &gt;, kAlignmentB, ElementC, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, int32_t, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, IsBetaZero &gt;</a></td><td class="desc">Partial specialization for Turing Integer Matrix Multiply Interleaved layout </td></tr>
<tr id="row_0_3_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01Edd80343e6570718ed237122e4ebf7fb5.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 1 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for SIMT </td></tr>
<tr id="row_0_3_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E044b039b2fe402f29b04a9f5feee5342.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm70, ThreadblockShape, WarpShape, GemmShape&lt; 8, 8, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Volta architecture </td></tr>
<tr id="row_0_3_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E5d78d37a9ae2ec08d7d477d571df036e.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Turing Architecture </td></tr>
<tr id="row_0_3_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_01inf48440732c1c5f42ddbfaba179861815.html" target="_self">DefaultGemm&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementC, LayoutC, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, false &gt;</a></td><td class="desc">Partial specialization for SIMT DP4A </td></tr>
<tr id="row_0_3_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemmSplitKParallel.html" target="_self">DefaultGemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemv.html" target="_self">DefaultGemv</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_9_" class="arrow" onclick="toggleFolder('0_3_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_9_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1Gemm_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_1_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_10_" class="arrow" onclick="toggleFolder('0_3_1_10_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_10_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_10_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_11_" class="arrow" onclick="toggleFolder('0_3_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_11_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_2_" class="arrow" onclick="toggleFolder('0_3_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_2_0_" class="arrow" onclick="toggleFolder('0_3_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1EnableMma__Crow__SM60.html" target="_self">EnableMma_Crow_SM60</a></td><td class="desc">Determines whether to enable thread::Gemm&lt;&gt; specializations compatible with SM50 </td></tr>
<tr id="row_0_3_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2.html" target="_self">Mma_HFMA2</a></td><td class="desc">Structure to compute the matrix product for HFMA </td></tr>
<tr id="row_0_3_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_72621f7ab9ae4a4ba4fe9725cf8e89c1.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_94c813e3bbfb6f9857c155166f772687.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_17070298bc4cced0a1b98aee2bb6b455.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_bf6d29bb09a025e7b96942809743e28a.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l26a133b13650c1d058273e3649f60f04.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01lbba3a796be96a0276693ef6b259ecc4a.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l2aa4d2fd2e940e0d0cf7c47bc8f6017c.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l086c058a15d6c79558e4f3d9ff1dc148.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_0e1104c65871c539155bd3a0c7631928b.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::ColumnMajor, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_07ac147cb320ee0d28ff8e78eb4cd330e.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::RowMajor, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Structure to compute the matrix product </td></tr>
<tr id="row_0_3_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01ElementA___00_01LayoutA___00_01ElementB_e41c1cd6078b6d1347fac239b0639d56.html" target="_self">Mma&lt; Shape_, ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for FFMA and DFMA GEMM </td></tr>
<tr id="row_0_3_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA_00_01half__t_00_01L066c9d2371712cdf0cac099ca9bcc578.html" target="_self">Mma&lt; Shape_, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Structure to compute the matrix product </td></tr>
<tr id="row_0_3_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA___00_01half__t_00_088f0e99e501b6012297eb30b4e89bcea.html" target="_self">Mma&lt; Shape_, half_t, LayoutA_, half_t, LayoutB_, half_t, layout::RowMajor, arch::OpMultiplyAdd, typename platform::enable_if&lt; detail::EnableMma_Crow_SM60&lt; LayoutA_, LayoutB_ &gt;::value &gt;::type &gt;</a></td><td class="desc">Computes matrix product when C is row-major </td></tr>
<tr id="row_0_3_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1ColumnMajor_00_013f3785e722edc6e9aab6f866309b8623.html" target="_self">Mma&lt; Shape_, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, int8_t &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for IDP4A </td></tr>
<tr id="row_0_3_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1RowMajor_00_01int89c659e7faf47264972bdba6cd80f42b.html" target="_self">Mma&lt; Shape_, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for IDP4A </td></tr>
<tr id="row_0_3_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1MmaGeneric.html" target="_self">MmaGeneric</a></td><td class="desc">Gemplate that handles all packed matrix layouts </td></tr>
<tr id="row_0_3_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_3_" class="arrow" onclick="toggleFolder('0_3_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultGemvCore.html" target="_self">DefaultGemvCore</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma.html" target="_self">DefaultMma</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_0010764e1fd5a3251a57eddafbd83eab8e.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, true &gt;</a></td><td class="desc">Specialization for column-major-interleaved output </td></tr>
<tr id="row_0_3_3_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00c67c16f9881e4f2fda76d8ed83ebabd6.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass Simt) </td></tr>
<tr id="row_0_3_3_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00ce36642cae579bce6605ff8edde3c6ab.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass Simt) </td></tr>
<tr id="row_0_3_3_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_07e7230d4011ada5e22cfcb29103b696.html" target="_self">DefaultMma&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, 2, Operator, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore.html" target="_self">DefaultMmaCore</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab94a11a77dd0565102710907089acee0.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShafafd5c61db86cbfe90863578ddd11092.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha46446d1e3871e31d2e728f710d78c8c1.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_, &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha8da7a0cfbbe859b701fdd9f2b8566aa7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha84e9f8afb6a4ca9f5dcd219b182d16e7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha2c0d0b7cdb5c4bcb11e83c058eb65345.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha34a52cc7b2942e8c290f0032b6779b52.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaaf312aafe9da92ea9d417bcc12a8e7dc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha863d4139ccaa713bc4bde32c425f4067.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaf03a122202ad10acdc96f280106d678b.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha69bef08ea63dd930f99d9788105873dd.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha3adf608332a8c9ee7014fced0da8a9ca.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab7edfba3cdf43a07e3c4d719d87565a4.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc803d38bc1e4618c07c47f54c87ae2678.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instrucf60fe02fcdd80d28b7fd419133465dcc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc2bf00737f4ad0a9da9a8be6d3e66c152.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, ElementB_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_, AccumulatorsInRowMajor &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc24092ddc01fc83dabb7db4c14880fe60.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc4fee9f2965b8468bfb42b94a74527d22.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmBatchedIdentityThreadblockSwizzle.html" target="_self">GemmBatchedIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMMs </td></tr>
<tr id="row_0_3_3_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmHorizontalThreadblockSwizzle.html" target="_self">GemmHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
<tr id="row_0_3_3_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmIdentityThreadblockSwizzle.html" target="_self">GemmIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
<tr id="row_0_3_3_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKHorizontalThreadblockSwizzle.html" target="_self">GemmSplitKHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
<tr id="row_0_3_3_29_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKIdentityThreadblockSwizzle.html" target="_self">GemmSplitKIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
<tr id="row_0_3_3_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1Gemv.html" target="_self">Gemv</a></td><td class="desc">Structure to compute the matrix-vector product using SIMT math instructions </td></tr>
<tr id="row_0_3_3_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemvBatchedStridedThreadblockDefaultSwizzle.html" target="_self">GemvBatchedStridedThreadblockDefaultSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMVs </td></tr>
<tr id="row_0_3_3_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_3_32_" class="arrow" onclick="toggleFolder('0_3_3_32_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase.html" target="_self">MmaBase</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_32_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage object needed by threadblock-scoped GEMM </td></tr>
<tr id="row_0_3_3_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaPipelined.html" target="_self">MmaPipelined</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_3_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1MmaPolicy.html" target="_self">MmaPolicy</a></td><td class="desc">Policy object describing MmaTensorOp </td></tr>
<tr id="row_0_3_3_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaSingleStage.html" target="_self">MmaSingleStage</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_4_" class="arrow" onclick="toggleFolder('0_3_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1DefaultMmaTensorOp.html" target="_self">DefaultMmaTensorOp</a></td><td class="desc">Partial specialization for m-by-n-by-kgroup </td></tr>
<tr id="row_0_3_4_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp.html" target="_self">MmaComplexTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp_3_01Shape___00_01complex_3_01RealElementA_01_146441010dad1f40eb51b6dae3ded216.html" target="_self">MmaComplexTensorOp&lt; Shape_, complex&lt; RealElementA &gt;, LayoutA_, complex&lt; RealElementB &gt;, LayoutB_, complex&lt; RealElementC &gt;, LayoutC_, Policy_, TransformA, TransformB, Enable &gt;</a></td><td class="desc">Partial specialization for complex*complex+complex =&gt; complex using real-valued TensorOps </td></tr>
<tr id="row_0_3_4_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimt.html" target="_self">MmaSimt</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaSimtPolicy.html" target="_self">MmaSimtPolicy</a></td><td class="desc">Describes the arrangement and configuration of per-lane operations in warp-level matrix multiply </td></tr>
<tr id="row_0_3_4_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator.html" target="_self">MmaSimtTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_67ca7e11a38e38f2c51b84767654a90f.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_f0ce904a9294556f15e1cc9cf7c99a93.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ea0a4e7ce3cd5d25cabf79383efdf4d9.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ada156b62fcbdce47009c5bf1321c92c.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_4ccafbc821b3a55cd532602442a74031.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::ColumnMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_8f92ea79e85febb67169c4b2d94b1b20.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::RowMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOp.html" target="_self">MmaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator.html" target="_self">MmaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_14_" class="arrow" onclick="toggleFolder('0_3_4_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___008f607b871a2b3d854eb4def64712c042.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___0d35fa5dc4e4b4f72784c943fd857fc1d.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_15_" class="arrow" onclick="toggleFolder('0_3_4_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___00027dabdc144edd6276f664ca74088510.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajorInterleaved&lt; InterleavedN &gt;, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___03822d9be37f3725022005a5434441f22.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_16_" class="arrow" onclick="toggleFolder('0_3_4_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___006c39f57875e0aa9d0ad82c8043ed8b98.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::RowMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___093b5d2838ac5a742704ef62b5c8688f0.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator.html" target="_self">MmaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0b84f53cd44b339eccc12067c9f86e11c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0e52ad425e1ee3e68544873f66733237b.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___039819fb3ccd43786d556c2c9669508ef.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0352e0dcab42bc8360606874e00173556.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_22_" class="arrow" onclick="toggleFolder('0_3_4_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0ed7daaeba1c095e77f68533d4d2c475c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, 64 &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___07638f8b7761f6e2e2e6918e2c05e739.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_23_" class="arrow" onclick="toggleFolder('0_3_4_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0c7d419c589d601ce4eb603be566fea21.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0784c74bd670999ec23ad8ef9dc55777.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpPolicy.html" target="_self">MmaTensorOpPolicy</a></td><td class="desc">Policy </td></tr>
<tr id="row_0_3_4_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOp.html" target="_self">MmaVoltaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_26_" class="arrow" onclick="toggleFolder('0_3_4_26_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator.html" target="_self">MmaVoltaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_26_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator_1_1Policy.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan0d3248553e52cd61ed8a2b3b12a20343.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_29_" class="arrow" onclick="toggleFolder('0_3_4_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan34be8e21a40af3ebd2dc3dff460dca72.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera33cdf53848564e894d4407637dc86caf.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand734577b7e54a074d143aba59828c2f2.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_31_" class="arrow" onclick="toggleFolder('0_3_4_31_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan16c56cdc2dda5eeb996af8ec0242d501.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_31_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera6fa6d2d3725bb3ec613d5c527ea3ffe7.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan5a221944f4a0e16ccab77ba684856942.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operandcc9821c435540895138bc9af495f321.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_34_" class="arrow" onclick="toggleFolder('0_3_4_34_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operana2f40b28f0d2286b84d86f7238d67b52.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_34_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operafa294175b280756dd8388f9ffe7b72c4.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1WarpSize.html" target="_self">WarpSize</a></td><td class="desc">Query the number of threads per warp </td></tr>
<tr id="row_0_3_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1BatchedGemmCoord.html" target="_self">BatchedGemmCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_3_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmCoord.html" target="_self">GemmCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_3_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmShape.html" target="_self">GemmShape</a></td><td class="desc">Shape of a matrix multiply-add operation </td></tr>
<tr id="row_0_4_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_4_" class="arrow" onclick="toggleFolder('0_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1layout.html" target="_self">layout</a></td><td class="desc"></td></tr>
<tr id="row_0_4_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1ColumnMajor.html" target="_self">ColumnMajor</a></td><td class="desc">Mapping function for column-major matrices </td></tr>
<tr id="row_0_4_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorBlockLinear.html" target="_self">ColumnMajorBlockLinear</a></td><td class="desc"></td></tr>
<tr id="row_0_4_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorInterleaved.html" target="_self">ColumnMajorInterleaved</a></td><td class="desc"></td></tr>
<tr id="row_0_4_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ContiguousMatrix.html" target="_self">ContiguousMatrix</a></td><td class="desc"></td></tr>
<tr id="row_0_4_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1GeneralMatrix.html" target="_self">GeneralMatrix</a></td><td class="desc"></td></tr>
<tr id="row_0_4_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose.html" target="_self">LayoutTranspose</a></td><td class="desc">Defines transposes of matrix layouts </td></tr>
<tr id="row_0_4_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1ColumnMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::ColumnMajor &gt;</a></td><td class="desc">Transpose of column-major is row-major </td></tr>
<tr id="row_0_4_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1RowMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::RowMajor &gt;</a></td><td class="desc">Transpose of row-major is column-major </td></tr>
<tr id="row_0_4_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PackedVectorLayout.html" target="_self">PackedVectorLayout</a></td><td class="desc">Tensor layout for densely packed vectors </td></tr>
<tr id="row_0_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PitchLinear.html" target="_self">PitchLinear</a></td><td class="desc">Mapping function for pitch-linear memory </td></tr>
<tr id="row_0_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearCoord.html" target="_self">PitchLinearCoord</a></td><td class="desc">Coordinate in pitch-linear space </td></tr>
<tr id="row_0_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearShape.html" target="_self">PitchLinearShape</a></td><td class="desc">Template defining a shape used by pitch-linear operators </td></tr>
<tr id="row_0_4_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1RowMajor.html" target="_self">RowMajor</a></td><td class="desc">Mapping function for row-major matrices </td></tr>
<tr id="row_0_4_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorBlockLinear.html" target="_self">RowMajorBlockLinear</a></td><td class="desc"></td></tr>
<tr id="row_0_4_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorInterleaved.html" target="_self">RowMajorInterleaved</a></td><td class="desc"></td></tr>
<tr id="row_0_4_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCongruous.html" target="_self">RowMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCrosswise.html" target="_self">RowMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">RowMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorCxRSKx.html" target="_self">TensorCxRSKx</a></td><td class="desc">Mapping function for 4-D CxRSKx tensors </td></tr>
<tr id="row_0_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCHW.html" target="_self">TensorNCHW</a></td><td class="desc">Mapping function for 4-D NCHW tensors </td></tr>
<tr id="row_0_4_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCxHWx.html" target="_self">TensorNCxHWx</a></td><td class="desc">Mapping function for 4-D NC/xHWx tensors </td></tr>
<tr id="row_0_4_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNHWC.html" target="_self">TensorNHWC</a></td><td class="desc">Mapping function for 4-D NHWC tensors </td></tr>
<tr id="row_0_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicand.html" target="_self">TensorOpMultiplicand</a></td><td class="desc"></td></tr>
<tr id="row_0_4_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandColumnMajorInterleaved.html" target="_self">TensorOpMultiplicandColumnMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous.html" target="_self">TensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous_3_0132_00_01Crosswise_01_4.html" target="_self">TensorOpMultiplicandCongruous&lt; 32, Crosswise &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_4_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCrosswise.html" target="_self">TensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandRowMajorInterleaved.html" target="_self">TensorOpMultiplicandRowMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandBCongruous.html" target="_self">VoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" target="_self">VoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCrosswise.html" target="_self">VoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_5_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_5_" class="arrow" onclick="toggleFolder('0_5_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1library.html" target="_self">library</a></td><td class="desc"></td></tr>
<tr id="row_0_5_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArguments.html" target="_self">GemmArguments</a></td><td class="desc">Arguments for GEMM </td></tr>
<tr id="row_0_5_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayArguments.html" target="_self">GemmArrayArguments</a></td><td class="desc">Arguments for GEMM - used by all the GEMM operations </td></tr>
<tr id="row_0_5_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayConfiguration.html" target="_self">GemmArrayConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
<tr id="row_0_5_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmBatchedConfiguration.html" target="_self">GemmBatchedConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
<tr id="row_0_5_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmConfiguration.html" target="_self">GemmConfiguration</a></td><td class="desc">Configuration for basic GEMM operations </td></tr>
<tr id="row_0_5_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmDescription.html" target="_self">GemmDescription</a></td><td class="desc">Description of all GEMM computations </td></tr>
<tr id="row_0_5_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexBatchedConfiguration.html" target="_self">GemmPlanarComplexBatchedConfiguration</a></td><td class="desc">Batched complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
<tr id="row_0_5_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexConfiguration.html" target="_self">GemmPlanarComplexConfiguration</a></td><td class="desc">Complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
<tr id="row_0_5_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Manifest.html" target="_self">Manifest</a></td><td class="desc"><a class="el" href="classcutlass_1_1library_1_1Manifest.html" title="Manifest of CUTLASS Library. ">Manifest</a> of CUTLASS Library </td></tr>
<tr id="row_0_5_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1MathInstructionDescription.html" target="_self">MathInstructionDescription</a></td><td class="desc"></td></tr>
<tr id="row_0_5_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Operation.html" target="_self">Operation</a></td><td class="desc">Base class for all device-wide operations </td></tr>
<tr id="row_0_5_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1OperationDescription.html" target="_self">OperationDescription</a></td><td class="desc">High-level description of an operation </td></tr>
<tr id="row_0_5_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TensorDescription.html" target="_self">TensorDescription</a></td><td class="desc">Structure describing the properties of a tensor </td></tr>
<tr id="row_0_5_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TileDescription.html" target="_self">TileDescription</a></td><td class="desc">Structure describing the tiled structure of a GEMM-like computation </td></tr>
<tr id="row_0_6_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_6_" class="arrow" onclick="toggleFolder('0_6_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1platform.html" target="_self">platform</a></td><td class="desc"></td></tr>
<tr id="row_0_6_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__chunk.html" target="_self">aligned_chunk</a></td><td class="desc"></td></tr>
<tr id="row_0_6_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__storage.html" target="_self">aligned_storage</a></td><td class="desc">Std::aligned_storage </td></tr>
<tr id="row_0_6_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_2_" class="arrow" onclick="toggleFolder('0_6_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of.html" target="_self">alignment_of</a></td><td class="desc">Std::alignment_of </td></tr>
<tr id="row_0_6_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_1_1pad.html" target="_self">pad</a></td><td class="desc"></td></tr>
<tr id="row_0_6_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01value__t_01_4.html" target="_self">alignment_of&lt; const value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; const volatile value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double2_01_4.html" target="_self">alignment_of&lt; double2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double4_01_4.html" target="_self">alignment_of&lt; double4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01float4_01_4.html" target="_self">alignment_of&lt; float4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01int4_01_4.html" target="_self">alignment_of&lt; int4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01long4_01_4.html" target="_self">alignment_of&lt; long4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong2_01_4.html" target="_self">alignment_of&lt; longlong2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong4_01_4.html" target="_self">alignment_of&lt; longlong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01uint4_01_4.html" target="_self">alignment_of&lt; uint4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulong4_01_4.html" target="_self">alignment_of&lt; ulong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong2_01_4.html" target="_self">alignment_of&lt; ulonglong2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong4_01_4.html" target="_self">alignment_of&lt; ulonglong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; volatile value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1bool__constant.html" target="_self">bool_constant</a></td><td class="desc">Std::bool_constant </td></tr>
<tr id="row_0_6_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional.html" target="_self">conditional</a></td><td class="desc">Std::conditional (true specialization) </td></tr>
<tr id="row_0_6_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional_3_01false_00_01T_00_01F_01_4.html" target="_self">conditional&lt; false, T, F &gt;</a></td><td class="desc">Std::conditional (false specialization) </td></tr>
<tr id="row_0_6_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete.html" target="_self">default_delete</a></td><td class="desc">Default deleter </td></tr>
<tr id="row_0_6_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete_3_01T[]_4.html" target="_self">default_delete&lt; T[]&gt;</a></td><td class="desc">Partial specialization for deleting array types </td></tr>
<tr id="row_0_6_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if.html" target="_self">enable_if</a></td><td class="desc">Std::enable_if (true specialization) </td></tr>
<tr id="row_0_6_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if_3_01false_00_01T_01_4.html" target="_self">enable_if&lt; false, T &gt;</a></td><td class="desc">Std::enable_if (false specialization) </td></tr>
<tr id="row_0_6_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1integral__constant.html" target="_self">integral_constant</a></td><td class="desc">Std::integral_constant </td></tr>
<tr id="row_0_6_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__arithmetic.html" target="_self">is_arithmetic</a></td><td class="desc">Std::is_arithmetic </td></tr>
<tr id="row_0_6_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of.html" target="_self">is_base_of</a></td><td class="desc">Std::is_base_of </td></tr>
<tr id="row_0_6_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_27_" class="arrow" onclick="toggleFolder('0_6_27_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper.html" target="_self">is_base_of_helper</a></td><td class="desc">Helper for std::is_base_of </td></tr>
<tr id="row_0_6_27_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper_1_1dummy.html" target="_self">dummy</a></td><td class="desc"></td></tr>
<tr id="row_0_6_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__floating__point.html" target="_self">is_floating_point</a></td><td class="desc">Std::is_floating_point </td></tr>
<tr id="row_0_6_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__fundamental.html" target="_self">is_fundamental</a></td><td class="desc">Std::is_fundamental </td></tr>
<tr id="row_0_6_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral.html" target="_self">is_integral</a></td><td class="desc">Std::is_integral </td></tr>
<tr id="row_0_6_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01char_01_4.html" target="_self">is_integral&lt; char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01T_01_4.html" target="_self">is_integral&lt; const T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01volatile_01T_01_4.html" target="_self">is_integral&lt; const volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01int_01_4.html" target="_self">is_integral&lt; int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01_4.html" target="_self">is_integral&lt; long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01long_01_4.html" target="_self">is_integral&lt; long long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01short_01_4.html" target="_self">is_integral&lt; short &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01signed_01char_01_4.html" target="_self">is_integral&lt; signed char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01char_01_4.html" target="_self">is_integral&lt; unsigned char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01int_01_4.html" target="_self">is_integral&lt; unsigned int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01_4.html" target="_self">is_integral&lt; unsigned long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01long_01_4.html" target="_self">is_integral&lt; unsigned long long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01short_01_4.html" target="_self">is_integral&lt; unsigned short &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01volatile_01T_01_4.html" target="_self">is_integral&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer.html" target="_self">is_pointer</a></td><td class="desc">Std::is_pointer </td></tr>
<tr id="row_0_6_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper.html" target="_self">is_pointer_helper</a></td><td class="desc">Helper for std::is_pointer (false specialization) </td></tr>
<tr id="row_0_6_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper_3_01T_01_5_01_4.html" target="_self">is_pointer_helper&lt; T * &gt;</a></td><td class="desc">Helper for std::is_pointer (true specialization) </td></tr>
<tr id="row_0_6_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same.html" target="_self">is_same</a></td><td class="desc">Std::is_same (false specialization) </td></tr>
<tr id="row_0_6_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same_3_01A_00_01A_01_4.html" target="_self">is_same&lt; A, A &gt;</a></td><td class="desc">Std::is_same (true specialization) </td></tr>
<tr id="row_0_6_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__trivially__copyable.html" target="_self">is_trivially_copyable</a></td><td class="desc"></td></tr>
<tr id="row_0_6_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__void.html" target="_self">is_void</a></td><td class="desc">Std::is_void </td></tr>
<tr id="row_0_6_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile.html" target="_self">is_volatile</a></td><td class="desc">Std::is_volatile </td></tr>
<tr id="row_0_6_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile_3_01volatile_01T_01_4.html" target="_self">is_volatile&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1nullptr__t.html" target="_self">nullptr_t</a></td><td class="desc">Std::nullptr_t </td></tr>
<tr id="row_0_6_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const.html" target="_self">remove_const</a></td><td class="desc">Std::remove_const (non-const specialization) </td></tr>
<tr id="row_0_6_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const_3_01const_01T_01_4.html" target="_self">remove_const&lt; const T &gt;</a></td><td class="desc">Std::remove_const (const specialization) </td></tr>
<tr id="row_0_6_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__cv.html" target="_self">remove_cv</a></td><td class="desc">Std::remove_cv </td></tr>
<tr id="row_0_6_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile.html" target="_self">remove_volatile</a></td><td class="desc">Std::remove_volatile (non-volatile specialization) </td></tr>
<tr id="row_0_6_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile_3_01volatile_01T_01_4.html" target="_self">remove_volatile&lt; volatile T &gt;</a></td><td class="desc">Std::remove_volatile (volatile specialization) </td></tr>
<tr id="row_0_6_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1platform_1_1unique__ptr.html" target="_self">unique_ptr</a></td><td class="desc">Std::unique_ptr </td></tr>
<tr id="row_0_7_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_7_" class="arrow" onclick="toggleFolder('0_7_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction.html" target="_self">reduction</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_0_" class="arrow" onclick="toggleFolder('0_7_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_0_0_" class="arrow" onclick="toggleFolder('0_7_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK.html" target="_self">ReduceSplitK</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" target="_self">Params</a></td><td class="desc"><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" title="Params structure. ">Params</a> structure </td></tr>
<tr id="row_0_7_0_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc"></td></tr>
<tr id="row_0_7_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_1_" class="arrow" onclick="toggleFolder('0_7_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_7_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" target="_self">Reduce</a></td><td class="desc">Structure to compute the thread level reduction </td></tr>
<tr id="row_0_7_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01AlignedArray_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, AlignedArray&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specializations of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for AlignedArray&lt;half_t, N&gt; </td></tr>
<tr id="row_0_7_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specializations of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;half_t, N&gt; </td></tr>
<tr id="row_0_7_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;T, N&gt; </td></tr>
<tr id="row_0_7_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01T_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, T &gt;</a></td><td class="desc">Partial Specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for "plus" (a functional operator) </td></tr>
<tr id="row_0_7_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_1_5_" class="arrow" onclick="toggleFolder('0_7_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd.html" target="_self">ReduceAdd</a></td><td class="desc">Mixed-precision reduction </td></tr>
<tr id="row_0_7_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_7_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReduction.html" target="_self">BatchedReduction</a></td><td class="desc"></td></tr>
<tr id="row_0_7_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_3_" class="arrow" onclick="toggleFolder('0_7_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits.html" target="_self">BatchedReductionTraits</a></td><td class="desc"></td></tr>
<tr id="row_0_7_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_7_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1DefaultBlockSwizzle.html" target="_self">DefaultBlockSwizzle</a></td><td class="desc"></td></tr>
<tr id="row_0_8_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_8_" class="arrow" onclick="toggleFolder('0_8_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference.html" target="_self">reference</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_0_" class="arrow" onclick="toggleFolder('0_8_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast.html" target="_self">Cast</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01int8__t_01_4.html" target="_self">Cast&lt; float, int8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01uint8__t_01_4.html" target="_self">Cast&lt; float, uint8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_1_" class="arrow" onclick="toggleFolder('0_8_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_" class="arrow" onclick="toggleFolder('0_8_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_0_" class="arrow" onclick="toggleFolder('0_8_1_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_1_" class="arrow" onclick="toggleFolder('0_8_1_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_1_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_2_" class="arrow" onclick="toggleFolder('0_8_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc.html" target="_self">TensorCopyDiagonalInFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_3_" class="arrow" onclick="toggleFolder('0_8_1_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc.html" target="_self">TensorCopyDiagonalOutFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_3_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_4_" class="arrow" onclick="toggleFolder('0_8_1_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_4_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_5_" class="arrow" onclick="toggleFolder('0_8_1_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_5_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_6_" class="arrow" onclick="toggleFolder('0_8_1_0_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc.html" target="_self">TensorFillRandomGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_6_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_7_" class="arrow" onclick="toggleFolder('0_8_1_0_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_7_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_8_" class="arrow" onclick="toggleFolder('0_8_1_0_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc.html" target="_self">TensorUpdateDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_8_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_9_" class="arrow" onclick="toggleFolder('0_8_1_0_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_9_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_" class="arrow" onclick="toggleFolder('0_8_1_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_0_" class="arrow" onclick="toggleFolder('0_8_1_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
<tr id="row_0_8_1_1_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_1_1_0_1_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_2_" class="arrow" onclick="toggleFolder('0_8_1_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1thread_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc">Thread-level blocked general matrix product </td></tr>
<tr id="row_0_8_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout4e016ab7cfc644acd7cb4ae770339773.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
<tr id="row_0_8_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout30b72addd464a2ca4a26785cbfd77a8e.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
<tr id="row_0_8_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout660562b232f408218828ca5915b7e73a.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc &gt;</a></td><td class="desc">Parital specialization for XOR-popc </td></tr>
<tr id="row_0_8_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorDiagonalForEach.html" target="_self">TensorDiagonalForEach</a></td><td class="desc">Launches a kernel calling a functor for each element along a tensor's diagonal </td></tr>
<tr id="row_0_8_1_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorForEach.html" target="_self">TensorForEach</a></td><td class="desc">Launches a kernel calling a functor for each element in a tensor's index space </td></tr>
<tr id="row_0_8_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_2_" class="arrow" onclick="toggleFolder('0_8_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host.html" target="_self">host</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_2_0_" class="arrow" onclick="toggleFolder('0_8_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
<tr id="row_0_8_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomGaussianFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
<tr id="row_0_8_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomUniformFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
<tr id="row_0_8_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorContainsFunc.html" target="_self">TensorContainsFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorCopyIf.html" target="_self">TensorCopyIf</a></td><td class="desc">Helper to conditionally copy between tensor views </td></tr>
<tr id="row_0_8_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorEqualsFunc.html" target="_self">TensorEqualsFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillFunc.html" target="_self">TensorFillFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillGaussianFunc.html" target="_self">TensorFillGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_2_0_12_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_2_0_13_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_2_0_14_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFuncBinaryOp.html" target="_self">TensorFuncBinaryOp</a></td><td class="desc">Helper to apply a binary operator in place </td></tr>
<tr id="row_0_8_2_0_15_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_16_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TrivialConvert.html" target="_self">TrivialConvert</a></td><td class="desc">Helper to convert between types </td></tr>
<tr id="row_0_8_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_193dd3a37f00deff1e5dcd7c310afb1f.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
<tr id="row_0_8_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_55729eac7dbd6bf311ea36f680e83e93.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
<tr id="row_0_8_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_4f3f32c4b336238abfd741e87bfced46.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc &gt;</a></td><td class="desc">Parital specialization for XOR-popc </td></tr>
<tr id="row_0_9_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_9_" class="arrow" onclick="toggleFolder('0_9_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_9_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1thread_1_1Matrix.html" target="_self">Matrix</a></td><td class="desc">Per-thread matrix object storing a packed matrix </td></tr>
<tr id="row_0_10_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_10_" class="arrow" onclick="toggleFolder('0_10_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform.html" target="_self">transform</a></td><td class="desc"></td></tr>
<tr id="row_0_10_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_0_" class="arrow" onclick="toggleFolder('0_10_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_10_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1thread_1_1Transpose.html" target="_self">Transpose</a></td><td class="desc">Transforms a fragment by doing a transpose </td></tr>
<tr id="row_0_10_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1thread_1_1Transpose_3_01ElementCount___00_01layout_1_1PitchLinearS99f8e05faf0bb5ed48a0154afe740d81.html" target="_self">Transpose&lt; ElementCount_, layout::PitchLinearShape&lt; 4, 4 &gt;, int8_t &gt;</a></td><td class="desc">Specialization for int8_t 4x4 transpose </td></tr>
<tr id="row_0_10_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_1_" class="arrow" onclick="toggleFolder('0_10_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator.html" target="_self">PredicatedTileAccessIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile.html" target="_self">PredicatedTileAccessIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_2_" class="arrow" onclick="toggleFolder('0_10_1_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__da632779aba661c0f4cfaaa78126b771.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__18e9cf25bb3b8edfaad595241a6dc2d7.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_3_" class="arrow" onclick="toggleFolder('0_10_1_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__1790abaa54a01f277d75766d5882fec8.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__8ccc62d47a092afc8bee32ffe9d1e4ba.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_4_" class="arrow" onclick="toggleFolder('0_10_1_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__7327fa15996bcb8502cdfcc192350fe1.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__a56cbccec33ee916292ad9d068474609.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_5_" class="arrow" onclick="toggleFolder('0_10_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen89c687c583745a73cb485041911a4c4e.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenc07b5ec72f83e782121ac629288d61fe.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_6_" class="arrow" onclick="toggleFolder('0_10_1_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenab63a1e105bf37f6371516cb9e2c5a7a.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_6_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemena9b06926a275b569ee9f7f142604b997.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_7_" class="arrow" onclick="toggleFolder('0_10_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen784a0e9da3f55064c47e5613791f51f7.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen41e459f664d17473570cf22fb616845f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_8_" class="arrow" onclick="toggleFolder('0_10_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen9838736ad62fae54213fbaf722a989ab.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen44ce348364e78f5a56fa0c2cef6af930.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_9_" class="arrow" onclick="toggleFolder('0_10_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen809793e785fb4211888c6b4e5dcfcb39.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen058417e2cdd86f3cd6ad5458581571c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile.html" target="_self">PredicatedTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_12_" class="arrow" onclick="toggleFolder('0_10_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0165b39a630d10785a3558406f9adb99b9.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_01e11ed7192af5d7ad1bce5641fa13112e.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_13_" class="arrow" onclick="toggleFolder('0_10_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_017a517f3c73efd795ab05059cc9b111e1.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0b878062cc0cd214bf7e17d74ff17e246.html" target="_self">AccessType</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_13_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0145ef045e8f7d57dc718098adcb00cf3d.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_14_" class="arrow" onclick="toggleFolder('0_10_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_013671177d6219bfeb0e1b4dc4c1b5bf11.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0102e766863c6ac9ec2063a02c4803eecb.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_15_" class="arrow" onclick="toggleFolder('0_10_1_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0068b3e874b5d93d11f0fa902c7f1d11d9.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00a6b756b1bcfbb35fe4a3e68ff074e380.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_16_" class="arrow" onclick="toggleFolder('0_10_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00f6b3a9dfab5e7c72d5233f7e5e6e3b9b.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00ebd1a63351e1085d0b718582ec7b06c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_17_" class="arrow" onclick="toggleFolder('0_10_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00e7c2c404e7aedfe60ad56bb5571306a1.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___006a5f2f7a8271031e6cdc5daa5441f2af.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_18_" class="arrow" onclick="toggleFolder('0_10_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0041ea81994f8af0d4d071fdb9e66b5ff0.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___004d0f9b5e19c29acc17bcdc360dafebbd.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_19_" class="arrow" onclick="toggleFolder('0_10_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00d670f969180a8d182dffb356ebcc957e.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___009fd89f6dad84238fd7d63df0a0c0364f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator.html" target="_self">RegularTileAccessIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__eb7d20f8b9d69e0ae5e7ef51dc480867.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__2c1476eaf582bfe972793e17babfe985.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__a3c11cf1f00ef7a1efb8389ac6e4c6e0.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0855e9d9ab619202d2397180c1e4c4a5.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__f04332958a49a47d6fb2b25201764630.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__6baada077236f1a368c61c5e11b45b72.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0184b7188941788a96624510a4b2f876.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_28_" class="arrow" onclick="toggleFolder('0_10_1_28_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__ebf4714349612673e8b6609b763eeb6f.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_28_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_0a9491607d11be8e1780e79ad711aa42.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_29_" class="arrow" onclick="toggleFolder('0_10_1_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__e9a9e0f4286f652f55eb9b863b21effe.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_3be8b96d170d886f39b6b30acab65e7a.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator.html" target="_self">RegularTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile.html" target="_self">RegularTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Eleb60d066756d1c18f05fceee6a27bdb8a.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele76ed82829532ae1c17f4c78158f036c7.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele654c8f6161ae5340f040397a4e2e045c.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_011d3637dbd8bc58bcb020b51bf57fbfc0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_36_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_017982f81d4ef592e19c8427de2ea933a3.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_37_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_010889a732373c350de9b9a9f6c13cd761.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_38_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01187f8574e1fe9d7d5e8fbf09bd834bf0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_39_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01793f74bfd8f116a827948ab01a37349a.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_40_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01bd31b3810c1fedf2e7e5959ff92b5d3d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kRow &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_41_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0184a89653916f5d51ab59d1b386989a17.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_42_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0149454d361ea5885cf5166a920b5145df.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_43_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01c20d35180520077a5a09b1e33543c1a5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_44_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a31b454d9c930525c1e9ca406a514f40.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_45_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0104ad31bd559a88cc418ae1cab7492ed5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_46_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f6f6511b5033cad31083644ac69c54d8.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_47_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01b3fa5720e807697de61b9f937b269cd0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kColumn &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_48_" class="arrow" onclick="toggleFolder('0_10_1_48_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01efd5013a2503d6567e2bf6b40c97360c.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_48_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_052caec9d5bceeb59b9a13cb3338ce64d.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_49_" class="arrow" onclick="toggleFolder('0_10_1_49_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0197fef2242a3454a7d1cebe61aee28b43.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_49_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_039093927f4b1ee61538c569bf1ae4efd.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_50_" class="arrow" onclick="toggleFolder('0_10_1_50_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a75d2cd74e722d6ad6a3b41aabfd432d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_50_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_02d305cfb0b55c6fb236a52cf2240651e.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_51_" class="arrow" onclick="toggleFolder('0_10_1_51_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f96bbeb63e6d4ce4a2551279de3a9f0e.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_51_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_032f88d1be8b209e44a4815c707ba35bb.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_52_" class="arrow" onclick="toggleFolder('0_10_1_52_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01dbd6b8468d5bd787308d2f615a24d123.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kContiguous &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_52_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0390833403016f5d817416e20828845df.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_3_" class="arrow" onclick="toggleFolder('0_10_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread0082c3467229b12cc9dd996283ee7160.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap&lt; Shape_, Threads, cutlass::layout::PitchLinearShape&lt; 4, 4 &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread896c01a3c466da1bf392e0cdfced4d53.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
<tr id="row_0_10_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_4_" class="arrow" onclick="toggleFolder('0_10_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap.html" target="_self">PitchLinearStripminedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
<tr id="row_0_10_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadContiguous.html" target="_self">PitchLinearTilePolicyStripminedThreadContiguous</a></td><td class="desc"></td></tr>
<tr id="row_0_10_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadStrided.html" target="_self">PitchLinearTilePolicyStripminedThreadStrided</a></td><td class="desc"></td></tr>
<tr id="row_0_10_7_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_7_" class="arrow" onclick="toggleFolder('0_10_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap.html" target="_self">PitchLinearWarpRakedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_7_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_8_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_8_" class="arrow" onclick="toggleFolder('0_10_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap.html" target="_self">PitchLinearWarpStripedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_8_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_9_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_9_" class="arrow" onclick="toggleFolder('0_10_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap.html" target="_self">TransposePitchLinearThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_9_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap2DThreadTile.html" target="_self">TransposePitchLinearThreadMap2DThreadTile</a></td><td class="desc">Thread Mapping a 2D threadtiled mapping as a tranposed Pitchlinear2DThreadTile mapping </td></tr>
<tr id="row_0_10_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMapSimt.html" target="_self">TransposePitchLinearThreadMapSimt</a></td><td class="desc"></td></tr>
<tr id="row_0_11_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1AlignedArray.html" target="_self">AlignedArray</a></td><td class="desc">Aligned array type </td></tr>
<tr id="row_0_12_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1AlignedBuffer.html" target="_self">AlignedBuffer</a></td><td class="desc">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment </td></tr>
<tr id="row_0_13_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_13_" class="arrow" onclick="toggleFolder('0_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html" target="_self">Array&lt; T, N, false &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_13_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_13_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html" target="_self">const_reference</a></td><td class="desc">Reference object extracts sub-byte items </td></tr>
<tr id="row_0_13_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_13_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_13_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reference.html" target="_self">reference</a></td><td class="desc">Reference object inserts or extracts sub-byte items </td></tr>
<tr id="row_0_13_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_14_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_14_" class="arrow" onclick="toggleFolder('0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4.html" target="_self">Array&lt; T, N, true &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_14_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_14_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_14_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_14_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_15_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1CommandLine.html" target="_self">CommandLine</a></td><td class="desc"></td></tr>
<tr id="row_0_16_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1complex.html" target="_self">complex</a></td><td class="desc"></td></tr>
<tr id="row_0_17_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1ConstSubbyteReference.html" target="_self">ConstSubbyteReference</a></td><td class="desc"></td></tr>
<tr id="row_0_18_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Coord.html" target="_self">Coord</a></td><td class="desc">Statically-sized array specifying Coords within a tensor </td></tr>
<tr id="row_0_19_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1cuda__exception.html" target="_self">cuda_exception</a></td><td class="desc">C++ exception wrapper for CUDA <code>cudaError_t</code> </td></tr>
<tr id="row_0_20_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Distribution.html" target="_self">Distribution</a></td><td class="desc"><a class="el" href="structcutlass_1_1Distribution.html" title="Distribution type. ">Distribution</a> type </td></tr>
<tr id="row_0_21_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divide__assert.html" target="_self">divide_assert</a></td><td class="desc"></td></tr>
<tr id="row_0_22_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides.html" target="_self">divides</a></td><td class="desc"></td></tr>
<tr id="row_0_23_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_24_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_25_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType.html" target="_self">FloatType</a></td><td class="desc">Defines a floating-point type based on the number of exponent and mantissa bits </td></tr>
<tr id="row_0_26_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_0111_00_0152_01_4.html" target="_self">FloatType&lt; 11, 52 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_015_00_0110_01_4.html" target="_self">FloatType&lt; 5, 10 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_28_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_018_00_0123_01_4.html" target="_self">FloatType&lt; 8, 23 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_29_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1half__t.html" target="_self">half_t</a></td><td class="desc">IEEE half-precision floating-point type </td></tr>
<tr id="row_0_30_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1HostTensor.html" target="_self">HostTensor</a></td><td class="desc">Host tensor </td></tr>
<tr id="row_0_31_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1IdentityTensorLayout.html" target="_self">IdentityTensorLayout</a></td><td class="desc"></td></tr>
<tr id="row_0_32_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1integer__subbyte.html" target="_self">integer_subbyte</a></td><td class="desc">4-bit signed integer type </td></tr>
<tr id="row_0_33_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType.html" target="_self">IntegerType</a></td><td class="desc">Defines integers based on size and whether they are signed </td></tr>
<tr id="row_0_34_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01false_01_4.html" target="_self">IntegerType&lt; 1, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_35_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01true_01_4.html" target="_self">IntegerType&lt; 1, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_36_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01false_01_4.html" target="_self">IntegerType&lt; 16, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_37_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01true_01_4.html" target="_self">IntegerType&lt; 16, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_38_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01false_01_4.html" target="_self">IntegerType&lt; 32, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_39_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01true_01_4.html" target="_self">IntegerType&lt; 32, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_40_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01false_01_4.html" target="_self">IntegerType&lt; 4, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_41_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01true_01_4.html" target="_self">IntegerType&lt; 4, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_42_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01false_01_4.html" target="_self">IntegerType&lt; 64, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_43_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01true_01_4.html" target="_self">IntegerType&lt; 64, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_44_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01false_01_4.html" target="_self">IntegerType&lt; 8, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_45_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01true_01_4.html" target="_self">IntegerType&lt; 8, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_46_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1is__pow2.html" target="_self">is_pow2</a></td><td class="desc"></td></tr>
<tr id="row_0_47_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1KernelLaunchConfiguration.html" target="_self">KernelLaunchConfiguration</a></td><td class="desc">Structure containing the basic launch configuration of a CUDA kernel </td></tr>
<tr id="row_0_48_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down.html" target="_self">log2_down</a></td><td class="desc"></td></tr>
<tr id="row_0_49_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_down&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_50_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up.html" target="_self">log2_up</a></td><td class="desc"></td></tr>
<tr id="row_0_51_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_up&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_52_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixCoord.html" target="_self">MatrixCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_53_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixShape.html" target="_self">MatrixShape</a></td><td class="desc">Describes the size of a matrix tile </td></tr>
<tr id="row_0_54_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Max.html" target="_self">Max</a></td><td class="desc"></td></tr>
<tr id="row_0_55_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum.html" target="_self">maximum</a></td><td class="desc"></td></tr>
<tr id="row_0_56_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">maximum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_57_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01float_01_4.html" target="_self">maximum&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_58_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Min.html" target="_self">Min</a></td><td class="desc"></td></tr>
<tr id="row_0_59_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum.html" target="_self">minimum</a></td><td class="desc"></td></tr>
<tr id="row_0_60_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minimum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_61_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01float_01_4.html" target="_self">minimum&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_62_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus.html" target="_self">minus</a></td><td class="desc"></td></tr>
<tr id="row_0_63_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_64_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_65_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies.html" target="_self">multiplies</a></td><td class="desc"></td></tr>
<tr id="row_0_66_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_67_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_68_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add.html" target="_self">multiply_add</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_69_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01half__t_00_01N_01_4_00_01Array_3_01half__t_00_01N_01adaeadb27c0e4439444709c0eb30963.html" target="_self">multiply_add&lt; Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_70_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiply_add&lt; Array&lt; T, N &gt;, Array&lt; T, N &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_71_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_72_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01T_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, T, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_73_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01T_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; T, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_74_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate.html" target="_self">negate</a></td><td class="desc"></td></tr>
<tr id="row_0_75_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_76_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_77_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter.html" target="_self">NumericArrayConverter</a></td><td class="desc">Conversion operator for Array </td></tr>
<tr id="row_0_78_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_012_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, 2, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;float, 2&gt; &lt;= Array&lt;half_t, 2&gt;, round to nearest </td></tr>
<tr id="row_0_79_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;half&gt; &lt;= Array&lt;float&gt; </td></tr>
<tr id="row_0_80_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_012_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, 2, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Partial specialization for Array&lt;half, 2&gt; &lt;= Array&lt;float, 2&gt;, round to nearest </td></tr>
<tr id="row_0_81_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;half&gt; &lt;= Array&lt;float&gt; </td></tr>
<tr id="row_0_82_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter.html" target="_self">NumericConverter</a></td><td class="desc"></td></tr>
<tr id="row_0_83_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01float_00_01half__t_00_01Round_01_4.html" target="_self">NumericConverter&lt; float, half_t, Round &gt;</a></td><td class="desc">Partial specialization for float &lt;= <a class="el" href="structcutlass_1_1half__t.html" title="IEEE half-precision floating-point type. ">half_t</a> </td></tr>
<tr id="row_0_84_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Specialization for round-to-nearest </td></tr>
<tr id="row_0_85_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__toward__zero_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_toward_zero &gt;</a></td><td class="desc">Specialization for round-toward-zero </td></tr>
<tr id="row_0_86_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01int8__t_00_01float_00_01Round_01_4.html" target="_self">NumericConverter&lt; int8_t, float, Round &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_87_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01T_00_01T_00_01Round_01_4.html" target="_self">NumericConverter&lt; T, T, Round &gt;</a></td><td class="desc">Partial specialization for float &lt;= <a class="el" href="structcutlass_1_1half__t.html" title="IEEE half-precision floating-point type. ">half_t</a> </td></tr>
<tr id="row_0_88_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverterClamp.html" target="_self">NumericConverterClamp</a></td><td class="desc"></td></tr>
<tr id="row_0_89_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus.html" target="_self">plus</a></td><td class="desc"></td></tr>
<tr id="row_0_90_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_91_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_92_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_92_" class="arrow" onclick="toggleFolder('0_92_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector.html" target="_self">PredicateVector</a></td><td class="desc">Statically sized array of bits implementing </td></tr>
<tr id="row_0_92_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1ConstIterator.html" target="_self">ConstIterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read and write access to predicates </td></tr>
<tr id="row_0_92_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" target="_self">Iterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read and write access to predicates </td></tr>
<tr id="row_0_92_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector_1_1TrivialIterator.html" target="_self">TrivialIterator</a></td><td class="desc"><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" title="An iterator implementing Predicate Iterator Concept enabling sequential read and write access to pred...">Iterator</a> that always returns true </td></tr>
<tr id="row_0_93_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType.html" target="_self">RealType</a></td><td class="desc">Used to determine the real-valued underlying type of a numeric type T </td></tr>
<tr id="row_0_94_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType_3_01complex_3_01T_01_4_01_4.html" target="_self">RealType&lt; complex&lt; T &gt; &gt;</a></td><td class="desc">Partial specialization for complex-valued type </td></tr>
<tr id="row_0_95_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory.html" target="_self">ReferenceFactory</a></td><td class="desc"></td></tr>
<tr id="row_0_96_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01false_01_4.html" target="_self">ReferenceFactory&lt; Element, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_97_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01true_01_4.html" target="_self">ReferenceFactory&lt; Element, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_98_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ScalarIO.html" target="_self">ScalarIO</a></td><td class="desc">Helper to enable formatted printing of CUTLASS scalar types to an ostream </td></tr>
<tr id="row_0_99_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Semaphore.html" target="_self">Semaphore</a></td><td class="desc">CTA-wide semaphore for inter-CTA synchronization </td></tr>
<tr id="row_0_100_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits.html" target="_self">sizeof_bits</a></td><td class="desc">Defines the size of an element in bits </td></tr>
<tr id="row_0_101_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01Array_3_01T_00_01N_00_01RegisterSized_01_4_01_4.html" target="_self">sizeof_bits&lt; Array&lt; T, N, RegisterSized &gt; &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_102_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01bin1__t_01_4.html" target="_self">sizeof_bits&lt; bin1_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for bin1_t </td></tr>
<tr id="row_0_103_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01int4b__t_01_4.html" target="_self">sizeof_bits&lt; int4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for int4b_t </td></tr>
<tr id="row_0_104_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint1b__t_01_4.html" target="_self">sizeof_bits&lt; uint1b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint1b_t </td></tr>
<tr id="row_0_105_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint4b__t_01_4.html" target="_self">sizeof_bits&lt; uint4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint4b_t </td></tr>
<tr id="row_0_106_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sqrt__est.html" target="_self">sqrt_est</a></td><td class="desc"></td></tr>
<tr id="row_0_107_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1SubbyteReference.html" target="_self">SubbyteReference</a></td><td class="desc"></td></tr>
<tr id="row_0_108_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Tensor4DCoord.html" target="_self">Tensor4DCoord</a></td><td class="desc">Defines a canonical 4D coordinate used by tensor operations </td></tr>
<tr id="row_0_109_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorRef.html" target="_self">TensorRef</a></td><td class="desc"></td></tr>
<tr id="row_0_110_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorView.html" target="_self">TensorView</a></td><td class="desc"></td></tr>
<tr id="row_0_111_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits.html" target="_self">TypeTraits</a></td><td class="desc"></td></tr>
<tr id="row_0_112_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_112_" class="arrow" onclick="toggleFolder('0_112_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; double &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_112_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1integer__type.html" target="_self">integer_type</a></td><td class="desc"></td></tr>
<tr id="row_0_112_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1unsigned__type.html" target="_self">unsigned_type</a></td><td class="desc"></td></tr>
<tr id="row_0_113_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01float_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; float &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_114_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_115_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half__t_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half_t &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_116_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01double_01_4.html" target="_self">TypeTraits&lt; double &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_117_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01float_01_4.html" target="_self">TypeTraits&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_118_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01half__t_01_4.html" target="_self">TypeTraits&lt; half_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_119_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int_01_4.html" target="_self">TypeTraits&lt; int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_120_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int64__t_01_4.html" target="_self">TypeTraits&lt; int64_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_121_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int8__t_01_4.html" target="_self">TypeTraits&lt; int8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_122_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint64__t_01_4.html" target="_self">TypeTraits&lt; uint64_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_123_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint8__t_01_4.html" target="_self">TypeTraits&lt; uint8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_124_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01unsigned_01_4.html" target="_self">TypeTraits&lt; unsigned &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_125_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1xor__add.html" target="_self">xor_add</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_1_"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_1_" class="arrow" onclick="toggleFolder('1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><b>std</b></td><td class="desc">STL namespace </td></tr>
<tr id="row_1_0_" class="even" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structstd_1_1numeric__limits_3_01cutlass_1_1half__t_01_4.html" target="_self">numeric_limits&lt; cutlass::half_t &gt;</a></td><td class="desc">Numeric limits </td></tr>
<tr id="row_2_" class="even"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugType.html" target="_self">DebugType</a></td><td class="desc"></td></tr>
<tr id="row_3_"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugValue.html" target="_self">DebugValue</a></td><td class="desc"></td></tr>
</table>
</div><!-- directory -->
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
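
The class index above enumerates CUTLASS's core utility templates (Array, NumericConverter, the elementwise functors, and so on). As an illustrative sketch only, and assuming the CUTLASS 2.x headers cutlass/array.h, cutlass/functional.h, cutlass/numeric_conversion.h, and cutlass/numeric_types.h, a few of these types compose as follows:

#include "cutlass/array.h"
#include "cutlass/functional.h"
#include "cutlass/numeric_conversion.h"
#include "cutlass/numeric_types.h"

int main() {
  // Scalar conversion float -> half_t with round-to-nearest (NumericConverter).
  cutlass::NumericConverter<
      cutlass::half_t, float,
      cutlass::FloatRoundStyle::round_to_nearest> to_half;
  cutlass::half_t h = to_half(3.14159f);

  // Statically sized register array (Array<T, N>) combined with an
  // elementwise functor (plus<Array<T, N>>).
  cutlass::Array<float, 4> a, b;
  a.fill(1.0f);
  b.fill(float(h));

  cutlass::plus<cutlass::Array<float, 4>> add;
  cutlass::Array<float, 4> c = add(a, b);

  return c[0] > 0.0f ? 0 : 1;
}

The other functor specializations listed above (multiplies&lt; Array&lt; T, N &gt; &gt;, multiply_add, minimum, maximum) follow the same call pattern over Array operands.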

docs/arch_2mma_8h.html Normal file

@@ -0,0 +1,156 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Templates exposing architecture support for multiply-add operations.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="numeric__types_8h_source.html">cutlass/numeric_types.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm50_8h_source.html">cutlass/arch/mma_sm50.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm60_8h_source.html">cutlass/arch/mma_sm60.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm61_8h_source.html">cutlass/arch/mma_sm61.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="mma__sm70_8h_source.html">cutlass/arch/mma_sm70.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="mma__sm75_8h_source.html">cutlass/arch/mma_sm75.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma_8h__incl.png" border="0" usemap="#mma_8h" alt=""/></div>
<map name="mma_8h" id="mma_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma_8h__dep__incl.png" border="0" usemap="#mma_8hdep" alt=""/></div>
<map name="mma_8hdep" id="mma_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma.html">cutlass::arch::Mma&lt; Shape_, kThreads_, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
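
The arch/mma.h reference above lists the generic cutlass::arch::Mma template and its scalar GemmShape&lt;1, 1, 1&gt;, single-thread specialization. A minimal sketch of invoking that specialization, assuming the Array-based operator() signature documented for the SM50 specializations in mma_sm50.h (the function name here is hypothetical):

#include "cutlass/arch/mma.h"
#include "cutlass/array.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"

// Scalar multiply-accumulate through the 1x1x1, single-thread Mma
// specialization: d = a * b + c, one element per operand.
CUTLASS_HOST_DEVICE
float scalar_mma_example() {
  using Mma = cutlass::arch::Mma<
      cutlass::gemm::GemmShape<1, 1, 1>, 1,
      float, cutlass::layout::RowMajor,
      float, cutlass::layout::ColumnMajor,
      float, cutlass::layout::RowMajor,
      cutlass::arch::OpMultiplyAdd>;

  cutlass::Array<float, 1> d, a, b, c;
  a[0] = 2.0f;
  b[0] = 3.0f;
  c[0] = 1.0f;

  Mma mma;
  mma(d, a, b, c);   // d[0] == 7.0f

  return d[0];
}

The architecture-specific headers pulled in above (mma_sm60.h through mma_sm75.h) expose the same template interface for larger shapes and half/integer operand types.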


@@ -0,0 +1 @@
7d16b59e6ba0442b8a275a213d5da3a6


@@ -0,0 +1 @@
d1fff3f9d55a262110aa6a456caa91e0

File diff suppressed because one or more lines are too long


@@ -0,0 +1,176 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma_sm50.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma_sm50.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Matrix multiply.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="arch_2mma_8h_source.html">cutlass/arch/mma.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="complex_8h_source.html">cutlass/complex.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma_sm50.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm50_8h__incl.png" border="0" usemap="#mma__sm50_8h" alt=""/></div>
<map name="mma__sm50_8h" id="mma__sm50_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm50_8h__dep__incl.png" border="0" usemap="#mma__sm50_8hdep" alt=""/></div>
<map name="mma__sm50_8hdep" id="mma__sm50_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma__sm50_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
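<p>The specializations above map a single multiply-accumulate onto CUDA cores. A minimal host-side sketch of how the single-precision form might be invoked follows; the layout choices and the <code>d[0] = a[0] * b[0] + c[0]</code> semantics are assumptions for illustration rather than details stated on this page.</p>
<pre class="fragment">// Sketch: invoking the scalar (1x1x1) single-precision Mma specialization.
#include "cutlass/arch/mma_sm50.h"
#include "cutlass/array.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"

int main() {
  using Mma = cutlass::arch::Mma&lt;
      cutlass::gemm::GemmShape&lt;1, 1, 1&gt;, 1,
      float, cutlass::layout::ColumnMajor,
      float, cutlass::layout::RowMajor,
      float, cutlass::layout::ColumnMajor,
      cutlass::arch::OpMultiplyAdd&gt;;

  cutlass::Array&lt;float, 1&gt; a, b, c, d;
  a[0] = 2.0f; b[0] = 3.0f; c[0] = 1.0f;

  Mma mma;
  mma(d, a, b, c);                // one fused multiply-add per element
  return (d[0] == 7.0f) ? 0 : 1;  // expect 2 * 3 + 1 == 7
}</pre>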
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@ -0,0 +1 @@
988e6466c703c4e63c9a889b8c3c54b5


@ -0,0 +1 @@
03f1613fdffbd6e7575de0d2967d08bf

File diff suppressed because one or more lines are too long


@ -0,0 +1,157 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma_sm60.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma_sm60.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Matrix multiply.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &lt;cuda_fp16.h&gt;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma_8h_source.html">cutlass/arch/mma.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma_sm60.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm60_8h__incl.png" border="0" usemap="#mma__sm60_8h" alt=""/></div>
<map name="mma__sm60_8h" id="mma__sm60_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm60_8h__dep__incl.png" border="0" usemap="#mma__sm60_8hdep" alt=""/></div>
<map name="mma__sm60_8hdep" id="mma__sm60_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma__sm60_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 2, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
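<p>These SM60 specializations pack pairs of <code>half_t</code> values so the hardware can issue paired FP16 multiply-adds. A minimal sketch of the <code>GemmShape&lt;2,1,1&gt;</code> form is shown below; the fragment sizes (two A and C/D elements, one broadcast B element) are assumptions inferred from the tile shape, not stated on this page.</p>
<pre class="fragment">// Sketch: paired half-precision multiply-add for a 2x1x1 tile.
#include "cutlass/arch/mma_sm60.h"
#include "cutlass/array.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/numeric_types.h"

int main() {
  using half_t = cutlass::half_t;
  using Mma = cutlass::arch::Mma&lt;
      cutlass::gemm::GemmShape&lt;2, 1, 1&gt;, 1,
      half_t, cutlass::layout::ColumnMajor,
      half_t, cutlass::layout::RowMajor,
      half_t, cutlass::layout::ColumnMajor,
      cutlass::arch::OpMultiplyAdd&gt;;

  cutlass::Array&lt;half_t, 2&gt; a, c, d;  // two rows of A and of the accumulator
  cutlass::Array&lt;half_t, 1&gt; b;        // single B value broadcast to both rows
  a[0] = half_t(1.0f); a[1] = half_t(2.0f);
  b[0] = half_t(4.0f);
  c[0] = half_t(0.5f); c[1] = half_t(0.5f);

  Mma mma;
  mma(d, a, b, c);                    // d[i] = a[i] * b[0] + c[i]
  return (d[0] == half_t(4.5f) &amp;&amp; d[1] == half_t(8.5f)) ? 0 : 1;
}</pre>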
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@ -0,0 +1 @@
ba69b14e3936946092854211499ae9fa


@ -0,0 +1 @@
e820099c55f2397639bb210d76ec4c05

File diff suppressed because one or more lines are too long


@ -0,0 +1,149 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma_sm61.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma_sm61.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Matrix multiply.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma_sm61.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm61_8h__incl.png" border="0" usemap="#mma__sm61_8h" alt=""/></div>
<map name="mma__sm61_8h" id="mma__sm61_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm61_8h__dep__incl.png" border="0" usemap="#mma__sm61_8hdep" alt=""/></div>
<map name="mma__sm61_8hdep" id="mma__sm61_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma__sm61_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 4 &gt;, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 2 &gt;, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
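<p>The <code>GemmShape&lt;1,1,4&gt;</code> specialization above corresponds to a four-way <code>int8_t</code> dot product accumulated into a 32-bit integer (the DP4A instruction on SM61 and newer). A minimal sketch follows; it assumes the specialization also provides a plain scalar fallback so the same code compiles for host execution.</p>
<pre class="fragment">// Sketch: 4-element int8 dot product with int32 accumulation.
#include &lt;cstdint&gt;
#include "cutlass/arch/mma_sm61.h"
#include "cutlass/array.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"

int main() {
  using Mma = cutlass::arch::Mma&lt;
      cutlass::gemm::GemmShape&lt;1, 1, 4&gt;, 1,
      int8_t, cutlass::layout::RowMajor,
      int8_t, cutlass::layout::ColumnMajor,
      int, cutlass::layout::RowMajor,
      cutlass::arch::OpMultiplyAdd&gt;;

  cutlass::Array&lt;int8_t, 4&gt; a, b;
  cutlass::Array&lt;int, 1&gt; c, d;
  for (int k = 0; k &lt; 4; ++k) { a[k] = int8_t(k + 1); b[k] = int8_t(2); }
  c[0] = 10;

  Mma mma;
  mma(d, a, b, c);             // d[0] = sum_k a[k] * b[k] + c[0]
  return (d[0] == 30) ? 0 : 1; // (1+2+3+4) * 2 + 10 == 30
}</pre>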
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@ -0,0 +1 @@
1faaf1631d5f0e44d6cc6c7121e6972e


@ -0,0 +1 @@
8cce8aef2d98c4082d68734b538253c7

File diff suppressed because one or more lines are too long

147  docs/arch_8h.html  Normal file

@ -0,0 +1,147 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: arch.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Defines tags for architecture-specific configurations.
<a href="#details">More...</a></p>
<div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_8h__dep__incl.png" border="0" usemap="#arch_8hdep" alt=""/></div>
<map name="arch_8hdep" id="arch_8hdep">
</map>
</div>
</div>
<p><a href="arch_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm50.html">cutlass::arch::Sm50</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm60.html">cutlass::arch::Sm60</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm61.html">cutlass::arch::Sm61</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm70.html">cutlass::arch::Sm70</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm72.html">cutlass::arch::Sm72</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Sm75.html">cutlass::arch::Sm75</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
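<p>These tags carry no data; they are used purely as compile-time labels that let kernels and traits be specialized per GPU architecture. A small sketch is shown below using a hypothetical trait defined locally for illustration (only the <code>cutlass::arch::Sm*</code> tags themselves come from this header).</p>
<pre class="fragment">// Sketch: dispatching a locally defined, hypothetical trait on architecture tags.
#include "cutlass/arch/arch.h"

// Hypothetical trait, not part of CUTLASS: choose a K-tile depth per architecture.
template &lt;typename ArchTag&gt;
struct DefaultTileK { static int const value = 8; };

template &lt;&gt;
struct DefaultTileK&lt;cutlass::arch::Sm75&gt; { static int const value = 32; };

static_assert(DefaultTileK&lt;cutlass::arch::Sm50&gt;::value == 8,  "generic default");
static_assert(DefaultTileK&lt;cutlass::arch::Sm75&gt;::value == 32, "Turing override");

int main() { return 0; }</pre>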
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@ -0,0 +1 @@
9ea32ea41ab87776449ab855965480b3

117  docs/arch_8h_source.html  Normal file

File diff suppressed because one or more lines are too long

167  docs/array_8h.html  Normal file

@ -0,0 +1,167 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: array.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> &#124;
<a href="#func-members">Functions</a> </div>
<div class="headertitle">
<div class="title">array.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="numeric__types_8h_source.html">cutlass/numeric_types.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="array__subbyte_8h_source.html">cutlass/array_subbyte.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for array.h:</div>
<div class="dyncontent">
<div class="center"><img src="array_8h__incl.png" border="0" usemap="#array_8h" alt=""/></div>
<map name="array_8h" id="array_8h">
</map>
</div>
</div>
<p><a href="array_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1sizeof__bits_3_01Array_3_01T_00_01N_00_01RegisterSized_01_4_01_4.html">cutlass::sizeof_bits&lt; Array&lt; T, N, RegisterSized &gt; &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Statically sized array for any data type. <a href="structcutlass_1_1sizeof__bits_3_01Array_3_01T_00_01N_00_01RegisterSized_01_4_01_4.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4.html">cutlass::Array&lt; T, N, true &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Statically sized array for any data type. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1iterator.html">cutlass::Array&lt; T, N, true &gt;::iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, true &gt;::const_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional constant iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1reverse__iterator.html">cutlass::Array&lt; T, N, true &gt;::reverse_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1reverse__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__reverse__iterator.html">cutlass::Array&lt; T, N, true &gt;::const_reverse_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional constant iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__reverse__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1AlignedArray.html">cutlass::AlignedArray&lt; T, N, Alignment &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Aligned array type. <a href="classcutlass_1_1AlignedArray.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="func-members"></a>
Functions</h2></td></tr>
<tr class="memitem:a935aabfdc47cf03f87c67bb22533f97f"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> <a class="el" href="platform_8h.html#a72f0657181cca64b44eb186b707eb380">constexpr</a> bool&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html#a935aabfdc47cf03f87c67bb22533f97f">cutlass::ispow2</a> (unsigned x)</td></tr>
<tr class="memdesc:a935aabfdc47cf03f87c67bb22533f97f"><td class="mdescLeft">&#160;</td><td class="mdescRight">Returns true if the argument is a power of 2. <a href="namespacecutlass.html#a935aabfdc47cf03f87c67bb22533f97f">More...</a><br /></td></tr>
<tr class="separator:a935aabfdc47cf03f87c67bb22533f97f"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:ac16d8caf23537912eb02123c4bdacd14"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> <a class="el" href="platform_8h.html#a72f0657181cca64b44eb186b707eb380">constexpr</a> unsigned&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html#ac16d8caf23537912eb02123c4bdacd14">cutlass::floor_pow_2</a> (unsigned x)</td></tr>
<tr class="memdesc:ac16d8caf23537912eb02123c4bdacd14"><td class="mdescLeft">&#160;</td><td class="mdescRight">Returns the largest power of two not greater than the argument. <a href="namespacecutlass.html#ac16d8caf23537912eb02123c4bdacd14">More...</a><br /></td></tr>
<tr class="separator:ac16d8caf23537912eb02123c4bdacd14"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
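<p>A brief sketch of the container and helpers listed above; the expected results in the comments are assumptions following the documented semantics (<code>ispow2</code> tests for a power of two, <code>floor_pow_2</code> rounds down to one).</p>
<pre class="fragment">// Sketch: cutlass::Array plus the ispow2 / floor_pow_2 helpers.
#include "cutlass/array.h"

int main() {
  cutlass::Array&lt;float, 4&gt; frag;   // statically sized, register-friendly storage
  for (int i = 0; i &lt; 4; ++i) {
    frag[i] = float(i);
  }

  float sum = 0.0f;
  for (auto it = frag.begin(); it != frag.end(); ++it) {
    sum += *it;                    // bidirectional iterator from the listing above
  }

  bool is_pow2  = cutlass::ispow2(64u);        // expect true
  unsigned down = cutlass::floor_pow_2(100u);  // expect 64

  return (sum == 6.0f &amp;&amp; is_pow2 &amp;&amp; down == 64u) ? 0 : 1;
}</pre>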
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>

1  docs/array_8h__incl.md5  Normal file

@ -0,0 +1 @@
90c159bd7ad938ad2d6e263ea8402fe7

194  docs/array_8h_source.html  Normal file

File diff suppressed because one or more lines are too long

164  docs/array__subbyte_8h.html  Normal file

@ -0,0 +1,164 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: array_subbyte.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">array_subbyte.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="platform_8h_source.html">cutlass/platform/platform.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for array_subbyte.h:</div>
<div class="dyncontent">
<div class="center"><img src="array__subbyte_8h__incl.png" border="0" usemap="#array__subbyte_8h" alt=""/></div>
<map name="array__subbyte_8h" id="array__subbyte_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="array__subbyte_8h__dep__incl.png" border="0" usemap="#array__subbyte_8hdep" alt=""/></div>
<map name="array__subbyte_8hdep" id="array__subbyte_8hdep">
</map>
</div>
</div>
<p><a href="array__subbyte_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Statically sized array for any data type. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reference.html">cutlass::Array&lt; T, N, false &gt;::reference</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Reference object inserts or extracts sub-byte items. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reference.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Reference object extracts sub-byte items. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1iterator.html">cutlass::Array&lt; T, N, false &gt;::iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional constant iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reverse__iterator.html">cutlass::Array&lt; T, N, false &gt;::reverse_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reverse__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">class &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Bidirectional constant iterator over elements. <a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
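<p>For element types narrower than one byte, <code>cutlass::Array</code> selects the specialization above and accesses elements through the reference proxies listed here. A minimal sketch using 4-bit integers follows; the packing details and conversions are assumptions for illustration.</p>
<pre class="fragment">// Sketch: eight signed 4-bit integers packed into a 32-bit Array.
#include "cutlass/array.h"
#include "cutlass/numeric_types.h"

int main() {
  cutlass::Array&lt;cutlass::int4b_t, 8&gt; packed;  // sub-byte specialization is chosen automatically
  for (int i = 0; i &lt; 8; ++i) {
    packed[i] = cutlass::int4b_t(i - 4);       // stored through the reference proxy
  }

  cutlass::int4b_t third = packed[3];          // read back through the proxy; expect -1
  return (int(third) == -1) ? 0 : 1;
}</pre>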
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@ -0,0 +1 @@
7c0288c037b6ea169ec7a3aa1015a4d4


@ -0,0 +1 @@
36310516438810c2a8ba31a7816cd1de

File diff suppressed because one or more lines are too long


@ -0,0 +1,155 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: batched_reduction.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_ac488927e63b76ba9cb3ad9c317bbde9.html">reduction</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> &#124;
<a href="#func-members">Functions</a> </div>
<div class="headertitle">
<div class="title">batched_reduction.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &lt;cuda.h&gt;</code><br />
<code>#include &quot;<a class="el" href="coord_8h_source.html">cutlass/coord.h</a>&quot;</code><br />
<code>#include &quot;cutlass/util/platform.h&quot;</code><br />
<code>#include &quot;cutlass/fragment.h&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for batched_reduction.h:</div>
<div class="dyncontent">
<div class="center"><img src="batched__reduction_8h__incl.png" border="0" usemap="#batched__reduction_8h" alt=""/></div>
<map name="batched__reduction_8h" id="batched__reduction_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="batched__reduction_8h__dep__incl.png" border="0" usemap="#batched__reduction_8hdep" alt=""/></div>
<map name="batched__reduction_8hdep" id="batched__reduction_8hdep">
</map>
</div>
</div>
<p><a href="batched__reduction_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1reduction_1_1BatchedReduction.html">cutlass::reduction::BatchedReduction&lt; BatchedReductionTraits_ &gt;</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1reduction"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1reduction.html">cutlass::reduction</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="func-members"></a>
Functions</h2></td></tr>
<tr class="memitem:a9665e8f438a7b290d6e2eb640d93045f"><td class="memTemplParams" colspan="2">template&lt;typename batched_reduction_ &gt; </td></tr>
<tr class="memitem:a9665e8f438a7b290d6e2eb640d93045f"><td class="memTemplItemLeft" align="right" valign="top">__global__&#160;</td><td class="memTemplItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1reduction.html#a9665e8f438a7b290d6e2eb640d93045f">cutlass::reduction::__launch_bounds__</a> (batched_reduction_::Traits::kThreads, 1) void batched_reduction_kernel(typename batched_reduction_</td></tr>
<tr class="separator:a9665e8f438a7b290d6e2eb640d93045f"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
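
The summary above states the operation performed by the batched reduction kernel: D = alpha * Reduction(A) + beta * C. As a point of reference, a minimal host-side sketch of that computation in plain C++ follows, assuming the reduction sums batch_count slices of A elementwise; the function name, argument layout, and types are illustrative only and are not the CUTLASS kernel interface.

    #include <cstddef>
    #include <vector>

    // Host-side reference for D = alpha * Reduction(A) + beta * C, where
    // Reduction(A) sums batch_count slices of A elementwise. Illustrative only;
    // not the CUTLASS batched_reduction kernel interface.
    void batched_reduction_reference(std::vector<float> const &A,  // batch_count * n elements, batch-major
                                     std::vector<float> const &C,  // n elements
                                     std::vector<float> &D,        // n elements
                                     std::size_t batch_count,
                                     std::size_t n,
                                     float alpha,
                                     float beta) {
      for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (std::size_t b = 0; b < batch_count; ++b) {
          sum += A[b * n + i];   // reduce across the batch dimension
        }
        D[i] = alpha * sum + beta * C[i];
      }
    }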

View File

@ -0,0 +1 @@
2bce650f452329d669d303788cc619c8

View File

@ -0,0 +1 @@
d38876c9b9d3ade81fb457e3ebf5c6fd

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,144 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: batched_reduction_traits.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_ac488927e63b76ba9cb3ad9c317bbde9.html">reduction</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">batched_reduction_traits.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
<code>#include &quot;cutlass/shape.h&quot;</code><br />
<code>#include &quot;<a class="el" href="reduction_2threadblock__swizzle_8h_source.html">cutlass/reduction/threadblock_swizzle.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="batched__reduction_8h_source.html">cutlass/reduction/batched_reduction.h</a>&quot;</code><br />
<code>#include &quot;cutlass/gemm/linear_scaling.h&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for batched_reduction_traits.h:</div>
<div class="dyncontent">
<div class="center"><img src="batched__reduction__traits_8h__incl.png" border="0" usemap="#batched__reduction__traits_8h" alt=""/></div>
<map name="batched__reduction__traits_8h" id="batched__reduction__traits_8h">
</map>
</div>
</div>
<p><a href="batched__reduction__traits_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits.html">cutlass::reduction::BatchedReductionTraits&lt; ScalarA_, ScalarC_, ScalarD_, ScalarAlphaBeta_, ScalarAccum_, ReductionSize_, OutputTile_, SubTile_, ThreadShape_, Index_, BlockSwizzle_, maxInReg_, maxOutReg_, Functor_ &gt;</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits_1_1Params.html">cutlass::reduction::BatchedReductionTraits&lt; ScalarA_, ScalarC_, ScalarD_, ScalarAlphaBeta_, ScalarAccum_, ReductionSize_, OutputTile_, SubTile_, ThreadShape_, Index_, BlockSwizzle_, maxInReg_, maxOutReg_, Functor_ &gt;::Params</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1reduction"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1reduction.html">cutlass::reduction</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
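
The traits class documented above gathers the scalar types, tile shapes, block swizzle, and epilogue functor of the reduction into a single compile-time configuration, with a nested Params struct for runtime state. A generic sketch of this traits-plus-Params pattern is given below with simplified members; the names are illustrative and do not reproduce the actual BatchedReductionTraits template signature.

    // Generic traits-plus-Params sketch: compile-time configuration grouped in
    // one struct, runtime state in a nested Params. Member names are assumptions
    // for illustration, not the CUTLASS definitions.
    template <typename ScalarA_, typename ScalarC_, typename ScalarD_, int ReductionSize_, int Threads_>
    struct ReductionTraitsSketch {
      using ScalarA = ScalarA_;
      using ScalarC = ScalarC_;
      using ScalarD = ScalarD_;
      static int const kReductionSize = ReductionSize_;
      static int const kThreads = Threads_;

      // Runtime parameters (pointers and scalars) live in a nested Params struct.
      struct Params {
        ScalarA const *ptr_A;
        ScalarC const *ptr_C;
        ScalarD *ptr_D;
        ScalarD alpha;
        ScalarD beta;
      };
    };

    // A kernel or launcher is then written once against any such Traits, e.g.:
    //   template <typename Traits> void run(typename Traits::Params const &params);
    using ExampleTraits = ReductionTraitsSketch<float, float, float, /*ReductionSize*/ 8, /*Threads*/ 256>;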

View File

@ -0,0 +1 @@
957af6c3e40d98d122a3ef83474f7252

File diff suppressed because one or more lines are too long

BIN
docs/bc_s.png Normal file (binary image, 675 B)

BIN
docs/bdwn.png Normal file (binary image, 147 B)

View File

@ -0,0 +1,133 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: cutlass::AlignedArray&lt; T, N, Alignment &gt; Class Template Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1AlignedArray.html">AlignedArray</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">cutlass::AlignedArray&lt; T, N, Alignment &gt; Class Template Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Aligned array type.
</p>
<p><code>#include &lt;<a class="el" href="array_8h_source.html">array.h</a>&gt;</code></p>
<div class="dynheader">
Inheritance diagram for cutlass::AlignedArray&lt; T, N, Alignment &gt;:</div>
<div class="dyncontent">
<div class="center"><img src="classcutlass_1_1AlignedArray__inherit__graph.png" border="0" usemap="#cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_inherit__map" alt="Inheritance graph"/></div>
<map name="cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_inherit__map" id="cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_inherit__map">
</map>
<center><span class="legend">[<a href="graph_legend.html">legend</a>]</span></center></div>
<div class="dynheader">
Collaboration diagram for cutlass::AlignedArray&lt; T, N, Alignment &gt;:</div>
<div class="dyncontent">
<div class="center"><img src="classcutlass_1_1AlignedArray__coll__graph.png" border="0" usemap="#cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_coll__map" alt="Collaboration graph"/></div>
<map name="cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_coll__map" id="cutlass_1_1AlignedArray_3_01T_00_01N_00_01Alignment_01_4_coll__map">
</map>
<center><span class="legend">[<a href="graph_legend.html">legend</a>]</span></center></div>
<hr/>The documentation for this class was generated from the following file:<ul>
<li><a class="el" href="array_8h_source.html">array.h</a></li>
</ul>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
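
AlignedArray is documented above only as an aligned array type derived from Array. The sketch below illustrates, in generic C++, what over-aligning a small fixed-size array provides: the compiler can move the whole aggregate with a single wide (for example 128-bit) load or store. This is an illustration of the concept under stated assumptions, not the CUTLASS definition.

    #include <cstddef>

    // Generic illustration of an over-aligned fixed-size array; not the CUTLASS
    // definition. Over-aligning the aggregate permits wide vectorized accesses.
    template <typename T, std::size_t N, std::size_t Alignment>
    struct alignas(Alignment) AlignedArraySketch {
      T data[N];
    };

    // Four floats aligned to 16 bytes can be moved with one 128-bit load/store.
    using AlignedFloat4 = AlignedArraySketch<float, 4, 16>;
    static_assert(sizeof(AlignedFloat4) == 16, "expected packed 16-byte payload");
    static_assert(alignof(AlignedFloat4) == 16, "expected 16-byte alignment");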

View File

@ -0,0 +1 @@
5bfb78a70e6c0c4f1dba98d2cf455a30

View File

@ -0,0 +1 @@
5bfb78a70e6c0c4f1dba98d2cf455a30

View File

@ -0,0 +1,153 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Member List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt; Member List</div> </div>
</div><!--header-->
<div class="contents">
<p>This is the complete list of members for <a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a>, including all inherited members.</p>
<table class="directory">
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ac37d0c85dd6246ff7e08d12903f49c4d">Array</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a5d4667c3c9ebf3322ba94d43421e2577">Array</a>(Array const &amp;x)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a6268f2bbbdfc671cf7066ea0ee1bb46f">at</a>(size_type pos)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a0443a4af7c9594492bfb8a84bbd12a52">at</a>(size_type pos) const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a693677ee48012a4d013d55741d38764e">back</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a2c1665d0eff4c1788b0a5a3bfa3bc63e">back</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a6e9dbf4a486f07dc72dd5140a7628971">begin</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a86a56cc907c8566068034ef8294cf7c2">cbegin</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ae6106b72ee9035389afb313801561b16">cend</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a5b84c4dc5257f31108a0598915f03f94">clear</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a8a90423fc5483b3ee1d31f377321e9e0">const_pointer</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a01b9f76c6052dc2467095b91c1ebe34e">crbegin</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#abbc436f18649c1578ef95eb501872094">crend</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a1949c8a8c81dc2743328a56ff19fc933">data</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ab617ed6c9cc6336baf1030713d6dfbbb">data</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#af8dd11bf19216707ab3340b66833c9c9">difference_type</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a700940b7ec4aa2c10506b8109b58b709">Element</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a40829269d53d097b5b7bfce32e4afcc4">empty</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a80258d6b5e43ae529cd726f0d4292619">end</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a1c31d3673a48b2ed275bd56714fbcfbe">fill</a>(T const &amp;value)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#aa89dd0781c0a81421589182a5402df8b">front</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ab7ebd33505e48ab3beb6b551e8b762e5">front</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a56c28da772c3cf49799eeef4ee1eb981">kElements</a></td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a4a6f489743eb03c5c97fe6bb3ed2fa22">kElementsPerStoredItem</a></td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a6981c3aa259d3a1cc4818e29fa1d1423">kMask</a></td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a45932cad6b905c9ab72889c53112d529">kSizeBits</a></td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#afbe4f574d87e61bf18ac5b9f5a6ea8aa">kStorageElements</a></td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">static</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a8f982c95366ce4fda90e35281adfe63c">max_size</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#aeaeeb7bddb6824adc6feb5ab912d65dc">operator[]</a>(size_type pos)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a35db1c6ac0d42a486eb3a0a0eee95c80">operator[]</a>(size_type pos) const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a2a77712281a0ddbf880a4f6fb9aa2ea3">pointer</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a66e2465301e46afebf9e56c4060fb3cb">raw_data</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a16e55f7c4ae1700ae09c2bce137d06ae">raw_data</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a2098c88aed61f9b27bac37a083130336">rbegin</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a39c08a75c7cc22fcd296e6c9fefe754e">rend</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ae1b48e77c8381a8059a09a791d6b8d37">size</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a714f3275de8a7f9d14f8b04aed45988d">size_type</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#ac1a07d3bbf76e850a948c8efe864acdb">value_type</a> typedef</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">cutlass::Array&lt; T, N, false &gt;</a></td><td class="entry"></td></tr>
</table></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
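
Alongside the std::array-style members (at, data, fill, size, and the iterator accessors), this specialization exposes packing constants such as kSizeBits, kElementsPerStoredItem, kStorageElements, and kMask, which indicate that elements narrower than a byte are packed into wider storage words. The sketch below walks through that packing arithmetic for assumed 4-bit elements in 32-bit storage; the constants and layout are illustrative and may not match the exact CUTLASS implementation.

    #include <cassert>
    #include <cstdint>

    // Illustrative packing arithmetic for sub-byte elements, assuming 4-bit
    // elements stored in 32-bit words; not the exact CUTLASS implementation.
    int main() {
      constexpr int kSizeBits           = 4;                               // bits per element
      constexpr int kElements           = 16;                              // N
      constexpr int kBitsPerStorage     = 32;                              // bits in one Storage word
      constexpr int kElementsPerStorage = kBitsPerStorage / kSizeBits;     // 8 elements per word
      constexpr int kStorageElements    = kElements / kElementsPerStorage; // 2 words total
      constexpr uint32_t kMask          = (1u << kSizeBits) - 1u;          // 0xF selects one element

      uint32_t storage[kStorageElements] = {0, 0};

      // Write element i = i, then read it back through the shift-and-mask path.
      for (int i = 0; i < kElements; ++i) {
        int word = i / kElementsPerStorage;
        int bit  = (i % kElementsPerStorage) * kSizeBits;
        storage[word] |= (uint32_t(i) & kMask) << bit;
      }
      for (int i = 0; i < kElements; ++i) {
        int word = i / kElementsPerStorage;
        int bit  = (i % kElementsPerStorage) * kSizeBits;
        assert(int((storage[word] >> bit) & kMask) == i);
      }
      return 0;
    }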

File diff suppressed because it is too large

View File

@ -0,0 +1,123 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Member List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">const_iterator</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_iterator Member List</div> </div>
</div><!--header-->
<div class="contents">
<p>This is the complete list of members for <a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a>, including all inherited members.</p>
<table class="directory">
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a2baacc6de7180213621a2d6b2328ca7d">const_iterator</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a273a0ea9cf66fac0787e90339fd49371">const_iterator</a>(Storage const *ptr, int idx=0)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#ad8a6c87e370a53e7ff783ee4ad3d1198">operator!=</a>(iterator const &amp;other) const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a36aa6aa70a9536a7d2750d83d53f39f3">operator*</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#adcdcdf49b5d8e3ed801e2555c4f02b99">operator++</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a4094d6ae6bb6ade0f850ce96870bbc37">operator++</a>(int)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#aa2c9f9bb9601208bd784bdc821b62f3a">operator--</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a3eebbf306ba37383e98360c0aa882e34">operator--</a>(int)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a42dd93a0f0df4ec86de4880fa9cc5843">operator==</a>(iterator const &amp;other) const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
</table></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
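
The member list above (pre- and post-increment, decrement, dereference, equality, and inequality) is the standard bidirectional-iterator surface, so the iterator supports ordinary forward and reverse traversal. The sketch below shows that traversal pattern using std::list as a stand-in, since constructing the CUTLASS iterator goes through the owning Array; nothing CUTLASS-specific is assumed.

    #include <iostream>
    #include <list>

    // Generic bidirectional-iterator traversal; std::list stands in for any
    // container whose iterators expose the same ++/--/*/==/!= surface.
    int main() {
      std::list<int> values = {1, 2, 3, 4};

      for (auto it = values.cbegin(); it != values.cend(); ++it) {  // forward
        std::cout << *it << ' ';
      }
      std::cout << '\n';

      auto it = values.cend();
      while (it != values.cbegin()) {                               // backward
        --it;
        std::cout << *it << ' ';
      }
      std::cout << '\n';
      return 0;
    }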

View File

@ -0,0 +1,386 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: cutlass::Array&lt; T, N, false &gt;::const_iterator Class Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html">const_iterator</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#pub-methods">Public Member Functions</a> &#124;
<a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator-members.html">List of all members</a> </div>
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_iterator Class Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Bidirectional constant iterator over elements.
</p>
<p><code>#include &lt;<a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a>&gt;</code></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="pub-methods"></a>
Public Member Functions</h2></td></tr>
<tr class="memitem:a2baacc6de7180213621a2d6b2328ca7d"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a2baacc6de7180213621a2d6b2328ca7d">const_iterator</a> ()</td></tr>
<tr class="separator:a2baacc6de7180213621a2d6b2328ca7d"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a273a0ea9cf66fac0787e90339fd49371"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a273a0ea9cf66fac0787e90339fd49371">const_iterator</a> (<a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *ptr, int idx=0)</td></tr>
<tr class="separator:a273a0ea9cf66fac0787e90339fd49371"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:adcdcdf49b5d8e3ed801e2555c4f02b99"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator &amp;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#adcdcdf49b5d8e3ed801e2555c4f02b99">operator++</a> ()</td></tr>
<tr class="separator:adcdcdf49b5d8e3ed801e2555c4f02b99"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:aa2c9f9bb9601208bd784bdc821b62f3a"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator &amp;&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#aa2c9f9bb9601208bd784bdc821b62f3a">operator--</a> ()</td></tr>
<tr class="separator:aa2c9f9bb9601208bd784bdc821b62f3a"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a4094d6ae6bb6ade0f850ce96870bbc37"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a4094d6ae6bb6ade0f850ce96870bbc37">operator++</a> (int)</td></tr>
<tr class="separator:a4094d6ae6bb6ade0f850ce96870bbc37"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a3eebbf306ba37383e98360c0aa882e34"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a3eebbf306ba37383e98360c0aa882e34">operator--</a> (int)</td></tr>
<tr class="separator:a3eebbf306ba37383e98360c0aa882e34"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a36aa6aa70a9536a7d2750d83d53f39f3"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> const_reference&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a36aa6aa70a9536a7d2750d83d53f39f3">operator*</a> () const </td></tr>
<tr class="separator:a36aa6aa70a9536a7d2750d83d53f39f3"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a42dd93a0f0df4ec86de4880fa9cc5843"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> bool&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#a42dd93a0f0df4ec86de4880fa9cc5843">operator==</a> (iterator const &amp;other) const </td></tr>
<tr class="separator:a42dd93a0f0df4ec86de4880fa9cc5843"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:ad8a6c87e370a53e7ff783ee4ad3d1198"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> bool&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html#ad8a6c87e370a53e7ff783ee4ad3d1198">operator!=</a> (iterator const &amp;other) const </td></tr>
<tr class="separator:ad8a6c87e370a53e7ff783ee4ad3d1198"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
<h2 class="groupheader">Constructor &amp; Destructor Documentation</h2>
<a class="anchor" id="a2baacc6de7180213621a2d6b2328ca7d"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_iterator::const_iterator </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a273a0ea9cf66fac0787e90339fd49371"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_iterator::const_iterator </td>
<td>(</td>
<td class="paramtype"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *&#160;</td>
<td class="paramname"><em>ptr</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
<td class="paramtype">int&#160;</td>
<td class="paramname"><em>idx</em> = <code>0</code>&#160;</td>
</tr>
<tr>
<td></td>
<td>)</td>
<td></td><td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<h2 class="groupheader">Member Function Documentation</h2>
<a class="anchor" id="ad8a6c87e370a53e7ff783ee4ad3d1198"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> bool cutlass::Array&lt; T, N, false &gt;::const_iterator::operator!= </td>
<td>(</td>
<td class="paramtype">iterator const &amp;&#160;</td>
<td class="paramname"><em>other</em></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a36aa6aa70a9536a7d2750d83d53f39f3"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> const_reference cutlass::Array&lt; T, N, false &gt;::const_iterator::operator* </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="adcdcdf49b5d8e3ed801e2555c4f02b99"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator&amp; cutlass::Array&lt; T, N, false &gt;::const_iterator::operator++ </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a4094d6ae6bb6ade0f850ce96870bbc37"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator cutlass::Array&lt; T, N, false &gt;::const_iterator::operator++ </td>
<td>(</td>
<td class="paramtype">int&#160;</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="aa2c9f9bb9601208bd784bdc821b62f3a"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator&amp; cutlass::Array&lt; T, N, false &gt;::const_iterator::operator-- </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a3eebbf306ba37383e98360c0aa882e34"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> iterator cutlass::Array&lt; T, N, false &gt;::const_iterator::operator-- </td>
<td>(</td>
<td class="paramtype">int&#160;</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a42dd93a0f0df4ec86de4880fa9cc5843"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> bool cutlass::Array&lt; T, N, false &gt;::const_iterator::operator== </td>
<td>(</td>
<td class="paramtype">iterator const &amp;&#160;</td>
<td class="paramname"><em>other</em></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<hr/>The documentation for this class was generated from the following file:<ul>
<li><a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a></li>
</ul>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
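
The page above only lists signatures for const_iterator, so a short usage sketch may help. The following is a hypothetical example and is not part of the generated documentation: it assumes the enclosing cutlass::Array&lt;T, N, false&gt; specialization exposes const begin()/end() overloads returning const_iterator (an assumption; substitute whatever accessors your CUTLASS version actually provides), and it otherwise uses only the operators documented above (operator++, operator*, operator!=) plus const_reference's explicit cast to int.

// Hypothetical sketch, not taken from the CUTLASS sources.
// Assumption: a const Array<int4b_t, 16> exposes begin()/end() returning const_iterator.
#include "cutlass/array.h"
#include "cutlass/integer_subbyte.h"

CUTLASS_HOST_DEVICE
int sum_nibbles(cutlass::Array<cutlass::int4b_t, 16> const &frag) {
  int sum = 0;
  for (auto it = frag.begin(); it != frag.end(); ++it) {  // operator!= / operator++ as documented
    sum += static_cast<int>(*it);  // operator* yields a const_reference; explicit cast to int
  }
  return sum;
}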


@@ -0,0 +1,120 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Member List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">const_reference</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_reference Member List</div> </div>
</div><!--header-->
<div class="contents">
<p>This is the complete list of members for <a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a>, including all inherited members.</p>
<table class="directory">
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#abf1841f0ac863891efcf23bd5ac57847">const_reference</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#ac9e3b9e2f5797efbc47e3415aa204079">const_reference</a>(Storage const *ptr, int idx=0)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a37a90c6f1edcc3d7a916211aa7520cc1">get</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#afa022bf34a7086c43b5bd45b40c2b25f">operator float</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span><span class="mlabel">explicit</span></td></tr>
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a61648afeb4e15881fb001611c37df1ec">operator int</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span><span class="mlabel">explicit</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a7c5f7d59a22d89a7dd5c923d9bcebd97">operator T</a>() const </td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">cutlass::Array&lt; T, N, false &gt;::const_reference</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
</table></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@@ -0,0 +1,306 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: cutlass::Array&lt; T, N, false &gt;::const_reference Class Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html">const_reference</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#pub-methods">Public Member Functions</a> &#124;
<a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference-members.html">List of all members</a> </div>
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_reference Class Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Reference object that extracts sub-byte items.
</p>
<p><code>#include &lt;<a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a>&gt;</code></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="pub-methods"></a>
Public Member Functions</h2></td></tr>
<tr class="memitem:abf1841f0ac863891efcf23bd5ac57847"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#abf1841f0ac863891efcf23bd5ac57847">const_reference</a> ()</td></tr>
<tr class="memdesc:abf1841f0ac863891efcf23bd5ac57847"><td class="mdescLeft">&#160;</td><td class="mdescRight">Default ctor. <a href="#abf1841f0ac863891efcf23bd5ac57847">More...</a><br /></td></tr>
<tr class="separator:abf1841f0ac863891efcf23bd5ac57847"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:ac9e3b9e2f5797efbc47e3415aa204079"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#ac9e3b9e2f5797efbc47e3415aa204079">const_reference</a> (<a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *ptr, int idx=0)</td></tr>
<tr class="memdesc:ac9e3b9e2f5797efbc47e3415aa204079"><td class="mdescLeft">&#160;</td><td class="mdescRight">Ctor. <a href="#ac9e3b9e2f5797efbc47e3415aa204079">More...</a><br /></td></tr>
<tr class="separator:ac9e3b9e2f5797efbc47e3415aa204079"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a37a90c6f1edcc3d7a916211aa7520cc1"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> const T&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a37a90c6f1edcc3d7a916211aa7520cc1">get</a> () const </td></tr>
<tr class="separator:a37a90c6f1edcc3d7a916211aa7520cc1"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a7c5f7d59a22d89a7dd5c923d9bcebd97"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a7c5f7d59a22d89a7dd5c923d9bcebd97">operator T</a> () const </td></tr>
<tr class="memdesc:a7c5f7d59a22d89a7dd5c923d9bcebd97"><td class="mdescLeft">&#160;</td><td class="mdescRight">Extract. <a href="#a7c5f7d59a22d89a7dd5c923d9bcebd97">More...</a><br /></td></tr>
<tr class="separator:a7c5f7d59a22d89a7dd5c923d9bcebd97"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a61648afeb4e15881fb001611c37df1ec"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#a61648afeb4e15881fb001611c37df1ec">operator int</a> () const </td></tr>
<tr class="memdesc:a61648afeb4e15881fb001611c37df1ec"><td class="mdescLeft">&#160;</td><td class="mdescRight">Explicit cast to int. <a href="#a61648afeb4e15881fb001611c37df1ec">More...</a><br /></td></tr>
<tr class="separator:a61648afeb4e15881fb001611c37df1ec"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:afa022bf34a7086c43b5bd45b40c2b25f"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html#afa022bf34a7086c43b5bd45b40c2b25f">operator float</a> () const </td></tr>
<tr class="memdesc:afa022bf34a7086c43b5bd45b40c2b25f"><td class="mdescLeft">&#160;</td><td class="mdescRight">Explicit cast to float. <a href="#afa022bf34a7086c43b5bd45b40c2b25f">More...</a><br /></td></tr>
<tr class="separator:afa022bf34a7086c43b5bd45b40c2b25f"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
<h2 class="groupheader">Constructor &amp; Destructor Documentation</h2>
<a class="anchor" id="abf1841f0ac863891efcf23bd5ac57847"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reference::const_reference </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="ac9e3b9e2f5797efbc47e3415aa204079"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reference::const_reference </td>
<td>(</td>
<td class="paramtype"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *&#160;</td>
<td class="paramname"><em>ptr</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
<td class="paramtype">int&#160;</td>
<td class="paramname"><em>idx</em> = <code>0</code>&#160;</td>
</tr>
<tr>
<td></td>
<td>)</td>
<td></td><td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<h2 class="groupheader">Member Function Documentation</h2>
<a class="anchor" id="a37a90c6f1edcc3d7a916211aa7520cc1"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> const T cutlass::Array&lt; T, N, false &gt;::const_reference::get </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="afa022bf34a7086c43b5bd45b40c2b25f"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reference::operator float </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span><span class="mlabel">explicit</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a61648afeb4e15881fb001611c37df1ec"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reference::operator int </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span><span class="mlabel">explicit</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a7c5f7d59a22d89a7dd5c923d9bcebd97"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reference::operator T </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td> const</td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<hr/>The documentation for this class was generated from the following file:<ul>
<li><a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a></li>
</ul>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
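
const_reference exists because sub-byte elements are packed several to a Storage word, so the array cannot hand out a plain T const &amp;; instead, get(), the implicit cast to T, and the explicit casts to int and float documented above unpack the addressed element. A hypothetical sketch follows (not part of the generated documentation), assuming operator[] on a const cutlass::Array&lt;int4b_t, 8&gt; returns a const_reference (an assumption):

// Hypothetical sketch, not taken from the CUTLASS sources.
// Assumption: operator[] on a const Array<int4b_t, 8> returns const_reference.
#include "cutlass/array.h"
#include "cutlass/integer_subbyte.h"

CUTLASS_HOST_DEVICE
float average_nibbles(cutlass::Array<cutlass::int4b_t, 8> const &frag) {
  float acc = 0.0f;
  for (int i = 0; i < 8; ++i) {
    acc += float(frag[i]);           // explicit operator float unpacks element i
  }
  cutlass::int4b_t first = frag[0];  // implicit operator T recovers the element type
  (void)first;
  return acc / 8.0f;
}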


@@ -0,0 +1,116 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Member List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">const_reverse_iterator</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator Member List</div> </div>
</div><!--header-->
<div class="contents">
<p>This is the complete list of members for <a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator</a>, including all inherited members.</p>
<table class="directory">
<tr class="even"><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html#aae7705a26ea52ebd18d5f5809d816ee2">const_reverse_iterator</a>()</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
<tr><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html#a4bef88847b70f6bca81dd46bd883373b">const_reverse_iterator</a>(Storage const *ptr, int idx=0)</td><td class="entry"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator</a></td><td class="entry"><span class="mlabel">inline</span></td></tr>
</table></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>


@@ -0,0 +1,192 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator Class Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="namespacecutlass.html">cutlass</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html">Array&lt; T, N, false &gt;</a></li><li class="navelem"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html">const_reverse_iterator</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#pub-methods">Public Member Functions</a> &#124;
<a href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator-members.html">List of all members</a> </div>
<div class="headertitle">
<div class="title">cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator Class Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Bidirectional constant reverse iterator over elements.
</p>
<p><code>#include &lt;<a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a>&gt;</code></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="pub-methods"></a>
Public Member Functions</h2></td></tr>
<tr class="memitem:aae7705a26ea52ebd18d5f5809d816ee2"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html#aae7705a26ea52ebd18d5f5809d816ee2">const_reverse_iterator</a> ()</td></tr>
<tr class="separator:aae7705a26ea52ebd18d5f5809d816ee2"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:a4bef88847b70f6bca81dd46bd883373b"><td class="memItemLeft" align="right" valign="top"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a>&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html#a4bef88847b70f6bca81dd46bd883373b">const_reverse_iterator</a> (<a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *ptr, int idx=0)</td></tr>
<tr class="separator:a4bef88847b70f6bca81dd46bd883373b"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
<h2 class="groupheader">Constructor &amp; Destructor Documentation</h2>
<a class="anchor" id="aae7705a26ea52ebd18d5f5809d816ee2"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator::const_reverse_iterator </td>
<td>(</td>
<td class="paramname"></td><td>)</td>
<td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<a class="anchor" id="a4bef88847b70f6bca81dd46bd883373b"></a>
<div class="memitem">
<div class="memproto">
<div class="memtemplate">
template&lt;typename T , int N&gt; </div>
<table class="mlabels">
<tr>
<td class="mlabels-left">
<table class="memname">
<tr>
<td class="memname"><a class="el" href="cutlass_8h.html#a28c2443a142676d3d71effdae1a986b1">CUTLASS_HOST_DEVICE</a> cutlass::Array&lt; T, N, false &gt;::const_reverse_iterator::const_reverse_iterator </td>
<td>(</td>
<td class="paramtype"><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html#a878e152905d602bcdb98e0e6acd8bd82">Storage</a> const *&#160;</td>
<td class="paramname"><em>ptr</em>, </td>
</tr>
<tr>
<td class="paramkey"></td>
<td></td>
<td class="paramtype">int&#160;</td>
<td class="paramname"><em>idx</em> = <code>0</code>&#160;</td>
</tr>
<tr>
<td></td>
<td>)</td>
<td></td><td></td>
</tr>
</table>
</td>
<td class="mlabels-right">
<span class="mlabels"><span class="mlabel">inline</span></span> </td>
</tr>
</table>
</div><div class="memdoc">
</div>
</div>
<hr/>The documentation for this class was generated from the following file:<ul>
<li><a class="el" href="array__subbyte_8h_source.html">array_subbyte.h</a></li>
</ul>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
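
Only the two constructors are documented for const_reverse_iterator above. The sketch below is purely illustrative and hypothetical: it assumes the enclosing array exposes its packed words through a raw_data() const accessor (an assumption), so that the reverse iterator can be positioned one past the last logical element using the documented (Storage const *ptr, int idx = 0) constructor.

// Hypothetical sketch, not taken from the CUTLASS sources.
// Assumption: Array<int4b_t, 16> provides raw_data() const returning Storage const *.
#include "cutlass/array.h"
#include "cutlass/integer_subbyte.h"

CUTLASS_HOST_DEVICE
void position_reverse_iterator(cutlass::Array<cutlass::int4b_t, 16> const &frag) {
  using ArrayType = cutlass::Array<cutlass::int4b_t, 16>;
  // Uses the documented (Storage const *ptr, int idx = 0) constructor to start
  // at index 16, i.e. one past the last of the 16 packed int4b_t elements.
  ArrayType::const_reverse_iterator rit(frag.raw_data(), 16);
  (void)rit;
}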

Some files were not shown because too many files have changed in this diff.